Robots.txt - Robots Exclusion Protocol
Robots.txt is a text file that contains instructions for web robots. It allows webmasters to control which parts of a website web robots may access.
Coc Coc's robots support the robots exclusion standard. This is the same standard adopted by most search engines, though individual search engines may respond to the standard's directives in slightly different ways. This article describes how Coc Coc's robots interpret robots.txt files.
If you want to use the robots exclusion standard for your site:
1. Create a text file with the relevant directives described below.
2. Name it robots.txt.
3. Upload it to your site's root directory.
Coc Coc's robots request robots.txt files from sites regularly. Before requesting any other URLs from a site, the robot requests the site's robots.txt file using a GET request via either HTTP or HTTPS. Redirects up to 5 hops are supported for this request. If the robot is unable to receive any response to this request, the site is treated as not available and excluded from crawling for a period of time. If the robot receives any response other than 200 OK, it assumes that it has unrestricted access to all documents on the site. If the response is 200 OK, the robot analyzes the returned content, extracts directives from it, and uses those directives until the robot's next request to the robots.txt file.
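The response handling described above can be sketched as a small decision function. This is a simplified illustration, not Coc Coc's actual implementation; the return labels are made up for the example.

```python
def robots_txt_policy(status):
    """Decide crawl policy from the robots.txt fetch result.

    status is the HTTP status code of the robots.txt request,
    or None if no response was received at all.
    """
    if status is None:
        # No response: the site is treated as unavailable and
        # excluded from crawling for a period of time.
        return "site-unavailable"
    if status == 200:
        # 200 OK: parse the body and apply its directives.
        return "parse-directives"
    # Any other response: unrestricted access to all documents.
    return "allow-all"
```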
Directives
User-agent
Every Coc Coc robot has its own name. You can find information about all of our robots here. You can use those names in the User-agent directive to write instructions for a particular robot. Every Coc Coc robot tries to find the User-agent directive that most closely matches its name. All less specific matches are ignored, for example:
# No robots are instructed to not download any documents from '/cgi-bin'.
Disallow: /cgi-bin

# All robots, including all of Coc Coc's robots, are instructed
# to not download any documents from '/cgi-bin'.
User-agent: *
Disallow: /cgi-bin

# All of Coc Coc's robots are instructed to not download any documents from '/cgi-bin'.
# All other robots are still allowed to download all documents from the site.
User-agent: *
Allow: /

User-agent: coccocbot
Disallow: /cgi-bin

# coccocbot-web and coccocbot-image are instructed to not download any documents from '/ajax'.
# All of Coc Coc's other robots are instructed to not download any documents from '/cgi-bin'.
# All other robots are still allowed to download all documents from the site.
User-agent: *
Allow: /

User-agent: coccocbot
Disallow: /cgi-bin

User-agent: coccocbot-web
User-agent: coccocbot-image
Disallow: /ajax
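The "most closely matching User-agent" behavior can be sketched as below. Interpreting "most closely" as the longest User-agent token that is a prefix of the robot's name is an assumption made for illustration; the exact matching rule may differ.

```python
def pick_group(groups, robot_name):
    """Select the rule group whose User-agent token most closely
    matches the robot's name; fall back to '*' if none does.

    groups maps a User-agent token to its list of rules.
    'Most closely' is interpreted here as the longest token that is
    a prefix of the robot name (an assumption for illustration).
    """
    name = robot_name.lower()
    best = None
    for token in groups:
        t = token.lower()
        if t != "*" and name.startswith(t):
            if best is None or len(t) > len(best):
                best = token
    if best is not None:
        return groups[best]
    # No specific match: fall back to the '*' group, if any.
    return groups.get("*", [])
```

For example, coccocbot-image picks its own group over a coccocbot group, and coccocbot-web (with no group of its own) falls back to coccocbot rather than to '*'.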
Note that you can use comments in your robots.txt file: all characters from the first # in a line up to the end of the line are ignored by robots. Empty lines in the file are ignored as well.
You can mention the same user agent multiple times. In this case, all instructions for that robot are used together, for example:
# All of Coc Coc's robots are instructed to not download any documents from /cgi-bin and /ajax.
# All other robots are still allowed to download all documents from the site.
User-agent: coccocbot
Disallow: /cgi-bin

User-agent: *
Allow: /

User-agent: coccocbot
Disallow: /ajax
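A minimal parser sketch showing how repeated User-agent sections merge. Comment stripping and blank-line handling follow the rules above; other directives (Sitemap, Crawl-delay) and error handling are omitted from this sketch.

```python
from collections import defaultdict

def group_rules(text):
    """Group Allow/Disallow rules by User-agent token.

    Repeated sections for the same token are merged; '#' comments
    and blank lines are ignored.
    """
    groups = defaultdict(list)
    current_tokens = []       # User-agent tokens of the section being read
    section_has_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments
        if not line or ":" not in line:
            continue                          # skip empty lines
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if section_has_rules:
                current_tokens = []           # a new section starts
                section_has_rules = False
            current_tokens.append(value)
        elif field in ("allow", "disallow"):
            section_has_rules = True
            for token in current_tokens:
                groups[token].append((field, value))
    return dict(groups)
```

Run against the example above, this yields one merged coccocbot group containing both Disallow rules.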
Disallow and Allow
If you want to instruct robots to not access your site or certain sections of it, use the Disallow directive. For example:
# Disallow access to the whole site for all robots
User-agent: *
Disallow: /

# Disallow access to pages starting with '/cgi-bin' for coccocbot-image
User-agent: coccocbot-image
Disallow: /cgi-bin
To allow robots to access your site or its parts, use the Allow directive. For example:
# Disallow access of all Coc Coc's robots to all pages of the site
# except URLs which start with '/docs'
User-agent: coccocbot
Disallow: /
Allow: /docs
An empty Disallow directive allows robots to download all pages of the site. An empty Allow directive is ignored.
# Empty Disallow directive
Disallow:

# Empty Allow directive
Allow:
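One simple way to model the empty-directive rules is to normalize each rule before matching. Treating an empty Disallow as equivalent to 'Allow: /' is an interpretation made for this sketch, not something the standard mandates.

```python
def normalize_rule(directive, value):
    """Apply the empty-directive rules described above.

    An empty Disallow permits everything, so it is modeled here as
    'Allow: /'; an empty Allow is dropped (returns None).
    """
    d, v = directive.strip().lower(), value.strip()
    if d == "disallow" and not v:
        return ("allow", "/")
    if d == "allow" and not v:
        return None
    return (d, v)
```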
Using directives jointly
If there are multiple directives which can be applied to a URL, the most specific directive is used.
# Disallow access of all Coc Coc's robots to pages starting with '/cats',
# but allow access to pages starting with '/cats/wild',
# except those pages which start with '/cats/wild/tigers'
User-agent: coccocbot
Disallow: /cats
Allow: /cats/wild
Disallow: /cats/wild/tigers
If two directives (Allow and Disallow) are equally specific, the Allow directive takes precedence.
# Allow access of all Coc Coc's robots to pages starting with '/dogs/naughty'
# despite the presence of the Disallow directive
User-agent: coccocbot
Disallow: /dogs/naughty
Allow: /dogs/naughty
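The "most specific rule wins, Allow breaks ties" logic can be sketched as follows. This sketch uses plain prefix matching and measures specificity by prefix length; wildcard rules (covered in the next section) are not handled here.

```python
def is_allowed(rules, path):
    """Apply the most specific matching rule to a URL path.

    rules is a list of (directive, prefix) pairs in file order,
    e.g. ("Disallow", "/cats"). Specificity is prefix length; on a
    tie, Allow wins. With no matching rule, access is allowed.
    """
    best = None  # ((prefix_length, is_allow), is_allow)
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            allow = directive.lower() == "allow"
            key = (len(prefix), allow)  # Allow sorts above Disallow on a tie
            if best is None or key > best[0]:
                best = (key, allow)
    return True if best is None else best[1]
```

With the cats example above, '/cats/persian' is blocked, '/cats/wild/lions' is allowed, and '/cats/wild/tigers/t1' is blocked again.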
Special symbols * and $
The asterisk (*) in Allow and Disallow directives matches any sequence of characters. Note that, by default, every Allow/Disallow directive implies a trailing *. To cancel this behavior, add a dollar sign ($) to the end of the rule.
# Disallow access to all URLs containing 'private' in their paths
User-agent: coccocbot
Disallow: /*private

# Disallow access to all URLs ending with '.ajax'
User-agent: coccocbot
Disallow: /*.ajax$
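One common way to implement these patterns is to compile each rule into a regular expression, as sketched below. Subtleties such as percent-encoding are ignored in this sketch.

```python
import re

def rule_to_regex(rule):
    """Compile an Allow/Disallow path rule into a regex.

    '*' matches any sequence of characters; a trailing '$' anchors
    the rule at the end of the URL, otherwise a trailing '*' is
    implied.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.compile(pattern + ("$" if anchored else ""))

def rule_matches(rule, path):
    # re.match anchors at the start of the path, as robots.txt
    # rules are matched from the beginning of the URL path.
    return rule_to_regex(rule).match(path) is not None
```

So '/*.ajax$' matches '/foo.ajax' but not '/foo.ajax.html', while a plain '/cats' still matches '/cats/wild' because of the implied trailing *.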
Sitemap directive
You can add the Sitemap directive to instruct our robots to use sitemap files. The Sitemap directive is independent of the User-agent directives. Multiple Sitemap directives are allowed.
Sitemap: http://site.vn/sitemaps1.xml
Sitemap: http://site.vn/sitemaps2.xml
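Because Sitemap is independent of User-agent sections, a parser can simply collect every occurrence in the file, as in this small sketch:

```python
def extract_sitemaps(text):
    """Collect all Sitemap URLs from a robots.txt body.

    Sitemap lines are independent of User-agent sections, so every
    occurrence in the file is collected; '#' comments are stripped.
    """
    urls = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            # Split only at the first ':' so the URL's own
            # 'http://' is left intact.
            urls.append(line.partition(":")[2].strip())
    return urls
```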
Crawl-delay directive
If you want to lower the rate at which Coc Coc's robots visit your site, you can use the Crawl-delay directive. Coc Coc's robots interpret the Crawl-delay value as an integer number of seconds the robot must wait between two consecutive requests. Please note that our robots don't support crawl delays greater than 10 seconds, so a crawl delay of 100 seconds is treated as a 10-second crawl delay. The Crawl-delay directive is User-agent specific, so add it to the User-agent section which is used by our robot.
# For all robots supporting the Crawl-delay directive
User-agent: *
Crawl-delay: 10

# Crawl delay is 10 seconds for all robots except Coc Coc's;
# for all of Coc Coc's robots, the crawl delay is 5 seconds
User-agent: *
Crawl-delay: 10

User-agent: coccocbot
Crawl-delay: 5
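The clamping behavior can be sketched as below. How unparseable values are treated is an assumption in this sketch (they are ignored by returning None); the document only specifies the 10-second cap for integer values.

```python
def effective_crawl_delay(value):
    """Clamp a Crawl-delay value to the supported range.

    The value is read as an integer number of seconds and capped
    at 10, matching the behavior described above.
    """
    try:
        delay = int(value)
    except (TypeError, ValueError):
        return None  # unparseable value: ignored (assumption)
    return max(0, min(delay, 10))
```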