Robots Exclusion Standard
Also called: Robots Exclusion Protocol or robots.txt
The Robots Exclusion Protocol gives website owners several ways to instruct search engines which pages within a website may and may not be crawled or indexed. These options are useful when certain pages should not appear in search results.
The "robots" are usually the spiders of search engines: programs that continuously crawl the Web for new information. There are also web robots for other purposes. Whether the instructions are followed depends on the particular robot, so the protocol offers no guarantees. The crawlers of most major search engines (such as Google and Bing) respect these instructions.
By contrast, the XML Sitemap is a protocol for the inclusion of pages in search engines.
Robots.txt
Robots.txt is a file that is stored in the root directory of a domain (domainname.co.uk/robots.txt) and tells search engines which locations within the website they may or may not query.
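Because the file always sits at the root of the domain, its address can be derived from any page address on the same site. The following is a minimal sketch in Python; the helper name robots_txt_url and the sample page address are illustrative assumptions, not part of the standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host, and replaces the
    path, query string and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The sample page address below is made up for illustration.
print(robots_txt_url("https://domainname.co.uk/shop/item?id=3"))
# https://domainname.co.uk/robots.txt
```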
Example of a robots.txt file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
In this example, all robots are instructed not to crawl locations within the /cgi-bin/ and /admin/ directories. This example immediately highlights a disadvantage of robots.txt: because the file is publicly readable, it can draw attention to the very locations we would rather keep out of sight.
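How a well-behaved robot might interpret the example above can be sketched with Python's standard-library module urllib.robotparser. The robot name "ExampleBot" and the page addresses are assumptions chosen for illustration; the rules are parsed from a string here rather than downloaded.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the text, held in a string instead of being fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up robot name; it falls under the "*" group.
print(parser.can_fetch("ExampleBot", "https://domainname.co.uk/admin/login"))  # False
print(parser.can_fetch("ExampleBot", "https://domainname.co.uk/products/"))    # True
```

The check is purely voluntary: nothing stops a robot from requesting /admin/login without consulting the file first.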
Multiple groups and rules can be placed underneath each other. For example, a section that applies only to Google's crawlers starts with the line "User-agent: googlebot". The directive "Allow:" can also be used to create exceptions: locations that may be accessed even though a broader rule disallows them.
Robots.txt only indicates which locations spiders should not query. In theory, a search engine can still include such a location in its search results; it simply has no knowledge of the content of the page.
Another way to influence the behavior of spiders is a special meta tag for robots. This HTML tag does not prevent spiders from retrieving the content of a page, but it gives more control over what happens to the location and content afterwards.
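A robot-specific section combined with an "Allow:" exception might look like the sketch below, again checked with Python's urllib.robotparser. The /admin/help/ path and the robot name "OtherBot" are hypothetical. Python's parser applies the first rule that matches, in file order, so the Allow line is placed before the broader Disallow line here; real crawlers may resolve such conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one section for Google's crawler, one for all other robots.
ROBOTS_TXT_GROUPS = """\
User-agent: googlebot
Allow: /admin/help/
Disallow: /admin/

User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT_GROUPS.splitlines())

site = "https://domainname.co.uk"

# The googlebot section grants an exception inside /admin/.
print(rules.can_fetch("Googlebot", site + "/admin/help/faq.html"))  # True
print(rules.can_fetch("Googlebot", site + "/admin/users"))          # False

# Other robots fall under the "*" section, which has no such exception.
print(rules.can_fetch("OtherBot", site + "/admin/help/faq.html"))   # False

# A robot with its own section ignores the "*" section entirely.
print(rules.can_fetch("Googlebot", site + "/cgi-bin/search"))       # True
```

The last line shows a common pitfall: rules from the general section are not inherited by a robot-specific section, so they have to be repeated there if they should also apply.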
Example of a robots meta tag:
<meta name="robots" content="noindex,nofollow" />
This example specifies that the page may not be included in search results and that hyperlinks on the page may not be followed. The counterparts of "noindex" and "nofollow" are "index" (do include) and "follow" (do follow links).
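How a robot might read such a tag can be sketched with Python's standard-library html.parser module. The class name RobotsMetaParser and the function robots_directives are illustrative assumptions; a real crawler would handle more cases, such as robot-specific meta names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated; compare them case-insensitively.
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,nofollow" /></head></html>'
found = robots_directives(page)
print("noindex" in found)   # True
print("nofollow" in found)  # True
```

A page without the tag yields an empty set, which a robot would normally treat the same as "index" and "follow".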