What is robots.txt?

Robots.txt is a small but very important file used especially by developers and online marketers. With a robots.txt you tell search engines which pages of a website may or may not be crawled. It may sound a bit technical, but the way it works is actually quite simple. In this blog I'll explain to you what robots.txt is, how it works, and why it's important for your website as well!
A robots.txt file gives instructions to search engine bots, such as Google's. These bots crawl your website to determine which pages to display in search results. Through the robots.txt, you give the bots specific guidelines. Think of the robots.txt file as the host of your website: a host who knows exactly where everyone is allowed to be and tells the bots which pages they may or may not visit (crawl).
The file itself is a text file located in the root of your website. You can find it for any site by putting /robots.txt after the domain (for example: onlinemarketingagency.co.uk/robots.txt). Note that a website does not have to have a robots.txt file; if there isn't one, you won't find anything at that URL. The file contains rules that tell bots where they can and cannot go. The following instructions are commonly found in a robots.txt file: User-agent (which bot the rules apply to), Disallow (what may not be crawled), Allow (an exception to a Disallow) and Sitemap (the location of your XML sitemap).
The example of OMA:
Note: there are more examples at the bottom of this blog
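As a generic illustration of these instructions (not OMA's actual file; the paths and domain are placeholders), a basic robots.txt could look like this:
# Applies to all bots
User-agent: *
# Do not crawl internal search result pages
Disallow: /search/
# Except for this subfolder
Allow: /search/help/
# Location of the XML sitemap
Sitemap: https://www.example.com/sitemap.xml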
When using robots.txt files on subdomains, it is important to note that each subdomain requires its own file. The rules on www.domein.nl/robots.txt, for example, do not apply to a subdomain such as blog.domain.co.uk. Make sure the settings for each subdomain match what you want to crawl or block. For a testing or staging environment you can use a disallow to block search engines, but for more security it is better to use IP whitelisting or password protection, because robots.txt is not a security measure.
Check in Google Search Console that subdomains are crawled correctly and add separate properties for monitoring. In addition, use a separate sitemap per subdomain, list it in the robots.txt, and test the settings regularly.
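As a sketch, assuming a main site on www.domein.nl and a blog on blog.domein.nl (placeholder domains), each subdomain gets its own file with its own rules and sitemap:
# https://www.domein.nl/robots.txt
User-agent: *
Disallow: /admin/
Sitemap: https://www.domein.nl/sitemap.xml

# https://blog.domein.nl/robots.txt
User-agent: *
Disallow: /drafts/
Sitemap: https://blog.domein.nl/sitemap.xml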
A properly set robots.txt file helps you maintain control over what is and is not ultimately crawled. Because of the crawl budget, it is important to give bots guidelines on which pages should or should not be crawled. Let search engines focus on the most important pages of your website that may [read: should] be indexed.
Suppose you have a page that is important for customers, but not for search engines. Then we are talking about account pages, for example from web shops. With robots.txt you can easily indicate that these pages should not be crawled. The advantage? Users don't notice anything and your crawl budget isn't wasted.
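A sketch of how that could look for a web shop (the paths are illustrative, yours may differ):
User-agent: *
# Account, cart and checkout pages are useful for customers, not for search engines
Disallow: /account/
Disallow: /cart/
Disallow: /checkout/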
Furthermore, a good robots.txt file helps manage your crawl budget. After all, search engines only have a limited number of pages they crawl per day. By excluding unimportant pages, you ensure that the bots focus on the content that really matters. A properly configured robots.txt makes more efficient use of the crawl budget and can improve the findability of your most important pages.
Note that a misconfiguration, such as inadvertently blocking valuable content from crawling, can lead to reduced indexing and thus a negative impact on your SEO.
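A classic example of such a misconfiguration is a rule that is one character too broad; this single line blocks the entire website for every bot:
User-agent: *
# Blocks every URL on the site, including the pages you want indexed
Disallow: /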
First things first: never block your important pages from crawling! It is important for the final performance of your website that you do not block pages that you want to appear in search results. Therefore, always check your robots.txt file carefully. In addition, use Google's test tool in Google Search Console (GSC). Within GSC, you will find a robots.txt tester that you can use to see whether your file is working properly. You can find it as follows:
Google Search Console: settings > robots.txt
Click on the three dots next to the robots.txt file where you made the changes and then on the (only) option "Request a new crawl".
Be careful about excluding important information through the robots.txt. Although you can use robots.txt to block pages from bots, they are still accessible if someone enters the URL directly. If someone opens your robots.txt file, he or she can see which pages have a disallow. So never use robots.txt for sensitive information; use an .htaccess lock (server-side password protection) instead.
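As a rough sketch of such a lock on an Apache server (the folder, file path and setup are assumptions for illustration, not a prescribed configuration):
# .htaccess in the folder you want to protect
AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user
The matching .htpasswd file contains the allowed user names and password hashes; bots and visitors without credentials then get a 401 instead of the page.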
Note that robots.txt is case sensitive: the file name itself must be entirely in lower case (robots.txt), and the paths in your rules must match the exact casing of the URLs on your website.
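A hypothetical example of what that means for the paths in your rules:
User-agent: *
# Matches /admin/ but NOT /Admin/ or /ADMIN/
Disallow: /admin/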
All in all, the robots.txt file is a simple but effective way to control search engine bots. By excluding the right pages from crawling, you can focus on the content that really matters for SEO. Have you already set up the file properly? If not, it's a simple step you can take right away to improve your SEO!
To get you started, I've listed some examples that you can adopt for your own robots.txt file.
Prevent search engines from crawling a particular folder on your website, such as an admin dashboard:
User-agent: *
Disallow: /admin/
Block a specific search engine, such as Bing, while allowing others:
User-agent: bingbot
Disallow: /
If you have blocked the entire site, but want to make one folder accessible:
User-agent: *
Disallow: /
Allow: /public/
Avoid having search engines crawl certain file types, such as PDF files:
User-agent: *
Disallow: /*.pdf$
Inform search engines of the location of your sitemap:
User-agent: *
Sitemap: https://www.domein.nl/sitemap.xml
Prevent a specific page from being crawled by search engines:
User-agent: *
Disallow: /privacy-policy.html
Block pages with certain URL parameters, for example filters or search results:
User-agent: *
Disallow: /*?filter=
Allow only Googlebot and block all other bots:
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Block multiple subfolders on a website:
User-agent: *
Disallow: /private/
Disallow: /temp/
Disallow: /backup/
Set a delay for search engines to avoid overloading your server:
User-agent: *
Crawl-delay: 10
Note: not all search engines support a crawl delay; Google, for example, ignores the Crawl-delay directive.
And what would a robots.txt file look like if the above lines were combined? Fair question! Below you get an idea of how such a file is structured.
Important: This file contains a mix of general and specific rules. Always check that the configuration matches your goals and technical setup.
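A sketch of how such a combined file could look, using a selection of the rules above (domains and folder names are placeholders, not a recommendation for your site):
# Block Bing entirely
User-agent: bingbot
Disallow: /

# Rules for all other bots
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Disallow: /backup/
Disallow: /*.pdf$
Disallow: /*?filter=
Crawl-delay: 10

# Location of the sitemap
Sitemap: https://www.domein.nl/sitemap.xml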
Want to see a sample robots.txt file? Then check out the one from The New York Times!
What is a robots.txt file?
A file that tells search engines which pages may or may not be crawled.
What should you watch out for?
Don't block valuable pages, don't use robots.txt for sensitive info, and check regularly via Google Search Console.
How do you test a robots.txt file?
You can test the file with the robots.txt tester in Google Search Console or by opening the URL in a browser and checking that the lines display correctly.
Do all bots follow the rules?
No. Some bots, such as "malicious" crawlers, can ignore the rules in robots.txt and still crawl your website.
Does robots.txt also prevent indexing?
No, robots.txt only prevents crawling. A blocked page can still appear in search results if other websites link to it; use a noindex tag if you want to keep a page out of the index.
Written by: Giuliano Koelewijn
This is Giuliano. A beast in the gym. But a teddy bear in the office. Now, as an online marketer, Giuliano helps get your marketing funnels back in shape.