The robots.txt file is a simple yet powerful SEO tool. It lets you control how search engines (Google, Bing) and other bots crawl your site. You can get the most out of your crawl budget, keep bots away from duplicate content, and avoid burdening your server unnecessarily. All of that is possible, provided you use robots.txt correctly.
A good understanding of robots.txt should not be underestimated. The logic seems simple, and therein lies the danger. I regularly come across implementations that raise my eyebrows, from unnecessary repetition and wrong syntax to rules that simply don’t work. And most people don’t even know it. In this article you will find a compilation of these mistakes and what you can do about them, divided into 2 important SEO rules.
Rule 1: only the most specific User-agent counts
To explain this I am going to quote two examples from tiktok.com and twitter.com. These are not exactly small websites. In fact, these are the 15th and 5th most visited websites worldwide. We can assume that crawl budget is a thing for them.
Before we dive into these two cases, let me explain what exactly the error is, using a simple robots.txt example:
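A minimal sketch of the kind of file meant here. The paths are hypothetical and only serve to illustrate the pattern:

```text
# Hypothetical example: none of these paths come from a real site.
User-agent: *
Disallow: /search
Disallow: /cart

# Googlebot and Bingbot follow only this group,
# so /search and /cart stay crawlable for them.
User-agent: Googlebot
User-agent: Bingbot
Disallow: /internal-test
```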
Apparently many people think robots.txt works with inheritance. Simply put, inheritance would mean that you can write ‘general rules’ that apply to everyone, plus specific rules on top for individual bots. In other words, people assume that everything under User-agent: * in this example automatically applies to Googlebot and Bingbot as well. That is not true.
‘User-agent: *’ means that your rules apply to all bots. If you only want to address Googlebot, you write ‘User-agent: Googlebot’. It is therefore often assumed that you can put all common rules under ‘User-agent: *’ and then list only the extra rules under ‘User-agent: Googlebot’. But as soon as you name Googlebot in a specific user-agent group, Google only looks at the rules in that group. So nothing under ‘User-agent: *’ applies to Google (in this case).
If you don’t believe me, this is what Google’s robots.txt tester says:
As already mentioned, large websites make this mistake too. I explain two cases below.
1. The robots.txt from tiktok.com
A fantastic example is TikTok’s /robots.txt, which started blocking Bingbot from its /discover folder one day after the ChatGPT/Bing integration was announced.
What they may not have noticed: none of the Disallow lines under User-agent: * apply to Bingbot any more. Bing can suddenly crawl all those URLs, which is not a good development for their crawl budget.
For those wondering what TikTok’s /discover URLs are, here are the links.
As I said, throw this into a robots.txt tester tool and you’ll see the URL is crawlable.
TikTok clearly didn’t realize this, because a little later they added an extra line:
- Disallow: /link
But if you understand the logic I described earlier, you know that this rule does not apply to Bingbot either. I’m sure TikTok is not aware of this.
So what should TikTok actually do? They could write this:
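A sketch of that structure. Only /discover and /link come from the TikTok rules mentioned in this article; the other path is a placeholder, and putting /link in the shared group is an assumption. The approach relies on the crawler combining every group that names it, which RFC 9309 requires and Google documents, so it is worth confirming in a tester for other bots:

```text
# Shared rules: this group names both * and Bingbot,
# so every bot, Bingbot included, has to follow it.
User-agent: *
User-agent: Bingbot
# Hypothetical shared rule
Disallow: /placeholder-a
Disallow: /link

# Extra rule for Bingbot only.
User-agent: Bingbot
Disallow: /discover
```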
Yes, this is allowed, and the shared rules now apply to Bingbot as well as to all other bots. Note that Bingbot is listed a second time at the bottom, with its own specific line.
2. Twitter’s robots.txt
Twitter makes exactly the same mistake in its robots.txt file. In case the file changes in the future, you can find the July 2023 version here.
What do you see here?
Twitter intends the rules marked with curly brackets to be blocked for all bots, and even says so explicitly in a comment. But Googlebot has its own group higher up in the file, and those lines are not in it. So Googlebot can simply crawl those pages.
It concerns these rules:
- Disallow: /oauth
- Disallow: /1/oauth
- Disallow: /i/streams
- Disallow: /i/hello
- Disallow: /i/u
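One way to make these rules count for Googlebot as well, sketched under the assumption that nothing else in the file needs to change, is to repeat them in a group that names Googlebot. Google documents that it combines several groups for the same user agent, but placing the lines inside the existing Googlebot group has the same effect and does not depend on that behaviour:

```text
# The five rules from the list above, repeated for Googlebot.
User-agent: Googlebot
Disallow: /oauth
Disallow: /1/oauth
Disallow: /i/streams
Disallow: /i/hello
Disallow: /i/u
```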
3. A happy accident: a case from SEO Mastermind
I recently saw a nice case in the SEO Mastermind community from someone who had questions about a dubious robots.txt implementation. The web developer wanted to block crawling of JS, CSS and the like for all bots (not a good idea!), but had also written separate rules for Googlebot. A happy accident, because it means those Disallow rules no longer count for Google.
So that was rule 1: only the most specific User-agent counts. On to rule 2.
Rule 2: Avoid unnecessary repetition of rules per user-agent
I find it striking that LinkedIn’s robots.txt contains 4,215 lines (at the time of writing). At first I thought: ‘That’s a big website, so it’s probably necessary.’ Nothing could be further from the truth. The rules themselves do work, which is a good thing. But the file is unnecessarily long, chaotic and very difficult to maintain. That also leaves the door open for errors that are hard to detect.
Altogether the file is about 100 kB. That is not a problem in itself: Google caches your robots.txt for a certain amount of time (luckily!), so it doesn’t have to request the file again for every crawl request. Still, you can’t exactly call this efficient. Google has to process 4,000+ rules each time to check what it may and may not crawl. Moreover, such a disordered file is very difficult to maintain, as my analysis of the rules also shows.
What is LinkedIn doing wrong?
As the screenshot below shows, LinkedIn writes all the rules for each user agent over and over again, even if they are 100% the same rules (360spider and Sogou for example).
Can’t this be done more efficiently?
Of course it can. Otherwise I wouldn’t care about it.
You can simply merge user-agents that have the same rules, which makes the file much more readable. Below the screenshot I have placed an example of the code. This alone saves you 114 lines of code. There are 38 separate user-agents listed in this file, and those exact 114 lines already appear under 17 of them.
Total cut from this one merge: 1,938 lines of code, or nearly half of the file. Voilà.
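A sketch of such a merge, using two of the user-agents mentioned above. The two rules are placeholders standing in for the 114 identical lines; they are not LinkedIn’s actual rules:

```text
# --- Before: the same rules written out once per user-agent ---
User-agent: 360Spider
Disallow: /placeholder-a
Disallow: /placeholder-b

User-agent: Sogou
Disallow: /placeholder-a
Disallow: /placeholder-b

# --- After: one group that names both user-agents ---
User-agent: 360Spider
User-agent: Sogou
Disallow: /placeholder-a
Disallow: /placeholder-b
```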
But it can be even better. Using some calculations in Google Sheets, I reworked the file so that each Disallow/Allow rule is listed only once, even when one user-agent has 113 lines and another has 114 (one extra, specific line).
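The idea can be sketched as follows, with hypothetical bot names and placeholder paths. The shared rules are written once for every user-agent that needs them, and a user-agent with one extra rule is simply named again in a small group of its own. This assumes the crawler combines all groups that name it, as RFC 9309 requires:

```text
# Shared rules, listed once.
User-agent: ExampleBotA
User-agent: ExampleBotB
Disallow: /placeholder-a
Disallow: /placeholder-b

# ExampleBotB has one rule more than ExampleBotA.
User-agent: ExampleBotB
Disallow: /placeholder-extra
```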
The result: I shortened 4,215 lines of code to 514. You can find the shortened version here: https://sitesfy.be/linkedin/robots.txt
For a comparison before and after optimization, you can look here: https://sitesfy.be/show/linkedin
Fun fact: I did this exercise months ago and got the code down to exactly 300 lines at the time. In the meantime, however, LinkedIn changed their source file, and my optimized version suddenly became a lot longer.
What’s the benefit?
All this is much more readable and easier to maintain. My suspicion is that LinkedIn has some updating to do here. That some lines are missing for certain user-agents looks more like an oversight than a deliberate choice. As you will notice in the optimized robots.txt, this suddenly becomes very easy to spot. Just look for this line:
Disallow: /slink*
You will see it listed under 35 user-agents. Higher up in the document, however, the 114 global rules are listed under 36 user-agents. Maybe LinkedIn forgot one somewhere?
Other common errors
I run into quite a few other interesting errors in robots.txt files. I would like to elaborate on those in another article. For now, here is a short summary:
- Blocking URLs that are already indexed.
- Rule conflicts: the most specific rule applies.
- Writing incorrect lines (often because the syntax is unfamiliar).
- Writing rules that don’t exclude everything (often with parameters).
- Blocking resources that are important for rendering.
- Overwriting robots.txt on a new release (often with Drupal) without noticing.
- The classic: entire website blocked after a migration.
- No monitoring set up.
- Listing a relative URL as the sitemap.
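A few of these points can be sketched in robots.txt itself. The domain and paths below are hypothetical, and the matching behaviour shown is what Google documents and RFC 9309 describes; other crawlers may differ:

```text
User-agent: *
# Rule conflicts: the longest matching path wins, so /shop/public/
# stays crawlable even though /shop/ is disallowed.
Disallow: /shop/
Allow: /shop/public/

# Parameters: "/*?sort=" only matches when sort is the first parameter.
# A second rule is needed for "&sort=" further along the query string.
Disallow: /*?sort=
Disallow: /*&sort=

# Sitemap: use an absolute URL, not a relative path such as /sitemap.xml.
Sitemap: https://www.example.com/sitemap.xml
```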
Avoid these robots.txt errors
I regularly come across incorrect implementations of robots.txt. So many different errors, in fact, that I couldn’t discuss them all in this article. The focus was on 2 very drastic, but often unknown, errors. On the one hand, you have to understand that only the most specific user-agent group applies. On the other hand, it is important to keep your robots.txt readable by bundling user-agents that share the same rules.
Now that you know why these errors occur, and what you can do about them, I hope that you, the reader, can benefit from this.