What is a robots.txt file, and why is it important?
A robots.txt file is a simple text file that sits in the root folder of your website.
It is one of the simplest yet most talked-about files on your website.
If your website is www.yourwebsite.com, the robots.txt file can be found by going to www.yourwebsite.com/robots.txt.
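As a rough illustration of that rule, here is a small Python sketch that works out the robots.txt address for any page on a site. The helper name robots_txt_url is made up for this example, and it assumes the URL you pass in includes a scheme such as https://.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://www.yourwebsite.com/blog/some-post?page=2"))
# https://www.yourwebsite.com/robots.txt
```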
Its main purpose is to tell Google and other search engine bots which pages and folders they may or may not crawl. If you wish to keep certain parts of your website away from search engine crawlers, this is one of the first steps you would take. You can think of the robots.txt file as a wall around your website: you decide who's allowed to come in and who's not!
So do I need a robots.txt file for my website?
If you have a simple or smaller website with fewer than 50 to 100 pages, not having a robots.txt file will not hugely impact your site. If you fall into any of the following categories, you may not need a robots.txt file:
1) Your site is small, simple, and has a simple site structure.
2) You want all of your content to be crawled and indexed by search engines.
3) You have nothing you want to be blocked from search engine crawlers.
It's also worth pointing out that you will not get higher SEO rankings simply because you have a robots.txt file.
But if you have sections that you don't want search engine crawlers to crawl, by all means, create one. After all, it only takes a few minutes! A robots.txt file is one of the first files search engine crawlers look for, so it is recommended to have one.
While most crawlers (known as user agents) obey the rules specified in the robots.txt file, some bots completely ignore these instructions and crawl your website anyway! A list of user agents can be found at the end of this post.
How do I create a robots.txt file?
The good news is that most CMSs out there will automatically create a robots.txt file for you. For instance, if you are using WordPress, you can easily create a robots.txt file using an SEO plugin such as Yoast or Rank Math.
If you want to create this file manually, simply create a text file named robots.txt and save it within the root folder of your website.
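If you would rather script the manual step, the sketch below writes a minimal robots.txt into a folder. The function name write_robots_txt is hypothetical, and the example writes to a temporary folder rather than a real web root, so adjust the path to wherever your site's root actually lives.

```python
import tempfile
from pathlib import Path

def write_robots_txt(site_root: str, lines: list[str]) -> Path:
    """Write the given directive lines to <site_root>/robots.txt."""
    target = Path(site_root) / "robots.txt"  # the file name must be lower case
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target

# Demo in a throwaway folder: one directive per line, allowing everything.
with tempfile.TemporaryDirectory() as root:
    path = write_robots_txt(root, ["User-agent: *", "Disallow:"])
    print(path.read_text(encoding="utf-8"))
```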
Correct syntax of a robots.txt file
User-agent: [user-agent name]
Disallow: [URLs not to be crawled]
For example, say you have some PDF files in a folder called /PDF-docs on your website that you don't want to be crawled. You can add this code to your robots.txt file:
User-agent: *
Disallow: /PDF-docs/
This will make sure that none of the files inside the PDF-docs folder are crawled.
If you want only a specific file to be ignored, give the path to that file instead:
User-agent: *
Disallow: /PDF-docs/file-that-dont-need-to-be-crawled.pdf
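If you want to sanity-check rules like these before uploading them, Python's built-in urllib.robotparser can evaluate them for you. This is only a sketch: can_crawl is a helper invented here, "SomeBot" is a placeholder crawler name, brochure.pdf is an assumed file, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def can_crawl(rules: str, path: str, agent: str = "SomeBot") -> bool:
    """Parse rules and report whether agent may fetch path on the example site."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, "https://www.yourwebsite.com" + path)

folder_rule = "User-agent: *\nDisallow: /PDF-docs/"
file_rule = "User-agent: *\nDisallow: /PDF-docs/file-that-dont-need-to-be-crawled.pdf"

print(can_crawl(folder_rule, "/PDF-docs/brochure.pdf"))  # False: whole folder blocked
print(can_crawl(file_rule, "/PDF-docs/brochure.pdf"))    # True: only one file blocked
print(can_crawl(file_rule, "/PDF-docs/file-that-dont-need-to-be-crawled.pdf"))  # False
```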
If you want these files to be crawled by other crawlers and only want to prevent Google from crawling them, name Googlebot as the user agent:
User-agent: Googlebot
Disallow: /PDF-docs/
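To see that a rule like this really is specific to one crawler, you can ask Python's urllib.robotparser the same question for two different user agents. The file name brochure.pdf is an assumption made for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: Googlebot", "Disallow: /PDF-docs/"])

url = "https://www.yourwebsite.com/PDF-docs/brochure.pdf"
print(parser.can_fetch("Googlebot", url))  # False: the rule names Googlebot
print(parser.can_fetch("Bingbot", url))    # True: no rule applies to Bingbot
```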
The rule above blocks everything in the folder, not just the PDFs. The following syntax makes sure only the PDF files are blocked; any other types of files, such as images, will still be crawled. Keep in mind that the * wildcard is understood by major crawlers such as Googlebot and Bingbot, but not necessarily by every bot.
User-agent: Googlebot
Disallow: /PDF-docs/*.pdf
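Here is one way to picture how a wildcard rule is matched. Python's urllib.robotparser does plain prefix matching as far as I can tell, so this sketch translates the pattern into a regular expression by hand. The function names and the file names guide.pdf and logo.png are made up, and the treatment of * and a trailing $ follows the way Google describes them rather than any one crawler's actual code.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, like a robots.txt rule does.
    return pattern_to_regex(pattern).match(path) is not None

print(path_matches("/PDF-docs/*.pdf", "/PDF-docs/guide.pdf"))  # True
print(path_matches("/PDF-docs/*.pdf", "/PDF-docs/logo.png"))   # False
```

Because rules match from the start of the path, /PDF-docs/*.pdf would also match a URL such as /PDF-docs/guide.pdf.html; writing the pattern as /PDF-docs/*.pdf$ restricts it to URLs that end in .pdf.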
The following instructs all search engine bots not to crawl any of the website's files or folders. This is done by disallowing the root / of your website.
User-agent: *
Disallow: /
Use this syntax if you want everything to be crawled by all search engines. An empty Disallow value blocks nothing.
User-agent: *
Disallow:
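The two files differ by a single slash, which is easy to miss. As a quick check, this sketch feeds both versions to Python's urllib.robotparser; "SomeBot" and the page URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(rules: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser

block_all = parse_rules("User-agent: *\nDisallow: /")
allow_all = parse_rules("User-agent: *\nDisallow:")

page = "https://www.yourwebsite.com/any/page.html"
print(block_all.can_fetch("SomeBot", page))  # False: everything is blocked
print(allow_all.can_fetch("SomeBot", page))  # True: nothing is blocked
```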
As you can see, a single character can make the difference between opening up your entire website and blocking it off from every single search engine crawler. So it makes sense to pay extra attention to the syntax of your robots.txt file.
If you have subdirectories or files within a blocked directory that you do want to be crawled, the Allow directive is a handy one. Going back to our example, say you have a sub-folder called important within the /PDF-docs/ folder that you want to be crawled.
User-agent: Googlebot
Disallow: /PDF-docs/
Allow: /PDF-docs/important/
The Disallow: /PDF-docs/ line tells Googlebot not to crawl anything inside the PDF-docs folder.
The Allow: /PDF-docs/important/ line allows everything inside the important folder to be crawled, even though the previous line blocks everything inside PDF-docs.
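When an Allow and a Disallow rule both match a URL, Google documents that the most specific (longest) rule wins. The sketch below imitates that behavior for plain path prefixes only; is_allowed and the file names are made up for illustration, and some simpler parsers instead apply the first matching rule in file order, so results can differ between crawlers.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply the longest matching rule; allow the path if no rule matches.

    rules is a list of (directive, path_prefix) pairs for one user agent.
    Wildcards are not handled in this sketch.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive.lower() == "allow"
            # A longer prefix wins; on a tie, Allow wins over Disallow.
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed

googlebot_rules = [("Disallow", "/PDF-docs/"), ("Allow", "/PDF-docs/important/")]
print(is_allowed("/PDF-docs/important/pricing.pdf", googlebot_rules))  # True
print(is_allowed("/PDF-docs/old-brochure.pdf", googlebot_rules))       # False
print(is_allowed("/blog/", googlebot_rules))                           # True
```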
You can also ask search engines to slow down how quickly they crawl your pages, using the crawl-delay directive. Not every crawler honors it; Googlebot, for one, ignores it.
Crawl-delay: 20
This instructs search engines to wait 20 seconds between requests. A day has 86,400 seconds, so at this setting a crawler can fetch at most 86400/20 = 4320 pages in a day. If your site has more pages than that, you may want to reduce the delay.
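The arithmetic is easy to redo for other delay values. This sketch reads the delay with Python's urllib.robotparser (the crawl_delay method needs Python 3.6 or later) and computes the daily ceiling; max_pages_per_day is a helper made up for this example and "SomeBot" is a placeholder name.

```python
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60  # 86400

def max_pages_per_day(delay_seconds: int) -> int:
    """Upper bound on pages one crawler can fetch per day at this delay."""
    return SECONDS_PER_DAY // delay_seconds

parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 20"])

delay = parser.crawl_delay("SomeBot")
print(delay, max_pages_per_day(delay))  # 20 4320
```

Halving the delay to 10 seconds doubles the ceiling to 8640 pages a day.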
Another important line in your robots.txt file makes life much easier for search engine crawlers: the Sitemap line. It is always best practice to indicate the location of your XML sitemap file in your robots.txt file.
Just as a table of contents guides a reader through a book, a sitemap file guides search engine crawlers through a website.
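For a concrete picture, the sketch below parses a robots.txt that includes a Sitemap line and reads the location back with Python's urllib.robotparser. The sitemap address https://www.yourwebsite.com/sitemap.xml is an assumption (your sitemap may live elsewhere), and the site_maps method needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /PDF-docs/
Sitemap: https://www.yourwebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The Sitemap line takes a full URL and applies to every crawler.
print(parser.site_maps())  # ['https://www.yourwebsite.com/sitemap.xml']
```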
robots.txt Best Practices
It is worth noting that blocking pages in robots.txt does not guarantee they stay out of Google's index; they may still get indexed, especially if other pages link to them. To keep a page out of the index, use the noindex meta tag instead. For the tag to work, the page must not be blocked in robots.txt, because a crawler has to fetch the page to see the tag.
Another point: always save your robots.txt file as robots.txt (all lower case), never as Robots.txt or ROBOTS.txt.
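The noindex tag is a robots meta element in the page's head. As a sketch, this snippet uses Python's built-in html.parser to check whether a page carries it; the class and function names are invented for the example, and the sample page is a bare-bones stand-in for real HTML.

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            tokens = [t.strip().lower() for t in attr.get("content", "").split(",")]
            if "noindex" in tokens:
                self.noindex = True

def has_noindex(html: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex

page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```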
Also, always specify disallowed folders and files as paths relative to the root of your site, never as absolute URLs. The Disallow value is always assumed to be the beginning of the URL's path, so a value that starts with a scheme and host name will generally not match the pages you meant to block.
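A quick way to catch this mistake is to scan the file for Allow or Disallow values that don't start with a slash. The checker below is a minimal sketch, not a full validator; find_absolute_rules is a made-up name and the sample rules reuse the /PDF-docs/ folder from earlier.

```python
def find_absolute_rules(robots_txt: str) -> list[str]:
    """Return Allow/Disallow lines whose value does not start with '/' or '*'."""
    problems = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() in ("allow", "disallow") and value and value[0] not in "/*":
            problems.append(raw.strip())
    return problems

sample = """\
User-agent: *
Disallow: https://www.yourwebsite.com/PDF-docs/
Disallow: /PDF-docs/
"""

print(find_absolute_rules(sample))
# ['Disallow: https://www.yourwebsite.com/PDF-docs/']
```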
What is a user agent?
A user agent is any piece of software that acts on behalf of a user. User agents can access, retrieve, and render content on the web. Browsers such as Mozilla Firefox, Google Chrome, Safari, and Microsoft Edge are examples of browser user agents. Other applications can act as user agents too, for instance web crawlers such as Googlebot, Bingbot, and Slurp (Yahoo).
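In practice, a user agent announces itself through the User-Agent header it sends with each request, and that name is what the User-agent lines in robots.txt are matched against. The sketch below only builds a request object with Python's urllib; nothing is sent over the network, and "ExampleBot/1.0" is a made-up crawler name.

```python
from urllib.request import Request

# Build (but do not send) a request that identifies itself as ExampleBot.
request = Request(
    "https://www.yourwebsite.com/",
    headers={"User-Agent": "ExampleBot/1.0"},
)

# urllib stores header names with only the first letter capitalized.
print(request.get_header("User-agent"))  # ExampleBot/1.0
```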
List of common user agents
- Slurp (Yahoo)
- ia_archiver (Alexa)