The most expensive SEO mistakes on this file come from misunderstanding what robots.txt actually does. Every quarter, somebody on every team adds a Disallow rule trying to hide a page from Google - and Google indexes the URL anyway, with no content under it. Or they paste a generic snippet from Stack Overflow that quietly blocks the assets Googlebot needs to render the page. The file is twelve lines of plain text, and a single wrong character can deindex your homepage.
This guide skips the syntax reference (the tool below covers that) and focuses on what robots.txt is actually for, the five mistakes that hurt rankings on real sites, and the 15-minute audit you can run before your next deploy.
Check any site's robots.txt right now
Paste any domain. The tool fetches the file, parses every directive, and tells you exactly which paths each bot is allowed or blocked from crawling - useful both for auditing your own site and for reverse-engineering what a competitor blocks.
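If you'd rather script the same check, a rough stand-in using only Python's standard library looks like this - the domain, user agent, and paths are placeholders, and note that urllib.robotparser does plain prefix matching, so results for Google-style * and $ wildcard rules are only approximate:

from urllib.robotparser import RobotFileParser

def check(domain, user_agent, paths):
    # Fetch and parse https://<domain>/robots.txt once, then test each path.
    rp = RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()
    for path in paths:
        allowed = rp.can_fetch(user_agent, f"https://{domain}{path}")
        print(f"{user_agent} -> {path}: {'allowed' if allowed else 'blocked'}")

check("example.com", "Googlebot", ["/", "/admin/", "/blog/some-post"])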
What robots.txt actually controls (and what it doesn't)
This is the single biggest mental-model fix on the file. Most teams use robots.txt as if it were a "do not index" instruction. It is not. The directive is Disallow, and it instructs well-behaved bots not to crawl a URL. Crawling and indexing are two different things, and the gap between them is where most of the costly mistakes live.
- Crawling - Googlebot fetches the page's HTML.
- Indexing - Google decides whether to include that URL in its search index.
If a URL is linked from anywhere on the public web, Google can - and often does - index that URL even when robots.txt tells it not to crawl. The result is the worst of both worlds: the URL appears in the SERP with the gray "No information is available for this page" snippet, because Google can see the URL exists but can't read what's on it. You wanted it hidden; you got it advertised.
The decision rule:
- Want a URL not crawled (because crawling it wastes budget - e.g. a faceted filter that generates millions of low-value combinations) → use a robots.txt Disallow rule.
- Want a URL not indexed (because the page exists for users but shouldn't show in Google) → let Googlebot crawl it, and serve a <meta name="robots" content="noindex"> tag (or an X-Robots-Tag header). Disallow makes this worse, not better - Googlebot can't fetch the page to see the noindex.
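To make the second branch concrete, here's a minimal sketch of "crawlable but not indexable" using only Python's standard library. The /internal/ prefix and port are made up for illustration; in practice you'd add the same X-Robots-Tag header in whatever server or framework you actually run.

from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoindexHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # The header form also covers PDFs, images, and anything else
        # that has no HTML <head> to put a meta tag in.
        if self.path.startswith("/internal/"):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()

if __name__ == "__main__":
    # No Disallow rule needed: Googlebot can crawl /internal/, sees the
    # noindex, and drops those URLs from the index.
    HTTPServer(("", 8000), NoindexHandler).serve_forever()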
SEO forums are full of "we removed the rule from robots.txt and the URL finally dropped out of Google within a week" stories. The fix isn't intuitive - but if you've ever wondered why a URL you blocked is still in the index, this is almost always why.
The five robots.txt mistakes that hurt SEO
1. Using Disallow to keep pages out of the index
Covered above - the most common and most expensive mistake on this file. If a page is sensitive (admin, internal tooling, paywalled content), don't list it in robots.txt. That file is publicly readable. You're advertising the URL to anyone who visits yourdomain.com/robots.txt. Use authentication for sensitive content; use noindex with crawling allowed for "exists for users, hidden from Google".
2. Blocking CSS or JavaScript
The legacy advice - "block /wp-content/ to save crawl budget" - is from a decade ago and now actively breaks rankings. Google renders pages the way a browser does, which means it needs to fetch your CSS and JS to evaluate layout, mobile-friendliness, and Core Web Vitals. Block those assets and Googlebot sees an unstyled, broken page. That's what gets ranked.
Audit move: search your robots.txt for Disallow rules touching .css, .js, /wp-content/, /wp-includes/, /static/, /_next/, or any framework's build-output directory. If you find any, the fix is to remove the rule. Google has been explicit since 2014 that blocking rendering assets is a ranking issue, and it's only become more important as core algorithms have leaned harder on rendered output.
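If you want to automate that search, a rough sketch follows - the ASSET_HINTS list is an assumption, so extend it to cover your own framework's build-output directory:

import urllib.request

ASSET_HINTS = (".css", ".js", "/wp-content/", "/wp-includes/", "/static/", "/_next/")

def blocked_asset_rules(domain):
    # Flag any Disallow line whose path mentions a rendering-asset location.
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    flagged = []
    for line in lines:
        directive, _, value = line.partition(":")
        if directive.strip().lower() == "disallow":
            path = value.split("#")[0].strip().lower()
            if any(hint in path for hint in ASSET_HINTS):
                flagged.append(line.strip())
    return flagged

print(blocked_asset_rules("example.com"))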
3. Forgetting the Sitemap: declaration
One line at the bottom of robots.txt tells every search engine where your XML sitemap lives:
Sitemap: https://yourdomain.com/sitemap.xml
This isn't a Disallow rule - it's a discoverability hint. Googlebot re-fetches robots.txt frequently (the cached copy is generally refreshed within about 24 hours), so the Sitemap: line gets re-read constantly. Without it, you're relying on Search Console's manual sitemap submission and on Googlebot eventually finding the sitemap by guessing common paths. Both are slower than just declaring it once.
You can declare multiple sitemaps. If you split sitemaps by section (posts, products, categories), list each one on its own Sitemap: line.
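For example (the filenames here are placeholders - use whatever your sitemap generator actually produces):

Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-products.xml
Sitemap: https://yourdomain.com/sitemap-categories.xml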
4. Letting robots.txt 5xx during deploys
This one is sneaky. When Googlebot requests /robots.txt and gets a 5xx response (or a connection timeout), it doesn't fall back to "no rules apply, crawl freely." It pauses crawling your site entirely until the file returns a valid response. Google will retry, but extended 5xx windows can shrink your crawl rate for days afterwards.
Audit move: include /robots.txt in your uptime monitor's checks (most teams monitor the homepage and forget the file). If you're behind a CDN, confirm the file is cached at the edge so origin outages don't take it down. And if you ever take the site down for maintenance, keep serving robots.txt - even as a hardcoded static response from the CDN or web server - so it stays available while everything else is offline.
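If your monitor can run a script, a probe along these lines works - a sketch that assumes a cron-style scheduler alerting on a non-zero exit code, with yourdomain.com as a placeholder:

import sys
import urllib.request
import urllib.error

def robots_status(domain):
    # Return the HTTP status of /robots.txt, or 0 for a timeout / connection failure.
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except (urllib.error.URLError, TimeoutError):
        return 0

if __name__ == "__main__":
    status = robots_status("yourdomain.com")
    # 200 and 404 are both safe; a 5xx or no response is what pauses Googlebot.
    sys.exit(0 if status in (200, 404) else 1)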
5. Wildcard rules that block more than you intended
The wildcard syntax in robots.txt is permissive but unintuitive. The pattern Disallow: /*? is supposed to block parameterized URLs like /page?ref=email - but it also blocks legitimate URLs that happen to contain ? anywhere. Common offenders:
- Disallow: /*? - blocks every URL with a query string, including search pages, filtered category pages, and tracking-parameter-decorated entrances from email or paid traffic.
- Disallow: /*.pdf - without the trailing $ anchor this blocks any URL containing .pdf anywhere, not just files ending in it - an HTML page at a path like /blog/annual-report.pdf.html gets caught too. The correct form is Disallow: /*.pdf$.
- Disallow: / - the nuclear option. A single forward slash blocks crawling of the entire site. Always check this isn't lurking after a copy-paste.
Audit move: any time you add a wildcard rule, run the checker above against three test URLs - one you expect to be blocked, one you expect to be allowed, and one that's an edge case. If the third surprises you, refine the rule before deploying.
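For a quick offline sanity check, note that Python's built-in urllib.robotparser does plain prefix matching and ignores * and $, so here is a rough sketch that mimics Google-style matching for a single pattern - it tests one Disallow pattern in isolation, not full Allow/Disallow precedence:

import re

def rule_matches(pattern, url_path):
    # Google-style matching: "*" matches any run of characters,
    # a trailing "$" anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex + ("$" if anchored else ""), url_path) is not None

print(rule_matches("/*.pdf$", "/files/report.pdf"))            # True  - blocked, as intended
print(rule_matches("/*.pdf$", "/blog/annual-report.pdf.html")) # False - allowed
print(rule_matches("/*.pdf",  "/blog/annual-report.pdf.html")) # True  - the missing $ overblocks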
What a clean robots.txt audit looks like
Run this after any deploy that touches robots.txt, after any framework upgrade (Next.js, Rails, WordPress plugins all manage this file differently), and on a quarterly cadence regardless. It takes 15 minutes for a small site, 30 for a larger one.
- Fetch and read the actual file. Don't trust your CMS preview - fetch https://yourdomain.com/robots.txt directly in a browser or via curl. Confirm what's actually being served, not what's in your config. The two diverge surprisingly often.
- Verify it returns 200. A 404 on /robots.txt is fine - Googlebot treats it as "no rules". A 5xx is not. If you don't have a robots.txt file, serve a 404 cleanly - don't let your CMS render an HTML error page with a 200 status code.
- Check for blocked rendering assets. Search the file for .css, .js, /static/, /_next/, /wp-content/, /wp-includes/. None of these should be in a Disallow rule. Use Search Console's URL Inspection tool on a few key pages and check the "Page resources" section for blocked resources.
- Confirm the Sitemap declaration is present and current. The Sitemap: line should point to a URL that itself returns 200 and contains your latest content. If you renamed your sitemap path during a migration, update robots.txt the same day (a verification sketch follows this list).
- Test every wildcard rule against three URLs. Use the embedded checker above. Pick one URL you expect to be blocked, one you expect to be allowed, and one edge case (a query-string URL, a file-extension match, a trailing-slash variant). All three should match your expectation.
- Cross-reference with Search Console. Open the "Pages" report → "Why pages aren't indexed" → "Blocked by robots.txt". Every URL listed there should be a deliberate block - confirm you actually meant each one. Anything in the list that looks like a legitimate page that should be ranking is your priority fix.
- Spot-check via URL Inspection. Search Console no longer hosts the standalone robots.txt tester, but the URL Inspection tool will tell you whether a specific URL is blocked. Spot-check 5-10 important URLs (top traffic, top conversion, recent launches) and confirm none are unexpectedly blocked.
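For the Sitemap step above, a small verification sketch - it pulls every Sitemap: line out of robots.txt and confirms each declared URL answers with a 200 (the domain is a placeholder):

import urllib.request
import urllib.error

def check_sitemap_lines(domain):
    # Collect Sitemap: declarations, then HEAD-check each declared URL.
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    sitemaps = [line.strip().split(":", 1)[1].strip()
                for line in lines if line.strip().lower().startswith("sitemap:")]
    if not sitemaps:
        print("No Sitemap: declaration found")
    for url in sitemaps:
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(f"{url}: {resp.status}")
        except urllib.error.HTTPError as e:
            print(f"{url}: {e.code}")

check_sitemap_lines("yourdomain.com")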
One thing the tool can't tell you: Crawl-delay doesn't apply to Google
If you've inherited a robots.txt with Crawl-delay: 10 or similar, know that Googlebot ignores it entirely. The directive is honored by Bingbot and some other crawlers, but Google manages its own crawl rate automatically, throttling based on how quickly your server responds (the old Search Console crawl-rate limiter has been retired). Setting Crawl-delay for Google does nothing - neither good nor bad - but it can quietly slow your indexing on Bing if the value is too aggressive.
If you actually need to slow Googlebot down (rare - usually a sign your server can't handle the crawl), the lever is your server, not robots.txt: temporarily return 503 or 429 responses, which Googlebot backs off from, and fix the capacity problem before it becomes a long-term block.
Grab the one-page audit checklist
A printable version of the 15-minute audit above, plus copy-pasteable starter robots.txt templates for WordPress, Next.js, and static sites - pre-configured to allow rendering assets and declare sitemaps correctly.
Quick quiz: are you ready to audit your own robots.txt?
Five questions, takes two minutes. We'll show you the right answer and a one-line explanation after each one.
robots.txt - quick check
5 randomized questions drawn from a pool of 12. Different every time you take it. Takes about two minutes.
Next up in Technical SEO
You've covered status codes (the entry-point signal) and robots.txt (what gets crawled). The rest of the Technical SEO pillar:
- XML sitemaps - why a broken one is worse than none, and how to keep yours current as your content grows.
- Crawlability vs indexability - the two checks Google makes before your page can rank, and why a page can pass one and fail the other.
- Mixed content and HTTPS - the audit that takes 30 seconds and often reveals a year's worth of debt.