The most expensive SEO mistakes on this file come from misunderstanding what robots.txt actually does. Every quarter, somebody on every team adds a Disallow rule trying to hide a page from Google - and Google indexes the URL anyway, with no content under it. Or they paste a generic snippet from Stack Overflow that quietly blocks the assets Googlebot needs to render the page. The file is twelve lines of plain text, and a single wrong character can deindex your homepage.
This guide skips the syntax reference (the tool below covers that) and focuses on what robots.txt is actually for, the five mistakes that hurt rankings on real sites, and the 15-minute audit you can run before your next deploy.
Check any site's robots.txt right now
Paste any domain. The tool fetches the file, parses every directive, and tells you exactly which paths each bot is allowed or blocked from crawling - useful both for auditing your own site and for reverse-engineering what a competitor blocks.
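If you'd rather script the same check, a rough stand-in using only Python's standard library looks like this - the domain, user agent, and paths are placeholders, and note that urllib.robotparser does plain prefix matching, so results for Google-style * and $ wildcard rules are only approximate:

from urllib.robotparser import RobotFileParser

def check(domain, user_agent, paths):
    # Fetch and parse https://<domain>/robots.txt once, then test each path.
    rp = RobotFileParser(f"https://{domain}/robots.txt")
    rp.read()
    for path in paths:
        allowed = rp.can_fetch(user_agent, f"https://{domain}{path}")
        print(f"{user_agent} -> {path}: {'allowed' if allowed else 'blocked'}")

check("example.com", "Googlebot", ["/", "/admin/", "/blog/some-post"])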
What robots.txt actually controls (and what it doesn't)
This is the single biggest mental-model fix on the file. Most teams use robots.txt as if it were a "do not index" instruction. It is not. The directive is Disallow, and it instructs well-behaved bots not to crawl a URL. Crawling and indexing are two different things, and the gap between them is where most of the costly mistakes live.
- Crawling - Googlebot fetches the page's HTML.
- Indexing - Google decides whether to include that URL in its search index.
If a URL is linked from anywhere on the public web, Google can - and often does - index that URL even when robots.txt tells it not to crawl. The result is the worst of both worlds: the URL appears in the SERP with the gray "No information is available for this page" snippet, because Google can see the URL exists but can't read what's on it. You wanted it hidden; you got it advertised.
The decision rule:
- Want a URL not crawled (because crawling it wastes budget - e.g. a faceted filter that generates millions of low-value combinations) → use a robots.txt Disallow rule.
- Want a URL not indexed (because the page exists for users but shouldn't show in Google) → let Googlebot crawl it, and serve a <meta name="robots" content="noindex"> tag (or an X-Robots-Tag header). Disallow makes this worse, not better - Googlebot can't fetch the page to see the noindex.
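To make the second branch concrete, here's a minimal sketch of "crawlable but not indexable" using only Python's standard library. The /internal/ prefix and port are made up for illustration; in practice you'd add the same X-Robots-Tag header in whatever server or framework you actually run.

from http.server import HTTPServer, SimpleHTTPRequestHandler

class NoindexHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # The header form also covers PDFs, images, and anything else
        # that has no HTML <head> to put a meta tag in.
        if self.path.startswith("/internal/"):
            self.send_header("X-Robots-Tag", "noindex")
        super().end_headers()

if __name__ == "__main__":
    # No Disallow rule needed: Googlebot can crawl /internal/, sees the
    # noindex, and drops those URLs from the index.
    HTTPServer(("", 8000), NoindexHandler).serve_forever()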
SEO forums are full of "we removed the rule from robots.txt and the URL finally dropped out of Google within a week" stories. The fix isn't intuitive - but if you've ever wondered why a URL you blocked is still in the index, this is almost always why.
The five robots.txt mistakes that hurt SEO
1. Using Disallow to keep pages out of the index
Covered above - the most common and most expensive mistake on this file. If a page is sensitive (admin, internal tooling, paywalled content), don't list it in robots.txt. That file is publicly readable. You're advertising the URL to anyone who visits yourdomain.com/robots.txt. Use authentication for sensitive content; use noindex with crawling allowed for "exists for users, hidden from Google".
2. Blocking CSS or JavaScript
The legacy advice - "block /wp-content/ to save crawl budget" - is from a decade ago and now actively breaks rankings. Google renders pages the way a browser does, which means it needs to fetch your CSS and JS to evaluate layout, mobile-friendliness, and Core Web Vitals. Block those assets and Googlebot sees an unstyled, broken page. That's what gets ranked.
Audit move: search your robots.txt for Disallow rules touching .css, .js, /wp-content/, /wp-includes/, /static/, /_next/, or any framework's build-output directory. If you find any, the fix is to remove the rule. Google has been explicit since 2014 that blocking rendering assets is a ranking issue, and it's only become more important as core algorithms have leaned harder on rendered output.
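If you want to automate that search, a rough sketch follows - the ASSET_HINTS list is an assumption, so extend it to cover your own framework's build-output directory:

import urllib.request

ASSET_HINTS = (".css", ".js", "/wp-content/", "/wp-includes/", "/static/", "/_next/")

def blocked_asset_rules(domain):
    # Flag any Disallow line whose path mentions a rendering-asset location.
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    flagged = []
    for line in lines:
        directive, _, value = line.partition(":")
        if directive.strip().lower() == "disallow":
            path = value.split("#")[0].strip().lower()
            if any(hint in path for hint in ASSET_HINTS):
                flagged.append(line.strip())
    return flagged

print(blocked_asset_rules("example.com"))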
3. Forgetting the Sitemap: declaration
One line at the bottom of robots.txt tells every search engine where your XML sitemap lives:
Sitemap: https://yourdomain.com/sitemap.xml
This isn't a Disallow rule - it's a discoverability hint. Googlebot re-fetches robots.txt frequently (the cached copy is generally refreshed within about 24 hours), so the Sitemap: line gets re-read constantly. Without it, you're relying on Search Console's manual sitemap submission and on Googlebot eventually finding the sitemap by guessing common paths. Both are slower than just declaring it once.
You can declare multiple sitemaps. If you split sitemaps by section (posts, products, categories), list each one on its own Sitemap: line.
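For example (the filenames here are placeholders - use whatever your sitemap generator actually produces):

Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-products.xml
Sitemap: https://yourdomain.com/sitemap-categories.xml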
4. Letting robots.txt 5xx during deploys
This one is sneaky. When Googlebot requests /robots.txt and gets a 5xx response (or a connection timeout), it doesn't fall back to "no rules apply, crawl freely." It pauses crawling your site entirely until the file returns a valid response. Google will retry, but extended 5xx windows can shrink your crawl rate for days afterwards.
Audit move: include /robots.txt in your uptime monitor's checks (most teams monitor the homepage and forget the file). If you're behind a CDN, confirm the file is cached at the edge so origin outages don't take it down. And if you ever take the site down for maintenance, keep serving robots.txt - even as a hardcoded static response from the CDN or web server - so it stays available while everything else is offline.
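If your monitor can run a script, a probe along these lines works - a sketch that assumes a cron-style scheduler alerting on a non-zero exit code, with yourdomain.com as a placeholder:

import sys
import urllib.request
import urllib.error

def robots_status(domain):
    # Return the HTTP status of /robots.txt, or 0 for a timeout / connection failure.
    try:
        with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except (urllib.error.URLError, TimeoutError):
        return 0

if __name__ == "__main__":
    status = robots_status("yourdomain.com")
    # 200 and 404 are both safe; a 5xx or no response is what pauses Googlebot.
    sys.exit(0 if status in (200, 404) else 1)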
5. Wildcard rules that block more than you intended
The wildcard syntax in robots.txt is permissive but unintuitive. The pattern Disallow: /*? is supposed to block parameterized URLs like /page?ref=email - but it also blocks legitimate URLs that happen to contain ? anywhere. Common offenders:
- Disallow: /*? - blocks every URL with a query string, including search pages, filtered category pages, and tracking-parameter-decorated entrances from email or paid traffic.
- Disallow: /*.pdf - without the trailing $ anchor this blocks any URL containing .pdf anywhere, not just files ending in it - an HTML page at a path like /blog/annual-report.pdf.html gets caught too. The correct form is Disallow: /*.pdf$.
- Disallow: / - the nuclear option. A single forward slash blocks crawling of the entire site. Always check this isn't lurking after a copy-paste.
Audit move: any time you add a wildcard rule, run the checker above against three test URLs - one you expect to be blocked, one you expect to be allowed, and one that's an edge case. If the third surprises you, refine the rule before deploying.
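For a quick offline sanity check, note that Python's built-in urllib.robotparser does plain prefix matching and ignores * and $, so here is a rough sketch that mimics Google-style matching for a single pattern - it tests one Disallow pattern in isolation, not full Allow/Disallow precedence:

import re

def rule_matches(pattern, url_path):
    # Google-style matching: "*" matches any run of characters,
    # a trailing "$" anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.match(regex + ("$" if anchored else ""), url_path) is not None

print(rule_matches("/*.pdf$", "/files/report.pdf"))            # True  - blocked, as intended
print(rule_matches("/*.pdf$", "/blog/annual-report.pdf.html")) # False - allowed
print(rule_matches("/*.pdf",  "/blog/annual-report.pdf.html")) # True  - the missing $ overblocks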
What a clean robots.txt audit looks like
Run this after any deploy that touches robots.txt, after any framework upgrade (Next.js, Rails, WordPress plugins all manage this file differently), and on a quarterly cadence regardless. It takes 15 minutes for a small site, 30 for a larger one.
- Fetch and read the actual file. Don't trust your CMS preview - fetch https://yourdomain.com/robots.txt directly in a browser or via curl. Confirm what's actually being served, not what's in your config. The two diverge surprisingly often.
- Verify it returns 200. A 404 on /robots.txt is fine - Googlebot treats it as "no rules". A 5xx is not. If you don't have a robots.txt file, serve a 404 cleanly - don't let your CMS render an HTML error page with a 200 status code.
- Check for blocked rendering assets. Search the file for .css, .js, /static/, /_next/, /wp-content/, /wp-includes/. None of these should be in a Disallow rule. Use Search Console's URL Inspection tool on a few key pages and check the "Page resources" section for blocked resources.
- Confirm the Sitemap declaration is present and current. The Sitemap: line should point to a URL that itself returns 200 and contains your latest content. If you renamed your sitemap path during a migration, update robots.txt the same day (a verification sketch follows this list).
- Test every wildcard rule against three URLs. Use the embedded checker above. Pick one URL you expect to be blocked, one you expect to be allowed, and one edge case (a query-string URL, a file-extension match, a trailing-slash variant). All three should match your expectation.
- Cross-reference with Search Console. Open the "Pages" report → "Why pages aren't indexed" → "Blocked by robots.txt". Every URL listed there should be a deliberate block - confirm you actually meant each one. Anything in the list that looks like a legitimate page that should be ranking is your priority fix.
- Spot-check via URL Inspection. Search Console no longer hosts the standalone robots.txt tester, but the URL Inspection tool will tell you whether a specific URL is blocked. Spot-check 5-10 important URLs (top traffic, top conversion, recent launches) and confirm none are unexpectedly blocked.
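For the Sitemap step above, a small verification sketch - it pulls every Sitemap: line out of robots.txt and confirms each declared URL answers with a 200 (the domain is a placeholder):

import urllib.request
import urllib.error

def check_sitemap_lines(domain):
    # Collect Sitemap: declarations, then HEAD-check each declared URL.
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()
    sitemaps = [line.strip().split(":", 1)[1].strip()
                for line in lines if line.strip().lower().startswith("sitemap:")]
    if not sitemaps:
        print("No Sitemap: declaration found")
    for url in sitemaps:
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                print(f"{url}: {resp.status}")
        except urllib.error.HTTPError as e:
            print(f"{url}: {e.code}")

check_sitemap_lines("yourdomain.com")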
One thing the tool can't tell you: Crawl-delay doesn't apply to Google
If you've inherited a robots.txt with Crawl-delay: 10 or similar, know that Googlebot ignores it entirely. The directive is honored by Bingbot and some other crawlers, but Google manages its own crawl rate automatically, throttling based on how quickly your server responds (the old Search Console crawl-rate limiter has been retired). Setting Crawl-delay for Google does nothing - neither good nor bad - but it can quietly slow your indexing on Bing if the value is too aggressive.
If you actually need to slow Googlebot down (rare - usually a sign your server can't handle the crawl), the lever is your server, not robots.txt: temporarily return 503 or 429 responses, which Googlebot backs off from, and fix the capacity problem before it becomes a long-term block.
Grab the one-page audit checklist
A printable version of the 15-minute audit above, plus copy-pasteable starter robots.txt templates for WordPress, Next.js, and static sites - pre-configured to allow rendering assets and declare sitemaps correctly.
Quick quiz: are you ready to audit your own robots.txt?
Five questions, takes two minutes. We'll show you the right answer and a one-line explanation after each one.
robots.txt - quick check
5 randomized questions drawn from a pool of 12. Different every time you take it. Takes about two minutes.
Next up in Technical SEO
You've covered status codes (the entry-point signal) and robots.txt (what gets crawled). The rest of the Technical SEO pillar:
- XML sitemaps - why a broken one is worse than none, and how to keep yours current as your content grows.
- Crawlability vs indexability - the two checks Google makes before your page can rank, and why a page can pass one and fail the other.
- Mixed content and HTTPS - the audit that takes 30 seconds and often reveals a year's worth of debt.