Technical SEO·April 27, 2026·11 min read

How to audit crawlability and indexability

The "page is published but never appears in Google" problem has two phases - can Google crawl it, and if so, will Google index it. Most teams treat them as one and chase the wrong fix. A practical playbook for the two gates, the directive decision tree, and the Search Console workflow that answers most "why isn't this ranking?" questions in 90 seconds.

The "page is published but never appears in Google" debugging game has two phases: can Googlebot crawl the page, and if so, will Google index it? Most teams treat these as the same problem and chase the wrong fix. The page that returns 200 OK but never appears in search isn't usually a "Google penalty." It's almost always one of half a dozen mechanical issues sitting in the gap between crawled and indexed.

This guide skips the per-page mechanics (the two tools below cover those) and focuses on the mental model: the two gates Google walks through, the order they run in, and the audit workflow for fixing pages stuck on the wrong side of either gate.

The two gates: crawlability vs indexability

The single mental-model fix that closes most "why isn't this page ranking?" tickets:

  • Crawlability - Can Googlebot fetch this URL? Determined by robots.txt, the HTTP status code, authentication walls, server availability, and discoverability via internal links or the sitemap.
  • Indexability - Will Google include this URL in the index? Determined by the noindex directive (meta tag or X-Robots-Tag header), the canonical tag, content quality, near-duplicate detection, and Google's render-time evaluation.

Crawl runs first. A page can be crawled but not indexed (Google fetches it, then chooses not to include it - this is the "Crawled - currently not indexed" status in Search Console, and it's usually a quality signal). A page cannot be properly indexed without first being crawled - but if you block crawling, the URL can still appear in Google's index based on external links, with the gray "no information available" snippet.

That last point is the most expensive misunderstanding on this topic. robots.txt Disallow blocks the crawl, not the index. If you want a URL out of search results, the directive is noindex, served on a page Google is allowed to crawl. Counterintuitive, but it's the rule that fixes the bug.
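The two gates can be sketched as two independent checks. This is a minimal offline sketch, not what Google actually runs: it operates on a robots.txt body and page HTML you've already fetched (a real audit would fetch both with the Googlebot user-agent first), and the example robots.txt and URLs are hypothetical.

```python
# Sketch of the two gates, run over already-fetched inputs so it stays
# offline. Gate 1 uses the stdlib robots.txt parser; gate 2 looks for
# the noindex directive in the X-Robots-Tag header or meta robots tag.
import re
from urllib.robotparser import RobotFileParser

def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Gate 1: does robots.txt allow this agent to fetch the URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

def is_indexable(html: str, headers: dict) -> bool:
    """Gate 2: do the page's own directives allow indexing?"""
    # X-Robots-Tag applies to any response type (PDFs, images, HTML).
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return False
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return not (meta and "noindex" in meta.group(1).lower())

robots = "User-agent: *\nDisallow: /private/"
page = '<meta name="robots" content="noindex, follow">'
print(can_crawl(robots, "https://example.com/private/x"))  # False: gate 1 blocks
print(is_indexable(page, {}))                              # False: gate 2 blocks
```

Note the two functions never consult each other - which mirrors the trap above: gate 2's noindex is invisible to Google whenever gate 1 blocks the fetch.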

Check crawlability for any URL

Paste a URL below to see whether Googlebot can crawl it: HTTP status, robots.txt rules that apply, and the meta-robots / X-Robots-Tag directives served by the page.

Try it inline

Page Crawlability Checker

Check whether Googlebot can crawl a page - inspects robots.txt, meta robots, and X-Robots-Tag headers. No login, works on any domain.

Open full tool

Check indexability for any URL

Now run the same URL through the indexability checker. The questions are different: is there a noindex tag? Does the canonical point at this URL or somewhere else? Does the HTTP status allow indexing?

Try it inline

Page Indexability Checker

Check whether a page is indexable - noindex meta, canonical tags, and HTTP status. No login, works on any domain.

Open full tool

The directive decision tree

Three directives are commonly used to control crawl and index behaviour, and they do different things. Pick the wrong one and the URL ends up indexed when you wanted it hidden, or hidden when you wanted it indexed.

"I want this URL out of search results"

Use <meta name="robots" content="noindex"> (or the X-Robots-Tag: noindex response header for non-HTML responses like PDFs). Leave robots.txt allowing the crawl - Googlebot has to be able to fetch the page to see the noindex tag.

For permanent removals (deleted pages, deactivated accounts), pair the noindex with a 410 Gone status code. The combination drops URLs from the index faster than noindex alone.
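One way to serve that pairing - a minimal WSGI sketch, where the handler name and page copy are hypothetical. Most frameworks and servers expose the same two knobs: the status line and a response header.

```python
# Minimal WSGI sketch of the 410-plus-noindex pairing for permanently
# removed pages (deleted content, deactivated accounts).
def removed_page_app(environ, start_response):
    start_response("410 Gone", [
        ("Content-Type", "text/html; charset=utf-8"),
        # Header-level noindex covers the response regardless of body type.
        ("X-Robots-Tag", "noindex"),
    ])
    # Belt and braces: the meta tag repeats the directive in the HTML.
    return [b'<meta name="robots" content="noindex">This page is gone.']
```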

"I want to save crawl budget on infinite low-value URLs"

Use robots.txt Disallow. Faceted filter combinations, internal search results, and parameter-decorated URLs (millions of /products?color=red&size=L&page=4-style variants) are the legitimate use case. Googlebot doesn't waste cycles fetching them, so it has more budget for your real pages.

Caveat: Disallowed URLs can still appear in search results if they're externally linked - the URL shows up with the bare "no information available" snippet. Don't try to patch this by adding noindex to the same URL (Googlebot can't see a noindex on a page it isn't allowed to crawl). Reserve Disallow for URLs where an occasional bare listing is acceptable.
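For the faceted-navigation case above, the robots.txt rules might look like this - paths and parameter names are hypothetical, and the `*` wildcard syntax shown is the extended matching Google documents for its own crawlers:

```
User-agent: *
# Block infinite faceted/filter combinations
Disallow: /*?*color=
Disallow: /*?*size=
# Block internal search result pages
Disallow: /search
```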

"I have multiple URLs serving the same content; pick one as the master"

Use <link rel="canonical" href="...">. Set every indexable page's canonical to itself (self-referencing). For variants - parameterised URLs, paginated archive pages, AMP versions - set the canonical to the master URL. Google consolidates ranking signals onto the canonical and treats the variants as duplicates.
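In markup, both cases point at the same master URL - the only difference is which page the tag sits on (URLs hypothetical):

```html
<!-- On the master page /products/widget: self-referencing -->
<link rel="canonical" href="https://example.com/products/widget">

<!-- On a variant like /products/widget?color=red: point at the master -->
<link rel="canonical" href="https://example.com/products/widget">
```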

Audit move: open Search Console URL Inspection on any URL whose ranking is unclear, and look at the "Canonical" section. If "Google-selected canonical" differs from "User-declared canonical," Google has overridden your hint - usually because stronger signals (more inbound links, better authority) point at the URL Google chose.

Checklist

Crawlability & indexability DOs & DON'Ts

DO

  • Use noindex (with crawl allowed) to keep a page out of the index

    noindex is the only directive that actually prevents indexing. Googlebot has to be able to crawl the page to see the tag - so leave robots.txt open for that URL.

  • Make every canonical tag self-referencing on indexable pages

    If /blog/post canonicalises to /blog/post, you've stated the intent clearly. Forgotten canonicals (or ones pointing at the homepage from a template default) silently consolidate the wrong URLs.

  • Treat "Crawled - currently not indexed" as a quality signal

    Search Console is telling you it crawled the page but chose not to include it. That usually means thin content, near-duplicate of another page on your site, or low-trust signals. Fix the page, then ask for re-indexing.

  • Use Search Console URL Inspection on every "why isn't this ranking?" question

    It tells you whether the URL is crawled, indexed, allowed by robots.txt, blocked by noindex, has a different canonical chosen by Google, and what the rendered HTML looks like. This single tool answers 80% of crawl/index questions.

  • Verify mobile and desktop versions render the same content

    Google indexes the mobile version. If your mobile template hides content behind tabs that don't render in initial HTML, Google may not index it. Use URL Inspection's "Test live URL" with the smartphone crawler and compare the rendered HTML against desktop.

DON'T

  • Don't use robots.txt Disallow to remove a URL from search results

    Disallow stops the crawl, not the index. If the URL is linked anywhere, it can still appear in search with the gray "no information available" snippet. Use noindex instead.

  • Don't combine robots.txt Disallow with a noindex tag on the same URL

    Self-defeating. Googlebot can't fetch the page to see the noindex if Disallow blocks the fetch. Pick one - usually noindex with crawl allowed.

  • Don't ignore "Discovered - currently not indexed"

    It means Google found the URL (probably from a sitemap or internal link) but hasn't crawled it yet. Almost always a sign of insufficient crawl budget or weak internal linking. The fix is usually internal links from high-authority pages, not a re-submit.

  • Don't rely on noindex to drop URLs from the index quickly

    Google has to recrawl the page to see the new noindex tag. For permanent removals, 410 Gone in addition to noindex usually drops URLs faster than noindex alone.

  • Don't trust "Indexed" in Search Console without verifying the canonical

    Google sometimes indexes a different URL than the one you submitted ("Google chose different canonical"). The submitted URL gets credit only if the canonical and indexed URL match.

The biggest mistake: using robots.txt to deindex

This is the single most common "why is my page still in Google?" ticket on every SEO forum. The pattern: a developer adds a URL to robots.txt Disallow expecting it to disappear. A week later it's still in the index, often with the gray "no information is available for this page" snippet. The page is then sometimes also noindexed in panic, which makes things worse - because Disallow now prevents Googlebot from crawling the page to see the noindex.

The correct fix decision:

  1. Remove the Disallow rule from robots.txt. Googlebot needs to fetch the page.
  2. Add <meta name="robots" content="noindex"> to the page (or X-Robots-Tag: noindex for non-HTML).
  3. Wait for Googlebot to recrawl. Speed it up with a Search Console URL Inspection → "Request indexing" pass.
  4. Once the URL drops from the index, you can re-add the Disallow rule if you also want to save crawl budget.

The order matters. If you Disallow before the noindex is seen, Google may keep the URL indexed for months because it can no longer fetch the deindex signal.
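That ordering can be encoded as a tiny decision helper - a sketch, assuming you already know the URL's current robots.txt, on-page, and index state from the checks earlier in this guide:

```python
# Sketch: which deindexing step comes next, given the URL's current state?
# Encodes the rule that Disallow must be lifted before noindex can be seen.
def next_deindex_step(disallowed: bool, has_noindex: bool,
                      still_indexed: bool) -> str:
    if still_indexed and disallowed:
        return "1. Remove the Disallow rule so Googlebot can fetch the page"
    if still_indexed and not has_noindex:
        return "2. Add the noindex directive (meta tag or X-Robots-Tag)"
    if still_indexed:
        return "3. Wait for a recrawl; request indexing in Search Console"
    return "4. Dropped from the index; optionally re-add Disallow for crawl budget"

# The panic pattern from above: Disallow plus noindex, URL still indexed.
print(next_deindex_step(disallowed=True, has_noindex=True, still_indexed=True))
# → 1. Remove the Disallow rule so Googlebot can fetch the page
```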

What a clean crawl/index audit looks like

Run this whenever pages aren't ranking despite "looking fine," after migrations, after CMS upgrades, and on a quarterly cadence regardless. Takes about 25 minutes for a small site.

  1. Open Search Console → Pages report. Group "Why pages aren't indexed" by reason. The top three categories - "Crawled - currently not indexed," "Discovered - currently not indexed," and "Submitted URL has 'noindex'" - each have a different fix. Don't lump them together.
  2. For "Crawled - currently not indexed": fix the page, not the directive. Google read the page and chose not to index it. That's a quality, near-duplicate, or low-authority signal. Check for thin content, near-duplicate of another page on your own site, or insufficient internal-link authority. Improve the page, then request re-indexing.
  3. For "Discovered - currently not indexed": improve internal linking. Google found the URL but hasn't crawled it yet. Almost always a sign of weak internal-link authority or insufficient crawl budget. Add prominent internal links from your highest-authority pages and the URL gets crawled within days.
  4. For "Submitted URL has 'noindex'": pick one signal. Either remove the URL from your sitemap (you've told Google not to index it, so why submit it?) or remove the noindex tag (you want it indexed after all).
  5. For "Blocked by robots.txt": double-check intent. Anything in this list is a deliberate block. Confirm you actually meant to block each URL. If a legitimate ranking page is here, it's the highest-priority fix.
  6. Spot-check 10 priority URLs via URL Inspection. Pick 10 high-traffic or high-conversion URLs. For each: confirm it's "URL is on Google," confirm "User-declared canonical" matches "Google-selected canonical," confirm the rendered HTML contains the content you expect (the rendered tab, not the source tab). This catches render-time issues that crawl-only checks miss.
  7. Verify the mobile render. Google indexes the mobile version. If your mobile template hides 60% of the body content behind a tab that loads on user click, Google may not index that content. Use URL Inspection's "Test live URL" with the smartphone Googlebot and inspect the rendered HTML.

Grab the one-page audit checklist

A printable version of the audit above, plus a one-page directive decision tree (the noindex vs Disallow vs canonical reference) you can pin near your monitor.

Free download

The Crawlability & Indexability Audit Checklist

A printable one-pager with the two-gate mental model, the directive decision tree (robots.txt vs noindex vs canonical), and the Search Console workflow for fixing pages stuck in "Crawled - not indexed".

Quick quiz: are you ready to audit your own crawl/index state?

Five questions, takes two minutes. We'll show you the right answer and a one-line explanation after each one.

Quick quiz · 5 questions

Crawlability & indexability - quick check

5 randomized questions drawn from a pool of 12. Different every time you take it. Takes about two minutes.

Next up in Technical SEO

You've covered status codes, robots.txt, sitemaps, and the crawl/index gates. The last piece of the Technical SEO pillar:

  • Mixed content and HTTPS - the audit that takes 30 seconds and often reveals a year's worth of debt left over from your last platform migration.

Keep learning

More in Technical SEO

How to audit HTTP status codes for SEO

9 min read

How to audit your robots.txt for SEO

10 min read

The XML sitemap playbook for SEO

10 min read

Skip the writing. Keep the SEO.

SEOGraphy drafts, illustrates, and publishes articles that follow the playbook above - automatically.

Try SEOGraphy free →