When a user asks ChatGPT, Perplexity, or Google AI Overviews a question, the system does not simply return the top search result. It retrieves relevant text passages from a pre-built index, synthesizes a response from those passages, and attributes specific claims to their source pages. This architecture - called Retrieval-Augmented Generation (RAG) - is the mechanism behind every AI citation decision. Understanding it precisely is the foundation of GEO.
Audit whether AI crawlers can reach your content
Before any retrieval can happen, AI crawlers must be able to access and index your pages. Use the crawlability checker to verify that your key pages are accessible to crawlers generally, then manually check the robots.txt rules that apply to AI-specific user-agents.
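If you prefer to script that robots.txt check, here is a minimal sketch using Python's standard-library parser. The site URL, page paths, and user-agent list are placeholders to swap for your own:

```python
# Minimal sketch: check whether common AI crawler user-agents may fetch key pages.
# SITE and KEY_PAGES are placeholders -- substitute your own domain and URLs.
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]
KEY_PAGES = ["/", "/blog/", "/docs/"]

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the live robots.txt

for agent in AI_USER_AGENTS:
    for path in KEY_PAGES:
        verdict = "allowed" if parser.can_fetch(agent, f"{SITE}{path}") else "BLOCKED"
        print(f"{agent:16} {path:24} {verdict}")
```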
Confirm the page is in Google's index
Most AI systems use Google's index as a discovery and quality layer. A page that fails Google's indexability criteria is unlikely to appear in any AI retrieval system's knowledge base.
How RAG architecture works
RAG systems operate in two distinct phases. In the retrieval phase, the system uses the user's query to search a pre-built vector index of text passages and retrieve the most semantically relevant ones. In the generation phase, those passages are fed as context to a language model, which synthesizes a natural-language response and attributes specific claims back to their source passages.
The critical insight: your content is not retrieved as a page - it is retrieved as passages. A page with 2,000 words may be split into 8-12 passages of 150-300 words each. The RAG system scores each passage separately against the query. Only the highest-scoring passages are included as context for the generated response. A page can be indexed and high-ranking but have zero high-scoring passages if the content is structured in ways that resist extraction.
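To make passage-level scoring concrete, here is a minimal sketch of the retrieval phase. The fixed 200-word chunking and the word-overlap scorer are illustrative stand-ins - production RAG systems chunk more carefully and score with embedding similarity - but the structural point is the same: each passage competes against the query on its own.

```python
# Minimal sketch of passage-level retrieval: split a page into fixed-size word
# chunks and score each chunk against the query independently. Real systems use
# embedding similarity; the word-overlap score here is a simple stand-in.
def split_into_passages(text: str, words_per_passage: int = 200) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]

def overlap_score(query: str, passage: str) -> float:
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def retrieve(query: str, page_text: str, top_k: int = 3) -> list[tuple[float, str]]:
    passages = split_into_passages(page_text)
    scored = sorted(((overlap_score(query, p), p) for p in passages), reverse=True)
    return scored[:top_k]  # only the top-scoring passages reach the generation phase
```

Whatever the scoring function, a key fact buried in a low-scoring passage never reaches the generation phase, no matter how well the page as a whole ranks.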
AI crawler user-agents and robots.txt
Each major AI platform runs its own crawler with a distinct user-agent string:
- GPTBot (OpenAI) - used for ChatGPT citation and training.
- ClaudeBot (Anthropic) - used for Claude's retrieval system.
- PerplexityBot (Perplexity) - used for Perplexity's real-time search index.
- Google-Extended (Google) - the robots.txt token that governs use of your content for Gemini; AI Overviews draw on Google's standard Search index, crawled by Googlebot.
A robots.txt rule that targets generic bot patterns (for example, `Disallow: /` under `User-agent: *` with an exception only for Googlebot) will block all of these simultaneously. Many sites carry this configuration unintentionally, introduced when administrators tried to prevent scraping after OpenAI published GPTBot's user-agent string in 2023. The block prevents both training-data collection and citation-index crawling - two separate functions that use the same user-agent.
The right approach: audit your robots.txt for rules that affect GPTBot, ClaudeBot, PerplexityBot, and Google-Extended. Decide consciously whether to allow or disallow each. The default should be to allow unless you have a specific reason to block - a page containing confidential commercial information, for example.
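As an illustration, a robots.txt group that explicitly allows these crawlers for public content while keeping a hypothetical confidential directory off-limits might look like this (the `/internal-pricing/` path is a placeholder):

```txt
# Explicitly allow AI crawlers for public content; block one confidential area.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Allow: /
Disallow: /internal-pricing/
```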
JavaScript rendering and AI crawlers
Most AI crawlers do not execute JavaScript. Content injected client-side after the initial HTML load is invisible to these systems. A single-page application (SPA) that renders its main content via React or Vue after a JS bootstrap will appear as a nearly empty page to non-JS crawlers. The audit move: use the crawlability checker to view the raw HTML response and confirm your key content paragraphs appear in the initial HTML, not only after JS execution.
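A quick way to run the same test outside the checker is to fetch the page without a browser and look for a sentence from your body copy in the raw response. A minimal sketch using the requests library follows; the URL and marker phrase are placeholders:

```python
# Minimal sketch: fetch the raw HTML (no JavaScript execution) and check whether
# a phrase from the page's main content appears in the initial response.
# The URL and marker are placeholders -- use a page and a sentence you know
# should appear in its body copy.
import requests

url = "https://www.example.com/blog/how-rag-works"
marker = "retrieval-augmented generation"

resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
if marker.lower() in resp.text.lower():
    print("Marker found in initial HTML -- visible to non-JS crawlers.")
else:
    print("Marker NOT in initial HTML -- likely injected client-side by JavaScript.")
```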
The biggest mistake: assuming Googlebot access means AI access
Googlebot and AI crawlers are distinct systems with separate user-agent strings and separate permissions. A site can be fully indexed by Google while blocking every AI retrieval crawler. The two most common causes are: (1) a robots.txt rule added specifically to block GPTBot in 2023 that was never revisited, and (2) a CDN-level bot protection rule (common in Cloudflare configurations) that blocks non-browser user-agents as an anti-scraping measure.
The CDN rule scenario is particularly easy to miss because robots.txt checking tools won't catch it - the block is enforced at the network edge and applies regardless of what robots.txt allows. The symptom: the page is crawlable by Googlebot and passes robots.txt, yet is never cited by any AI platform. The fix: review the bot protection rules in your CDN configuration and explicitly whitelist AI crawler user-agents.
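One way to surface an edge-level block is to request the same URL with a browser-like user-agent and with the AI crawler user-agent strings, then compare status codes: a 403 or challenge page for the crawler strings alone points at a CDN rule. A minimal sketch follows; the URL is a placeholder, and a spoofed user-agent only approximates a real crawler request (it does not come from the crawler's published IP ranges), so treat a pass as suggestive rather than definitive:

```python
# Minimal sketch: compare HTTP status codes across user-agent strings. A 403 or
# challenge response for the crawler strings (but not the browser) suggests a
# CDN/bot-protection block that robots.txt tools will never see.
import requests

url = "https://www.example.com/blog/how-rag-works"  # placeholder
user_agents = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    # The strings below carry the product tokens that bot-protection rules
    # typically match on; check each vendor's docs for the exact published values.
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "PerplexityBot": "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)",
}

for name, ua in user_agents.items():
    resp = requests.get(url, headers={"User-Agent": ua}, timeout=10)
    print(f"{name:15} -> HTTP {resp.status_code}")
```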
What clean AI retrieval readiness looks like
- Open your robots.txt and search for each AI crawler user-agent: GPTBot, ClaudeBot, PerplexityBot, Google-Extended. Confirm the rules are intentional.
- Use the Page Crawlability Checker above on your 5 most important informational pages. Confirm all return accessible status.
- Use the Page Indexability Checker on the same pages. Confirm each is indexed with no noindex signals (a minimal noindex spot check is sketched after this list).
- View the raw HTML source of one page (right-click, View Source). Confirm your main content paragraphs appear in the raw HTML, not only in the JS-rendered DOM.
- Check your CDN bot protection configuration. Explicitly whitelist GPTBot, ClaudeBot, and PerplexityBot if generic bot blocking rules are in place.
- Query Perplexity with a question your best-indexed page should answer. If it is not cited despite passing the steps above, the issue is likely passage-level structure rather than access (see the next article in this pillar).
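For the indexability item above, a noindex spot check can be scripted as follows. It looks only at the X-Robots-Tag response header and meta robots tags, so it is a quick check rather than a full indexability audit; the URL is a placeholder:

```python
# Minimal sketch of a noindex check: look for a noindex directive in the
# X-Robots-Tag header and in <meta name="robots"/"googlebot"> tags. This covers
# the two most common noindex signals only.
import re
import requests

url = "https://www.example.com/blog/how-rag-works"  # placeholder
resp = requests.get(url, timeout=10)

header_noindex = "noindex" in resp.headers.get("X-Robots-Tag", "").lower()
meta_pattern = r"""<meta[^>]+name=["'](?:robots|googlebot)["'][^>]*content=["'][^"']*noindex"""
meta_noindex = bool(re.search(meta_pattern, resp.text, re.IGNORECASE))

print(f"HTTP status:          {resp.status_code}")
print(f"X-Robots-Tag noindex: {header_noindex}")
print(f"Meta robots noindex:  {meta_noindex}")
```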
Generative AI retrieval - quick check
5 randomized questions drawn from a pool of 10. Different every time you take it. Takes about two minutes.
Next in the GEO pillar
- How to Structure Content for AI Citation - once access is confirmed, passage-level writing is the next lever.
- How to Build Brand Entity Authority for AI Knowledge Graphs - entity signals determine citation confidence once passages are being retrieved.
