Generative Engine Optimization · April 28, 2026 · 8 min read

How Generative AI Retrieves and Cites Sources

AI answer systems do not rank URLs - they retrieve text passages. Understanding the RAG (Retrieval-Augmented Generation) architecture tells you exactly why some pages get cited and others do not, even when both rank well in traditional search.

When a user asks ChatGPT, Perplexity, or Google AI Overviews a question, the system does not simply return the top search result. It retrieves relevant text passages from a pre-built index, synthesizes a response from those passages, and attributes specific claims to their source pages. This architecture - called Retrieval-Augmented Generation (RAG) - is the mechanism behind every AI citation decision. Understanding it precisely is the foundation of GEO.

Audit whether AI crawlers can reach your content

Before any retrieval can happen, AI crawlers must be able to access and index your pages. Use the crawlability checker to verify that your key pages are accessible to crawlers generally, then manually verify robots.txt permissions for AI-specific user-agents.
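That manual audit can be sketched in a few lines of Python (standard library only). The URL, the helper names, and the status-code thresholds here are illustrative assumptions, not documented platform behavior:

```python
import urllib.request, urllib.error

# User-agent strings for the major AI retrieval crawlers.
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

def fetch_status(url: str, user_agent: str) -> int:
    # Return the HTTP status code the server sends to this user-agent.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def classify(status: int) -> str:
    # 2xx means reachable; 401/403/429 usually indicate a bot-blocking layer.
    if 200 <= status < 300:
        return "accessible"
    if status in (401, 403, 429):
        return "blocked"
    return "check manually"

# Example (live network calls, uncomment to run):
# for ua in AI_AGENTS:
#     print(ua, classify(fetch_status("https://example.com/page", ua)))
```

A redirect chain that ends in 200 still counts as accessible here, since `urlopen` follows redirects before reporting the final status.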

Page Crawlability Checker

Check whether a URL is accessible to web crawlers.


Confirm the page is in Google's index

Most AI systems rely on an established search index - Google's or Bing's - as a discovery and quality layer. A page that fails Google's indexability criteria is unlikely to appear in any AI retrieval system's knowledge base.
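A minimal sketch of a noindex check. It assumes the common attribute order (`name` before `content`) inside the meta tag, so treat it as illustrative rather than a complete parser:

```python
import re

def has_noindex(html: str, x_robots_tag: str = "") -> bool:
    # Simplified check: look for <meta name="robots" content="...noindex...">
    # in the raw HTML (assumes name appears before content), plus any
    # noindex directive delivered via the X-Robots-Tag response header.
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if meta and "noindex" in meta.group(1).lower():
        return True
    return "noindex" in x_robots_tag.lower()
```

Remember that noindex can arrive via the HTTP header alone, which is why the function accepts the `X-Robots-Tag` value as a second argument.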

Page Indexability Checker

Check whether a URL is indexed and free of noindex signals.


How RAG architecture works

RAG systems operate in two distinct phases. In the retrieval phase, the system uses the user's query to search a pre-built vector index of text passages and retrieve the most semantically relevant ones. In the generation phase, those passages are fed as context to a language model, which synthesizes a natural-language response and attributes specific claims back to their source passages.

The critical insight: your content is not retrieved as a page - it is retrieved as passages. A page with 2,000 words may be split into 8-12 passages of 150-300 words each. The RAG system scores each passage separately against the query. Only the highest-scoring passages are included as context for the generated response. A page can be indexed and high-ranking but have zero high-scoring passages if the content is structured in ways that resist extraction.
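To make passage-level scoring concrete, here is a toy sketch. The chunking mirrors the greedy paragraph-packing described above; the scoring uses simple term overlap purely for illustration - production RAG systems score passages with embedding similarity, not term counts:

```python
def chunk_passages(text: str, max_words: int = 300) -> list[str]:
    # Split on blank lines, then greedily pack paragraphs into passages
    # of at most max_words words, mirroring how RAG pipelines chunk pages.
    passages, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            passages.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        passages.append("\n\n".join(current))
    return passages

def score(passage: str, query: str) -> float:
    # Toy relevance score: fraction of query terms present in the passage.
    p = set(passage.lower().split())
    q = set(query.lower().split())
    return len(p & q) / len(q)
```

The point of the sketch: each passage gets its own score, so a 2,000-word page lives or dies passage by passage, not as a whole.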

AI crawler user-agents and robots.txt

Each major AI platform checks for its own token in robots.txt. GPTBot (OpenAI) crawls for ChatGPT citation and training. ClaudeBot (Anthropic) crawls for Claude's retrieval system. PerplexityBot (Perplexity) crawls for Perplexity's real-time search index. Google-Extended (Google) is not a separate crawler but a robots.txt control token, honored during normal Googlebot crawling, that governs whether your content can be used for Gemini and Vertex AI grounding; AI Overviews eligibility follows standard Search indexing instead.
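An illustrative robots.txt fragment that addresses each token explicitly (note the Google token is `Google-Extended`, not `Googlebot-Extended`):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Control token, not a crawler: disallowing it withholds content from
# Gemini grounding without affecting Google Search.
User-agent: Google-Extended
Allow: /
```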

A robots.txt configured to block bots by default - a catch-all `User-agent: *` group with `Disallow: /`, paired with an explicit allow group for Googlebot - blocks all of these simultaneously. Many sites have this configuration unintentionally, introduced when site administrators tried to prevent scraping after OpenAI published GPTBot's user-agent string in 2023. The block prevents both training-data collection and citation-index crawling - two separate functions that use the same user-agent.

The right approach: audit your robots.txt for rules that affect GPTBot, ClaudeBot, PerplexityBot, and Google-Extended. Decide consciously whether to allow or disallow each. The default should be to allow unless you have a specific reason not to - a page with confidential commercial information, for example.
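That audit can be sketched with Python's built-in robots.txt parser. The sample rules below reproduce the "block everything except Googlebot" misconfiguration described above; in practice you would load your live robots.txt instead:

```python
import urllib.robotparser

# Illustrative robots.txt: blocks all bots by default, allows only Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Check each AI-relevant token against a representative URL.
for ua in ["Googlebot", "GPTBot", "ClaudeBot", "PerplexityBot"]:
    allowed = rp.can_fetch(ua, "https://example.com/article")
    print(f"{ua}: {'allowed' if allowed else 'BLOCKED'}")
```

With these rules, only Googlebot is allowed; every AI crawler falls through to the catch-all `Disallow: /` group.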

JavaScript rendering and AI crawlers

Most AI crawlers do not execute JavaScript. Content that is injected client-side after initial HTML load is invisible to these systems. A single-page application (SPA) that renders its main content via React or Vue after a JS bootstrap will appear as a nearly empty page to non-JS crawlers. The audit move: use the crawlability checker to view the raw HTML response and confirm your key content paragraphs appear in the initial HTML, not after JS execution.
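A minimal sketch of that audit, using a hypothetical `raw_html_contains` helper: it fetches only the initial HTML response, with no JavaScript execution, which is exactly what a non-JS crawler sees:

```python
import urllib.request

def contains_phrase(html: str, phrase: str) -> bool:
    # Case-insensitive substring check against raw markup.
    return phrase.lower() in html.lower()

def raw_html_contains(url: str, phrase: str, user_agent: str = "GPTBot") -> bool:
    # Fetch only the initial HTML response - no JavaScript is executed.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return contains_phrase(html, phrase)

# Example (live network call, uncomment to run):
# print(raw_html_contains("https://example.com/article", "your key paragraph"))
```

If this returns False for a paragraph you can see in the browser, that paragraph is being injected client-side and is invisible to non-JS crawlers.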

The biggest mistake: assuming Googlebot access means AI access

Googlebot and AI crawlers are distinct systems with separate user-agent strings and separate permissions. A site can be fully indexed by Google while blocking every AI retrieval crawler. The two most common causes are: (1) a robots.txt rule added specifically to block GPTBot in 2023 that was never revisited, and (2) a CDN-level bot protection rule (common in Cloudflare configurations) that blocks non-browser user-agents as an anti-scraping measure.

The CDN rule scenario is particularly easy to miss because robots.txt checking tools won't catch it - the block happens at the HTTP layer, before robots.txt is ever consulted. The symptom: the page is crawlable by Googlebot and passes robots.txt, but is never cited by any AI platform. The fix: review bot protection rules in the CDN configuration and explicitly whitelist AI crawler user-agents.
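One way to sketch that review, with hypothetical helper names: fetch the same URL with a browser user-agent and with a crawler user-agent, and treat a status mismatch as a sign of CDN-level blocking:

```python
import urllib.request, urllib.error

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

def status_for(url: str, user_agent: str) -> int:
    # Return the HTTP status the server sends to this user-agent.
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

def cdn_block_suspected(browser_status: int, crawler_status: int) -> bool:
    # A browser UA succeeding while a crawler UA gets 403/429/503 points to
    # a CDN/WAF-layer block rather than robots.txt.
    return 200 <= browser_status < 300 and crawler_status in (403, 429, 503)

# Example (live network calls, uncomment to run):
# b = status_for("https://example.com/article", BROWSER_UA)
# c = status_for("https://example.com/article", "GPTBot")
# print("CDN block suspected:", cdn_block_suspected(b, c))
```

Some CDNs fingerprint more than the user-agent header, so a clean result here does not fully rule out a block - but a mismatch is strong evidence of one.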

What clean AI retrieval readiness looks like

  1. Open your robots.txt and search for each AI-relevant token: GPTBot, ClaudeBot, PerplexityBot, Google-Extended. Confirm the rules are intentional.
  2. Use the Page Crawlability Checker above on your 5 most important informational pages. Confirm all return accessible status.
  3. Use the Page Indexability Checker on the same pages. Confirm each is indexed with no noindex signals.
  4. View the raw HTML source of one page (right-click, View Source). Confirm your main content paragraphs appear in the HTML, not just in JS-rendered DOM.
  5. Check your CDN bot protection configuration. Explicitly whitelist GPTBot, ClaudeBot, and PerplexityBot if generic bot blocking rules are in place.
  6. Query Perplexity with a question your best-indexed page should answer. If it is not cited despite passing the steps above, the issue is likely passage-level structure rather than access (see the next article in this pillar).
Checklist

AI retrieval and crawlability DOs & DON'Ts

DO

  • Allow AI crawlers by default and disallow only where you have a specific reason

    GPTBot, ClaudeBot, PerplexityBot, and Google-Extended should be allowed unless you have a conscious business reason to exclude a specific platform. The default should be open.

  • Verify your robots.txt allows each AI crawler separately

    Each AI crawler uses a different user-agent string. A rule targeting one doesn't automatically apply to others. Test each against your robots.txt before assuming all are allowed.

  • Structure content in short, self-contained paragraphs

    RAG systems chunk pages into passages of roughly 150-300 words. Each paragraph should stand alone as a complete thought - not require the preceding paragraph for context.

  • Use semantic HTML that helps AI chunking

    H2 and H3 tags act as natural passage boundaries for AI chunking systems. A well-structured heading hierarchy helps AI systems identify where one topic ends and another begins.

  • Ensure your content is fully rendered server-side or as static HTML

    Most AI crawlers do not execute JavaScript. Content that relies on client-side rendering to appear is invisible to these crawlers regardless of your robots.txt permissions.

DON'T

  • Don't assume Googlebot access means AI crawler access

    Googlebot and AI crawlers (GPTBot, ClaudeBot) are distinct. A site can be fully indexed by Google while blocking every AI retrieval crawler.

  • Don't hide key content behind login walls expecting AI citation

    Gated content is invisible to AI crawlers. If you want specific content to be cited by AI systems, it must be publicly accessible.

  • Don't use JavaScript to dynamically inject important textual content

    If the key paragraph of a page is injected by a JS framework after initial HTML render, AI crawlers that don't execute JS will never see it.

  • Don't ignore noindex signals on pages you want cited

    A noindex tag prevents search engines from indexing a page - and most AI systems use established search indexes as a discovery layer. A noindexed page is unlikely to be in any AI retrieval system's knowledge base.

  • Don't use aggressive bot-blocking CDN rules without reviewing AI crawler coverage

    Cloudflare bot protection, firewall rules, and rate-limiting configurations that block automated crawlers may silently block AI systems. Review these configurations with AI crawler user-agent strings in mind.


