AI Crawler Access: How to Control GPTBot, ClaudeBot, and Google-Extended

Learn how to configure robots.txt for GPTBot, ClaudeBot, Google-Extended, and PerplexityBot to control AI training data and maximise citation visibility.

AI crawler access is controlled by your robots.txt file. The major AI companies, including OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and Perplexity (PerplexityBot), all state that they honour robots.txt directives when crawling your site for training data and real-time retrieval. Blocking a crawler keeps your content out of that company’s future training crawls and, in the case of retrieval-based engines like Perplexity, out of its live search results too.

The decision matters more than most site owners realise. Allowing GPTBot means your content can influence ChatGPT’s training data. Allowing PerplexityBot means your pages can appear as cited sources when users search. Blocking Google-Extended does not hurt your Google Search rankings, but it does reduce your chances of appearing in Gemini responses. Each crawler has a distinct impact on where your brand shows up, so the right configuration depends on your visibility goals, not a blanket block or allow.

This page covers every major AI crawler, the exact robots.txt syntax to control each one, and how your choices affect both Google Search rankings and AI citation rates across ChatGPT, Perplexity, Gemini, and Copilot.

What AI crawlers exist and what do they do

There are two types of AI crawlers, and the distinction changes how you handle them. Training crawlers (GPTBot, ClaudeBot, Google-Extended) collect content to build or refine a language model. Retrieval crawlers (PerplexityBot, ChatGPT-User) fetch content in real time to answer user queries right now. Blocking a training crawler affects future model behaviour. Blocking a retrieval crawler cuts off live citations immediately.

Here are the crawlers you need to know:

Crawler | Operator | Type | User Agent Token
GPTBot | OpenAI | Training | GPTBot
ChatGPT-User | OpenAI | Retrieval (live) | ChatGPT-User
ClaudeBot | Anthropic | Training | ClaudeBot
Google-Extended | Google | Training | Google-Extended
PerplexityBot | Perplexity | Retrieval (live) | PerplexityBot

The full verified user agent strings from each company’s documentation:

  • GPTBot: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)
  • ChatGPT-User: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
  • ClaudeBot: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
  • PerplexityBot: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
  • Google-Extended: operates as a robots.txt product token only; it does not send a separate HTTP user agent string
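
As a rough illustration of how these strings can be used, the sketch below maps a raw user agent string to the crawler token, operator, and type from the table above. The function name identify_ai_crawler is hypothetical, and matching on a substring is an assumption about what is good enough for log triage. Google-Extended is left out because it never appears as a user agent.

```python
# Tokens, operators, and types are copied from the crawler table above.
AI_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "ChatGPT-User": ("OpenAI", "retrieval"),
    "ClaudeBot": ("Anthropic", "training"),
    "PerplexityBot": ("Perplexity", "retrieval"),
}


def identify_ai_crawler(user_agent):
    """Return (token, operator, type) for a known AI crawler, else None.

    Hypothetical helper: it does a case-insensitive substring match on the
    token, which cannot detect a spoofed user agent.
    """
    ua = user_agent.lower()
    for token, (operator, kind) in AI_CRAWLERS.items():
        if token.lower() in ua:
            return token, operator, kind
    return None


gptbot_ua = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.3; +https://openai.com/gptbot)"
)
print(identify_ai_crawler(gptbot_ua))
print(identify_ai_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

A match only tells you what the request claims to be. Confirming that it really came from the operator needs an IP check as well.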

How AI crawler access affects Google Search rankings and AI citations

Allowing or blocking AI crawlers has different effects on your two visibility surfaces. For traditional Google Search rankings, the only crawler that matters is Googlebot. According to Google’s developer documentation, blocking Google-Extended does not impact your inclusion in Google Search results and is not used as a ranking signal. You can block every AI training crawler without touching a single Search position.

The AI citation surface is different. When PerplexityBot can access your pages, Perplexity’s retrieval system can pull from them live. When GPTBot and ClaudeBot access your site, your content feeds into the training data that shapes how ChatGPT and Claude respond to queries, including whether they mention your brand. The mechanism is indirect for training crawlers (the model learns from your content rather than quoting it live), so any effect on brand familiarity builds up gradually across training cycles.

Google-Extended sits in its own category. Blocking it prevents your content from being used to train Gemini models and the Vertex AI generative API. It does not block Google AI Overviews, which are powered by Googlebot crawls. If your goal is to appear in AI Overviews, blocking Google-Extended does nothing to hurt or help that. Standard search indexing and content quality are what matter there.

For brands trying to get cited by AI engines, the practical implication is this: allow retrieval crawlers by default unless you have a specific legal or competitive reason not to. Training crawlers are a judgment call based on your data licensing stance.

robots.txt syntax for every major AI crawler

Each AI crawler checks your robots.txt at the root of your domain (e.g. https://yourdomain.com/robots.txt) before crawling. According to Google’s robots.txt documentation, the User-agent field is case-insensitive, but path values are case-sensitive.
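
The case rules can be checked locally. The sketch below uses Python’s standard-library urllib.robotparser as a stand-in for a crawler’s own parser, which is an assumption: each operator runs its own implementation. The /Private/ path is a made-up example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: GPTBot",
    "Disallow: /Private/",  # made-up path, for illustration only
])

# The user agent token matches regardless of case, so "gptbot" is blocked.
agent_case_blocked = not parser.can_fetch(
    "gptbot", "https://yourdomain.com/Private/report"
)

# Paths are case-sensitive, so /private/ is not covered by /Private/.
path_case_allowed = parser.can_fetch(
    "GPTBot", "https://yourdomain.com/private/report"
)

print(agent_case_blocked, path_case_allowed)
```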

Allow all AI crawlers (or let a specific one through):

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

Block a single crawler entirely:

User-agent: GPTBot
Disallow: /

Block all major AI crawlers at once:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

Allow AI crawlers to access most paths but protect sensitive areas:

User-agent: GPTBot
Disallow: /api/
Disallow: /account/
Allow: /
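
Before publishing a rule set like this, a quick local check can confirm it does what you expect. This is a minimal sketch using Python’s standard-library urllib.robotparser; the helper name is_allowed and the test URLs are assumptions. Python’s parser applies rules in file order, whereas Google documents a most-specific-match rule. For this rule set both approaches give the same answers.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: GPTBot
Disallow: /api/
Disallow: /account/
Allow: /
"""


def is_allowed(robots_txt, agent, url):
    """Hypothetical helper: parse robots.txt text and test one agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Protected paths should be blocked, everything else allowed.
print(is_allowed(RULES, "GPTBot", "https://yourdomain.com/api/users"))
print(is_allowed(RULES, "GPTBot", "https://yourdomain.com/account/settings"))
print(is_allowed(RULES, "GPTBot", "https://yourdomain.com/blog/post"))
```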

One important note on Google-Extended: because it does not have a separate HTTP user agent string (it uses existing Google crawler infrastructure), you cannot block it via server-side header filtering. The only mechanism that works is the robots.txt product token shown above. This is confirmed in Google’s own developer documentation.

The case for allowing AI crawler access

The strongest argument for keeping AI crawlers enabled is citation opportunity. PerplexityBot and ChatGPT-User are retrieval crawlers: when a user asks a question and your page is the best answer, these crawlers pull your content into the response. Block them and you cannot be cited, no matter how good your content is.

Training crawlers are less direct. GPTBot collects data to improve future ChatGPT models. Allowing it does not guarantee your brand gets mentioned in responses, but blocking it means your content is absent from future training cycles, which can affect how the model represents your category and your brand over time.

From a generative engine optimisation perspective, allowing all five major AI crawlers is the default recommendation for most businesses. Exceptions include:

  • Sites with sensitive, legally protected, or proprietary content (medical protocols, financial models, legal briefs)
  • Publishers who are pursuing or have reached licensing agreements with AI companies and do not want to give free access in the meantime
  • Sites with significant user-generated content where contributor terms do not permit sublicensing to AI training datasets

Verifying your robots.txt configuration is working

Publishing a robots.txt rule does not guarantee compliance. It relies on crawlers respecting it. The major AI companies publicly commit to honouring these directives. Anthropic publishes its ClaudeBot IP addresses at https://claude.com/crawling/bots.json for independent verification.

Steps to verify your setup:

  1. Visit https://yourdomain.com/robots.txt in a browser to confirm your rules are live and correctly formatted.
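  2. Use Google Search Console’s robots.txt report to confirm Google has fetched your current file without errors (the older robots.txt Tester has been retired). Then check your rules with a robots.txt parser or testing tool, using Googlebot and Google-Extended as user agents against a representative URL, to confirm you are not accidentally blocking Googlebot along with Google-Extended.
  3. Filter your server access logs by the user agent strings listed above. Google-Extended will not appear, because it has no user agent string of its own. Continued requests well after a Disallow directive was published suggest the crawler is not complying, although crawlers may cache robots.txt for a while and user agent strings can be spoofed.
  4. Cross-reference ClaudeBot visits against Anthropic’s published IP list to confirm the requests genuinely come from Anthropic before concluding that the crawler is ignoring your rules.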
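
The log-filtering step can be sketched as follows. It assumes your server writes the common "combined" access log format, where the user agent is the last double-quoted field. The sample lines are invented for illustration and use documentation IP addresses. The function name count_ai_hits is hypothetical.

```python
import re
from collections import Counter

# Google-Extended is omitted: it never appears as a user agent in logs.
AI_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]

# In the combined log format the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI crawler token found in the user agent field."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Invented sample lines, not real traffic.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
    '192.0.2.10 - - [01/Jan/2025:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [01/Jan/2025:10:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '203.0.113.5 - - [01/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_ai_hits(SAMPLE_LOG)
for token, count in hits.most_common():
    print(token, count)
```

In practice you would pass the lines of your real log file instead of SAMPLE_LOG and compare the counts before and after the date you published the Disallow rule.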

Tracking your AI visibility over time is the best way to confirm whether your access decisions are actually affecting citation rates. A platform like Fokal runs queries across ChatGPT, Perplexity, and Gemini to show you whether your brand appears, and when your robots.txt changes translate into real citation shifts.

AI crawler access and the dual Google + AI citation opportunity

Allowing AI crawlers is one input into citation rates, but it is not sufficient on its own. Perplexity and ChatGPT live retrieval cite sources based on relevance and authority, not just access. A page that is accessible but thin, slow, or unstructured will lose to a competitor whose page is also accessible but far better.

The factors that drive AI search ranking alongside crawler access include clear direct answers near the top of the page (which AI systems extract as snippets), structured content with proper heading hierarchy, and external signals establishing topical authority. These are the same factors that improve rankings in traditional search. That overlap matters because Bing powers Microsoft Copilot’s web retrieval, so strong search presence across both Google and Bing feeds into AI citation rates across the major AI engines.

The practical sequence for brands building AI search visibility: audit your robots.txt first to remove unintended blocks, then build content quality to improve citation rates. Access is the floor. Content quality is the ceiling. Both matter, and fixing access is faster.

The AI SEO hub covers the full picture, from technical access controls like this page to content frameworks, schema markup, and engine-specific optimisation for ChatGPT, Perplexity, Gemini, and Copilot. Fokal monitors your brand across all major AI engines and surfaces the specific gaps worth closing.

How to audit and fix AI crawler access: step-by-step

Follow these steps to audit your current configuration and close any unintended blocks.

Step 1: Check your current robots.txt

Open a browser and navigate to https://yourdomain.com/robots.txt. Look for any User-agent: * group containing a broad Disallow: / rule, which blocks every crawler that has no more specific group of its own, AI bots included. Also look for explicit blocks on GPTBot, ClaudeBot, Google-Extended, PerplexityBot, or ChatGPT-User.
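
For larger files, the same check can be scripted. The sketch below is a simplified reading of robots.txt grouping, not a full parser: it assumes a crawler with its own group ignores the wildcard group, and it only flags a full-site block (Disallow: / with no Allow lines). The function names are hypothetical.

```python
AI_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Google-Extended", "PerplexityBot"]


def parse_groups(robots_txt):
    """Split robots.txt text into (agents, rules) groups."""
    groups, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line has already closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def blocks_everything(rules):
    """True for a full-site block: Disallow: / with no Allow lines."""
    return ("disallow", "/") in rules and not any(f == "allow" for f, _ in rules)


def find_full_blocks(robots_txt):
    """Map each AI token to 'explicit', 'wildcard', or None."""
    groups = parse_groups(robots_txt)
    report = {}
    for token in AI_TOKENS:
        own = [r for a, r in groups if token.lower() in a]
        wild = [r for a, r in groups if "*" in a]
        if own:
            report[token] = "explicit" if any(map(blocks_everything, own)) else None
        elif any(map(blocks_everything, wild)):
            report[token] = "wildcard"
        else:
            report[token] = None
    return report


SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""
report = find_full_blocks(SAMPLE)
for token, reason in report.items():
    print(token, reason)
```

In the sample, GPTBot has its own group and stays allowed, while the other four tokens are caught by the wildcard block. That is the kind of unintended block this step is meant to surface.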

Step 2: Identify what you actually want to control

Separate your decisions by crawler type. For retrieval crawlers (PerplexityBot, ChatGPT-User), allowing access directly affects live citation rates. For training crawlers (GPTBot, ClaudeBot, Google-Extended), the impact is on future model training. Decide on each independently rather than applying one rule to all.

Step 3: Edit your robots.txt file

Add explicit User-agent blocks for each AI crawler you want to control. If you want to allow all by default and only block certain paths, use Allow: / combined with specific Disallow: entries for sensitive directories. If your CMS manages your robots.txt (WordPress, Shopify, Webflow), update it through the platform’s SEO settings panel rather than editing the file directly.
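
If you maintain the file by hand or through a deploy script, generating the groups from a single policy table can keep them consistent. This is a minimal sketch. The policy structure and the function name render_robots_txt are assumptions, and the example policy is illustrative rather than a recommendation.

```python
def render_robots_txt(policy):
    """Render robots.txt groups from a policy mapping.

    policy maps a user agent token to a list of disallowed path prefixes.
    An empty list allows everything; ["/"] blocks the crawler entirely.
    """
    blocks = []
    for agent, disallowed in policy.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in disallowed]
        if "/" not in disallowed:
            # Anything not listed above stays crawlable.
            lines.append("Allow: /")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Illustrative policy only.
policy = {
    "GPTBot": ["/api/", "/account/"],
    "PerplexityBot": [],
    "Google-Extended": ["/"],
}
robots_txt = render_robots_txt(policy)
print(robots_txt)
```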

Step 4: Verify with Google Search Console

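Use the robots.txt report in Google Search Console to confirm Google has fetched your updated file and parsed it without errors (it replaces the retired robots.txt Tester). Then test your rules against both tokens with a robots.txt parser: Googlebot should remain allowed so that search crawling is unaffected, and Google-Extended should be disallowed if blocking AI training is your intent.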
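
A local check of the two tokens might look like the sketch below. It uses Python’s standard-library urllib.robotparser as an approximation of how Google evaluates the file. The rule set, the URL, and the helper name check_tokens are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rule set: block AI training tokens, leave search alone.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def check_tokens(robots_txt, url, tokens):
    """Hypothetical helper: report whether each token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, url) for token in tokens}


results = check_tokens(
    ROBOTS_TXT,
    "https://yourdomain.com/pricing",
    ["Googlebot", "Google-Extended", "GPTBot"],
)
for token, allowed in results.items():
    print(token, "allowed" if allowed else "blocked")
```

Googlebot has no matching group here, so it stays allowed while both training tokens are blocked.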

Step 5: Monitor access logs for compliance

After publishing your updated robots.txt, check your access logs over the following weeks. Filter on the user agent strings for each crawler. If you see traffic from a crawler you have blocked, cross-reference the source IP against the operator’s published list (Anthropic’s list is at https://claude.com/crawling/bots.json) to tell genuine crawler traffic from requests that merely spoof the user agent.
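
The IP cross-check can be sketched with Python’s standard-library ipaddress module. The structure of the published JSON file is not described here, so the sketch assumes you have already extracted the ranges into a list of CIDR strings. The ranges shown are reserved documentation blocks used as placeholders, not Anthropic’s real addresses.

```python
import ipaddress


def ip_in_published_ranges(ip, cidr_ranges):
    """Return True if the IP falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(
        address in ipaddress.ip_network(cidr, strict=False)
        for cidr in cidr_ranges
    )


# Placeholder ranges from the reserved documentation blocks.
# Replace them with the ranges taken from the operator's published list.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/24"]

print(ip_in_published_ranges("192.0.2.44", EXAMPLE_RANGES))
print(ip_in_published_ranges("203.0.113.9", EXAMPLE_RANGES))
```

A request that carries a crawler’s user agent but comes from outside the published ranges is more likely a spoof than a compliance failure.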

Step 6: Track citation changes over time

Set up AI visibility tracking for your target queries across ChatGPT, Perplexity, and Gemini. Changes to your robots.txt take effect as crawlers revisit your site, which can take days to weeks. Monitoring citation rates before and after gives you a measurable signal of whether the configuration is working.
