Should I block GPTBot?

Only if you have a strong reason. Blocking GPTBot removes your content from future OpenAI model training. If you want your brand represented in base-model responses over the long term, keep GPTBot allowed, at least on your educational content.

What is the difference between GPTBot and ChatGPT-User?

GPTBot crawls for training. ChatGPT-User fetches pages at query time when a user clicks a browse action or the model retrieves live data. Blocking ChatGPT-User removes you from live ChatGPT citations.

Does Google-Extended affect my regular Google rankings?

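No. Google-Extended is a robots.txt control token that governs whether your content is used for Gemini model training and grounding. It is not a ranking signal, and Googlebot and Googlebot-Image are separate crawlers, so you can block Google-Extended and still rank normally in search. AI Overviews are part of Search and, per Google's documentation, follow Googlebot and the standard snippet controls rather than Google-Extended.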

Is noindex the same as blocking AI crawlers?

No. Noindex stops a URL from appearing in search results, but it does not prevent the bot from reading the page and using it for training or retrieval. Use robots.txt and X-Robots-Tag for AI-specific control.

How often should I update my robots.txt for new AI crawlers?

Review quarterly. New bots (like ClaudeBot and Applebot-Extended) show up every few months. Keep a short list of known AI crawler user agents and add rules proactively rather than reactively.



Robots.txt for AI: What to Block and What to Allow

How do you configure robots.txt to keep AI retrieval bots like PerplexityBot active while limiting training crawlers such as GPTBot and CCBot?

James Calloway·April 18, 2026


The default advice on AI crawlers, block everything until you figure it out, is wrong. Follow that playbook and you also remove your brand from the retrieval pool feeding ChatGPT citations, Perplexity results, and Gemini's grounded answers. The correct starting position is the opposite. Allow retrieval bots (PerplexityBot, ChatGPT-User, Google-Extended) on high-intent marketing content, allow training bots (GPTBot, ClaudeBot, CCBot) on educational content only if you want to influence future model weights, and block both on anything gated, proprietary, or user-submitted. Your robots.txt is a three-way policy: train, retrieve, block.

The Two Classes of AI Crawler You Need to Separate

Most robots.txt guidance treats AI bots as one category. They are not.

Training crawlers ingest your content to improve future model versions. Examples: GPTBot, ClaudeBot, CCBot, and Google-Extended for Gemini training. They read you now; you might show up in model weights six to twelve months later.
Retrieval crawlers fetch your page at query time to ground a specific answer. Examples: PerplexityBot, ChatGPT-User, Claude-Web, and Google-Extended for Gemini grounding. Block these and you vanish from cited responses immediately.

Blocking retrieval bots is almost always wrong for marketing pages. Blocking training bots is a defensible choice with tradeoffs.

The Allowlist That Works for Most Brands

Start with the stance that you want maximum retrieval visibility and selective training exposure. That translates into a specific pattern.

```
User-agent: PerplexityBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: ClaudeBot
Allow: /blog/
Allow: /guides/
Allow: /resources/
Disallow: /

User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Allow: /resources/
Disallow: /

User-agent: CCBot
Disallow: /
```

Retrieval bots get the whole site. Training bots see only educational content, not pricing, dashboards, or proprietary surfaces. CCBot is blocked because Common Crawl feeds almost every open-weight model and you usually want explicit deals, not blanket ingestion.
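One way to sanity-check a policy like this before deploying it is to parse it locally. The sketch below uses Python's standard `urllib.robotparser` on a trimmed copy of the allowlist; the sample paths (`/blog/post`, `/pricing`) are illustrative assumptions, not paths from any real site. One caveat: this parser applies the first matching rule in file order, while crawlers following RFC 9309 pick the longest matching path. The two agree here only because the `Allow` lines come before `Disallow: /`.

```python
from urllib.robotparser import RobotFileParser

# The allowlist pattern from this article, trimmed to three groups for brevity.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Allow: /resources/
Disallow: /

User-agent: CCBot
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical sample paths: one educational URL and one commercial URL.
for agent in ("PerplexityBot", "GPTBot", "CCBot"):
    for path in ("/blog/post", "/pricing"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
        print(f"{agent:14} {path:12} {verdict}")
```

A check like this catches ordering and typo mistakes, but it only tells you what a compliant parser would conclude. It says nothing about whether a given crawler actually obeys the file.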

The Tradeoff Most Brands Miss

Diagram showing two AI crawler paths, training on the left feeding future model weights and retrieval on the right grounding live answers, with a central site deciding which lane to allow

The diagram above is the real decision. Training exposure bets on being cited in base-model responses in the future. Retrieval exposure bets on citations right now, on every query. Most marketing teams benefit more from the retrieval lane yet block both out of caution.

If you care about AI visibility this quarter, retrieval is the priority. Blocking it to "protect content" is usually a mistake dressed up as policy.

What to Actually Block

Some parts of your site should be off limits to any AI crawler.

Authenticated and gated surfaces: `/app/`, `/dashboard/`, `/account/`, `/checkout/`.
Proprietary documentation you sell access to. If you charge for the docs, do not let GPTBot train on them.
User-generated content where you do not own the rights.
Staging and preview URLs. Pre-production content leaking into training data is a compliance risk.
PII-containing URLs. Account pages, invoices, support tickets.

Our guide on how AI models crawl web content covers how these bots behave once you let them in.

Verify the Bot Before You Trust the User Agent

User agent strings are trivial to spoof. Several vendors, including OpenAI and Perplexity, publish IP ranges for their crawlers; check each vendor's current documentation to confirm what it publishes. Validate inbound requests against the published ranges at the edge (Cloudflare, Fastly, or a WAF rule) before serving content. Treat any claim of being GPTBot from an unverified IP as a scraper.
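The verification logic itself is small. Here is a minimal sketch using Python's standard `ipaddress` module. The CIDR ranges are placeholders taken from the reserved documentation blocks, not real crawler ranges, and the `classify` helper and its return labels are hypothetical names chosen for illustration. In practice you would load the lists each vendor publishes and run the check at the edge.

```python
import ipaddress
from typing import Optional

# PLACEHOLDER ranges from the reserved documentation blocks (RFC 5737).
# These are NOT real crawler ranges; replace them with the vendor-published lists.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}


def claimed_bot(user_agent: str) -> Optional[str]:
    """Return the known bot token found in the User-Agent string, if any."""
    ua = user_agent.lower()
    for token in PUBLISHED_RANGES:
        if token.lower() in ua:
            return token
    return None


def classify(user_agent: str, remote_ip: str) -> str:
    """Label a request as 'verified', 'spoofed', or 'not-a-known-bot'."""
    token = claimed_bot(user_agent)
    if token is None:
        return "not-a-known-bot"
    ip = ipaddress.ip_address(remote_ip)
    networks = (ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES[token])
    return "verified" if any(ip in net for net in networks) else "spoofed"


print(classify("Mozilla/5.0 (compatible; GPTBot/1.1)", "192.0.2.10"))
print(classify("Mozilla/5.0 (compatible; GPTBot/1.1)", "203.0.113.5"))
```

A request that claims a bot name from outside its listed ranges is the case to treat as a scraper. Published ranges change, so the lists should be refreshed on a schedule rather than hard-coded.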

Combine Robots.txt with the `X-Robots-Tag` Header

Robots.txt is coarse. For URL-level granularity, send the `X-Robots-Tag: noai, noimageai` response header on specific pages you want excluded from AI ingestion without changing your robots policy. Note that `noai` and `noimageai` are non-standard directives, so treat the header as a signal a crawler may or may not honor rather than as enforcement. It is still a clean way to flag individual resources (a single whitepaper, a specific product page) without rewriting your entire allowlist.
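How you attach the header depends on your stack. As one possible sketch, the WSGI middleware below adds it only for an explicit set of paths. The paths in `EXCLUDED_PATHS`, the `with_ai_opt_out` wrapper, and `demo_app` are all hypothetical names for illustration. Most sites would set this at the CDN or web server instead, and whether a crawler honors the header is up to that crawler.

```python
# Hypothetical per-URL exclusions: a single whitepaper and one product page.
EXCLUDED_PATHS = frozenset({
    "/resources/whitepaper.pdf",
    "/products/example-product",
})


def with_ai_opt_out(app, excluded_paths=EXCLUDED_PATHS):
    """Wrap a WSGI app so responses for excluded paths carry X-Robots-Tag."""
    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            if environ.get("PATH_INFO", "") in excluded_paths:
                # Copy the list so the wrapped app's own headers are not mutated.
                headers = list(headers) + [("X-Robots-Tag", "noai, noimageai")]
            return start_response(status, headers, exc_info)
        return app(environ, start_with_header)
    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that returns a plain-text body for every path."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


app = with_ai_opt_out(demo_app)
```

Keeping the exclusions in one explicit set makes the per-URL policy easy to review alongside robots.txt.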

Our technical SEO service sets up both layers for clients who need per-URL control.

The Mistake That Kills AI Citations

The most common mistake is copying an old SEO example that blocks `User-agent: *` from `/blog/`. That one line hides your highest-quality content from every retrieval bot that does not have its own `User-agent` group, because a crawler with no named group falls back to the `*` rules. If AI citation rate has fallen with no obvious cause, check the blog directory rule first.

Then curl a few URLs with a bot user agent. If a PerplexityBot fetch returns your blog posts cleanly, you are in the retrieval pool.
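If you prefer a script to a one-off curl call, the same check can be sketched with Python's standard `urllib.request`. The user-agent string below is a simplified assumption (check Perplexity's documentation for the exact string), and `build_bot_request` and `fetch_status` are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Simplified, ASSUMED user-agent string; confirm the real one in vendor docs.
BOT_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"


def build_bot_request(url: str, user_agent: str = BOT_UA) -> urllib.request.Request:
    """Build a GET request that presents the given bot user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str, user_agent: str = BOT_UA, timeout: float = 10.0) -> int:
    """Fetch the URL as the bot and return the HTTP status code."""
    request = build_bot_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 403 and 429 responses here usually point at a WAF or CDN rule.
        return err.code
```

Calling `fetch_status` on a handful of blog URLs and comparing the result with a normal browser user agent shows whether the server or CDN treats the bot differently. It does not consult robots.txt, and if you verify bots by IP at the edge, a test from your own machine should be rejected as spoofed, so read the result with that in mind.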

Where to Start This Week

Audit your robots.txt against the allowlist pattern above
Decide explicitly: retrieval yes or no, training yes or no, per URL class
Add IP-based verification for any bot you allow by user agent
Fetch your top ten landing pages as PerplexityBot and confirm they render
Run the rest of the technical GEO checklist to make sure crawler access is paired with renderable, citable content
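The explicit per-bot, per-URL-class decision in the list above can be kept as data and rendered into robots.txt, so the policy is reviewed in one place. This is a minimal sketch: the `POLICY` table mirrors part of the allowlist pattern from this article, and `render_robots` is a hypothetical helper, not a standard tool.

```python
# Hypothetical policy table: one entry per crawler, with allowed paths listed first.
POLICY = {
    "PerplexityBot": {"allow": ["/"], "disallow": []},
    "GPTBot": {"allow": ["/blog/", "/guides/", "/resources/"], "disallow": ["/"]},
    "CCBot": {"allow": [], "disallow": ["/"]},
}


def render_robots(policy: dict) -> str:
    """Render one robots.txt group per crawler, separated by blank lines."""
    groups = []
    for agent, rules in policy.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Allow: {path}" for path in rules["allow"]]
        lines += [f"Disallow: {path}" for path in rules["disallow"]]
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


print(render_robots(POLICY))
```

Adding a newly announced crawler then becomes a one-line change to the table rather than a hand edit of the live file.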

For a second pair of eyes on your crawler policy, explore our technical SEO service.

Frequently asked questions
