AI Crawler Checker

Find out which AI search engines can see your website. Enter your domain to check whether GPTBot, PerplexityBot, Google-Extended, ClaudeBot, and other AI bots are allowed or blocked by your robots.txt.

How to use the AI Crawler Checker

Checking your AI bot visibility takes under a minute. Enter your root domain, click Analyze, and the tool parses your robots.txt in real time and maps every directive against a curated database of AI crawler user-agents.

  1. Enter your domain. Type your root domain (e.g. example.com) into the input field. No need for https:// or a path — just the bare domain.
  2. Click Analyze. The tool fetches your robots.txt and processes each User-agent block, handling wildcards, specific agents, and Disallow/Allow directives with crawl-delay awareness.
  3. Review your results. The report lists every major AI bot with a clear status — Allowed (green), Blocked (red), or Partially restricted (amber) — and shows the exact directive responsible for each decision.
  4. Take action. If key crawlers are blocked, update your robots.txt to allow them, then re-run the checker to confirm your changes have taken effect. For selective access, write individual User-agent rules for each bot rather than relying on a blanket wildcard.
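The steps above can be approximated in a few lines of Python. This is a minimal sketch, not the checker's actual implementation: the bot list, the two sample paths, and the `check_ai_bots` helper are assumptions made for illustration, and the parsing is delegated to the standard library's `urllib.robotparser`, which may resolve overlapping Allow/Disallow rules differently from individual crawlers.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only; the checker's real crawler database is not shown here.
AI_BOTS = ["GPTBot", "ChatGPT-User", "PerplexityBot",
           "Google-Extended", "ClaudeBot", "CCBot"]


def check_ai_bots(robots_txt, sample_paths=("/", "/blog/")):
    """Classify each bot by testing a few sample paths (an assumed heuristic)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report = {}
    for bot in AI_BOTS:
        allowed = [parser.can_fetch(bot, path) for path in sample_paths]
        if all(allowed):
            report[bot] = "Allowed"
        elif not any(allowed):
            report[bot] = "Blocked"
        else:
            report[bot] = "Partially restricted"
    return report


# Hypothetical robots.txt used only to exercise the function offline.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /blog/

User-agent: *
Disallow:
"""

report = check_ai_bots(SAMPLE)
for bot, status in report.items():
    print(f"{bot}: {status}")
```

The sample is parsed from a string so the sketch runs without network access. To check a live site, `RobotFileParser` also provides `set_url()` and `read()` for fetching the file directly.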

What is robots.txt and why it matters for AI visibility

The robots.txt file is a plain-text document hosted at the root of your domain (e.g. https://example.com/robots.txt). It follows the Robots Exclusion Protocol, a decades-old standard that lets website owners tell compliant crawlers which pages or directories they may and may not fetch. Most search engines and data scrapers check this file before crawling any other page on your domain.

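The file is structured as a series of blocks, each beginning with a User-agent line that names the bot the rules apply to, followed by one or more Disallow or Allow directives. A wildcard (User-agent: *) applies to any bot not covered by a more specific block. Under the current standard (RFC 9309), a crawler obeys the block that names it most specifically, and within that block the rule with the longest matching path wins, with Allow preferred when an Allow and a Disallow rule match equally. Some older parsers instead apply the first matching rule in file order, so it is safest to write rules that give the same result either way.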
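To make the block structure concrete, here is a small sketch that splits a robots.txt file into per-agent groups and picks the group a given bot would use. The `parse_groups` and `rules_for` helpers and the sample file are hypothetical. The sketch matches agent names exactly and case-insensitively, and it ignores fields other than User-agent, Allow, and Disallow.

```python
def parse_groups(robots_txt):
    """Map each lowercased user-agent token to its list of (directive, path) rules."""
    groups = {}
    current_agents = []
    collecting_agents = False  # True while consecutive User-agent lines are read
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:
                current_agents = []  # a new block starts here
                collecting_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def rules_for(groups, bot):
    """Return the bot's own group if it has one, otherwise the wildcard group."""
    return groups.get(bot.lower(), groups.get("*", []))


# Hypothetical file: one wildcard block and one block shared by two bots.
SAMPLE = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

groups = parse_groups(SAMPLE)
print(rules_for(groups, "GPTBot"))         # rules from the shared block
print(rules_for(groups, "PerplexityBot"))  # falls back to the wildcard block
```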

For traditional SEO, robots.txt was primarily a tool for managing crawl budget — preventing Googlebot from wasting time on admin pages, duplicate content, or internal search result URLs. For AI visibility, it has taken on a far more consequential role: it is the primary mechanism that determines whether AI-powered search engines and answer engines can read and cite your content at all.

As of 2025, most major AI companies publish their crawler user-agent names and ask sites to use robots.txt to express their preferences. OpenAI uses GPTBot for model training and ChatGPT-User for fetches made on behalf of a user during a conversation. Google introduced Google-Extended, a robots.txt token rather than a separate crawler, to let publishers control whether their content is used for Gemini independently of core Google Search. Anthropic runs both ClaudeBot (web indexing) and Anthropic-AI. Perplexity runs PerplexityBot. Apple uses Applebot-Extended for Apple Intelligence features.

The practical implication is straightforward: if your robots.txt blocks these agents, even unintentionally through an overly broad wildcard rule, compliant crawlers will not fetch your pages, and those pages are unlikely to appear in AI-generated answers, summaries, or citations from those platforms, regardless of how authoritative, well-structured, or keyword-optimised your content is. Fixing a robots.txt misconfiguration is often the single highest-leverage GEO (generative engine optimisation) action a brand can take.

How robots.txt affects AI search engines

Traditional search engines like Google use crawled content to build an index that ranks pages in response to queries. AI search engines and answer engines go further: they use crawled content to generate direct answers, summaries, and citations within the AI's response. Being cited by ChatGPT, Perplexity, or Google AI Overviews requires your content to have been fetched, processed, and deemed relevant — none of which can happen if the crawler was turned away at the robots.txt gate.

The relationship between robots.txt and AI visibility is not always obvious because different agents serve different purposes for the same platform. For Google, Googlebot powers Google Search, including AI Overviews, which Google treats as a Search feature, while Google-Extended is a separate robots.txt token that controls whether crawled content may be used for Gemini training and grounding. A site that blocks Google-Extended but not Googlebot keeps its place in Search but opts out of Gemini, a gap that many site owners discover only when they wonder why competitors appear in Gemini responses and they do not.

For OpenAI, GPTBot is used to update training data and improve model quality, while ChatGPT-User powers the browsing feature that retrieves live information during a conversation. Blocking GPTBot may be acceptable for publishers concerned about training data use, but blocking ChatGPT-User additionally prevents ChatGPT from fetching and citing your site in real-time answers. A common configuration is to allow ChatGPT-User and block GPTBot only if you object to training use.
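The split policy described here (block the training crawler, allow the live-browsing agent) can be sketched as follows. The `build_robots_txt` helper and the `/guide` path are hypothetical; the sketch writes one whole-site rule per bot and then checks the result with the standard library's `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(policy):
    """Render one User-agent group per bot: True allows the site, False blocks it."""
    blocks = []
    for bot, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        blocks.append(f"User-agent: {bot}\n{rule}\n")
    return "\n".join(blocks)  # blank line between groups


# Assumed policy: opt out of training crawls, stay reachable for live answers.
policy = {"GPTBot": False, "ChatGPT-User": True}
robots_txt = build_robots_txt(policy)
print(robots_txt)

# Confirm the generated file behaves as intended before deploying it.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
gptbot_ok = parser.can_fetch("GPTBot", "/guide")
chatgpt_user_ok = parser.can_fetch("ChatGPT-User", "/guide")
print("GPTBot allowed:", gptbot_ok)
print("ChatGPT-User allowed:", chatgpt_user_ok)
```

Because the generated file contains no wildcard group, bots that are not listed stay unrestricted. In practice these groups would be merged into an existing robots.txt rather than replacing it.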

Common Crawl (CCBot) operates a freely available web archive that many AI companies use as a training data source. Blocking CCBot reduces the likelihood of your content appearing in models trained on Common Crawl data — a broad category that includes many open-source and commercial LLMs. Allowing it is generally recommended for brands that want maximum presence across the AI ecosystem, though the impact is harder to measure directly than allowing platform-specific bots like GPTBot or PerplexityBot.

Common robots.txt mistakes that hurt AI visibility

Most AI visibility problems caused by robots.txt are unintentional. These are the five patterns we see most often when auditing brands that are invisible in AI-generated answers:

  • Blanket wildcard Disallow. A rule like User-agent: * / Disallow: / blocks every bot that does not have its own User-agent group elsewhere in the file. AI crawlers without such a group are shut out entirely. This pattern is common on staging sites that were later promoted to production without updating the robots.txt.
  • Outdated security tool configurations. WAFs, DDoS protection layers, and security plugins sometimes auto-generate robots.txt rules that block unfamiliar user-agents. AI crawlers (especially newer ones like Applebot-Extended or PerplexityBot) are frequently caught by these catch-all deny rules.
  • Blocking the wrong Google bot. Teams that want to protect content from AI use sometimes block Googlebot, not realising that Googlebot powers Google Search while Google-Extended is the token that governs use of content for Gemini. Blocking Googlebot removes the site from search results, which harms SEO far beyond the intended AI opt-out.
  • Missing or incorrect user-agent names. The current standard (RFC 9309) makes user-agent matching case-insensitive, but not every parser follows it, so copy each name exactly as the vendor publishes it. A misspelled name, such as PerplexitBot instead of PerplexityBot, matches nothing: the rule is ignored and the bot falls through to the wildcard, either allowing a bot you wanted to block or failing to allow one you intended to permit.
  • Not updating robots.txt as new AI bots emerge. New AI crawlers appear regularly. Sites that wrote their robots.txt rules in 2023 likely don't include directives for bots that launched in 2024 and 2025. Without explicit rules, these bots fall through to the wildcard — which may block or allow them depending on your existing configuration.
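Two of the patterns above, the blanket wildcard Disallow and bots that have no rules of their own, can be detected mechanically. The sketch below is one possible approach under stated assumptions: the `audit_robots_txt` helper, the bot list, and the sample file are hypothetical. It only flags a literal Disallow: / in the wildcard group and does not evaluate Allow exceptions or path wildcards.

```python
def audit_robots_txt(robots_txt, bots):
    """Flag a blanket wildcard block and list bots with no group of their own."""
    named_agents = set()
    wildcard_blocks_site = False
    current_agents = []
    collecting_agents = False  # True while consecutive User-agent lines are read
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:
                current_agents = []
                collecting_agents = True
            current_agents.append(value.lower())
            named_agents.add(value.lower())
        elif field in ("allow", "disallow"):
            collecting_agents = False
            if field == "disallow" and value == "/" and "*" in current_agents:
                wildcard_blocks_site = True
    # Exact, case-insensitive name comparison (an assumption of this sketch).
    unlisted = [bot for bot in bots if bot.lower() not in named_agents]
    return {
        "wildcard_blocks_site": wildcard_blocks_site,
        "unlisted_bots": unlisted,
        "shut_out_by_wildcard": unlisted if wildcard_blocks_site else [],
    }


# Hypothetical leftover from a staging site that only re-opened Googlebot.
STAGING_LEFTOVER = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

result = audit_robots_txt(STAGING_LEFTOVER, ["GPTBot", "PerplexityBot", "Googlebot"])
print(result)
```

Running the audit again whenever a new crawler name is added to the bot list is a simple way to catch bots that silently fall through to the wildcard.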

Frequently asked questions

Go beyond diagnostics

These tools show you the gaps. We fix them.

Get a full AI visibility audit across ChatGPT, Perplexity, Gemini, and Google AI Overviews — or talk to our team about a hands-on engagement.