Sitemap Health Checker

Find out whether your sitemap is guiding AI crawlers to every page — or quietly hiding half your content. Enter your domain to check for broken URLs, missing lastmod dates, and structural errors that limit your AI search visibility.

How to use the Sitemap Health Checker

Running a full sitemap audit takes about 30 seconds. Enter your root domain, click Check, and the tool locates your sitemap automatically — first by reading your robots.txt for a declared Sitemap: directive, then by trying the standard /sitemap.xml path. If your site uses a sitemap index, the checker follows each child file and aggregates the results.

  1. Enter your domain. Type your root domain (e.g. example.com) without a protocol or path. The checker handles the rest.
  2. Click Check. The tool fetches and parses your sitemap XML, samples up to 50 URLs for live HTTP status testing, and calculates coverage metrics for the lastmod, changefreq, and priority elements.
  3. Review your health report. The report shows a health score, a stats grid (total URLs, healthy, redirected, broken), lastmod coverage percentage, most recent and oldest entries, and a ranked issue list with error, warning, and info severities.
  4. Fix issues and re-check. Start with errors (remove broken URLs), then warnings (add lastmod, replace redirect targets with final URLs), then informational notes. Re-run the checker after each round of fixes to confirm your health score improves.
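The parsing and coverage step can be sketched roughly as follows. This is a minimal illustration and not the checker's actual implementation: the function name summarise_sitemap, the seed parameter, and the shape of the returned dictionary are assumptions made for the example, and it handles a single urlset document rather than a sitemap index.

```python
import random
import xml.etree.ElementTree as ET

# XML namespace defined by the Sitemaps protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _child_text(entry, tag):
    """Return the stripped text of a child element, or "" if it is absent."""
    return (entry.findtext(f"sm:{tag}", default="", namespaces=NS) or "").strip()


def summarise_sitemap(xml_text, sample_size=50, seed=0):
    """Hypothetical helper: coverage percentages plus a URL sample for one <urlset>."""
    root = ET.fromstring(xml_text)
    entries = root.findall("sm:url", NS)
    total = len(entries)

    coverage = {}
    for tag in ("lastmod", "changefreq", "priority"):
        present = sum(1 for e in entries if _child_text(e, tag))
        coverage[tag] = round(100.0 * present / total, 1) if total else 0.0

    locs = [u for u in (_child_text(e, "loc") for e in entries) if u]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = locs if len(locs) <= sample_size else rng.sample(locs, sample_size)
    return {"total_urls": total, "coverage": coverage, "sample": sample}


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

report = summarise_sitemap(SAMPLE)
print(report["total_urls"], report["coverage"]["lastmod"])
```

The sampled URLs would then be requested one by one to record their HTTP status; that network step is left out here so the sketch stays self-contained.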

What is a sitemap and why it matters for AI visibility

An XML sitemap is a structured file hosted on your domain that lists the URLs you want search engines and crawlers to index, along with optional metadata about each page: when it was last modified (lastmod), how often it changes (changefreq), and its relative importance (priority). The Sitemaps protocol was introduced by Google in 2005 and jointly adopted by Google, Yahoo, and Microsoft in 2006; it is now supported by every major search crawler.

For traditional search engines, sitemaps supplement link-following by surfacing pages that might not have strong inbound links — deep product pages, new blog posts, location-specific landing pages, or content behind faceted navigation. For AI search engines, the role is identical but the stakes are higher: AI crawlers often have tighter crawl budgets than Googlebot, meaning they depend on sitemaps more heavily to prioritise which pages to visit and in what order.

The lastmod element is particularly significant for AI indexing. When a crawler sees an accurate lastmod date, it knows whether to skip a page it has already cached or to fetch it fresh because the content has changed. Without lastmod, or with lastmod values that don't reflect real edits, crawlers either waste budget re-fetching unchanged pages or, more commonly, treat your sitemap as an unreliable signal and may reduce crawl frequency across your entire domain.
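To make the skip-or-refetch decision concrete, here is a small sketch of how a crawler might compare a lastmod value with the time it last fetched the page. The function names and the choice to treat date-only values as midnight UTC are assumptions for illustration; real crawlers do not publish their exact logic.

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a lastmod string (date-only or full timestamp) into an aware datetime.

    Year-only and year-month forms of W3C Datetime are not handled here.
    """
    text = value.strip()
    if text.endswith("Z"):
        # Older Python versions reject a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def needs_refetch(lastmod, last_crawled):
    """Return True when the page should be fetched again.

    A missing lastmod gives the crawler nothing to go on, so this sketch
    falls back to re-fetching.
    """
    if lastmod is None:
        return True
    return parse_lastmod(lastmod) > last_crawled


last_crawled = datetime(2024, 5, 1, tzinfo=timezone.utc)
print(needs_refetch("2024-06-01", last_crawled))
print(needs_refetch("2024-04-01T08:00:00Z", last_crawled))
```

A static lastmod that never changes after publication would always compare as older than the last crawl, which is why the page would keep being skipped even after real edits.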

A sitemap should also be declared in your robots.txt file using a Sitemap: directive. This ensures every bot — including newer AI crawlers that might not try the standard path — can find your sitemap immediately without guessing. Sites that omit this declaration rely on crawlers discovering the sitemap by convention, which is not guaranteed and adds unnecessary latency to the indexing pipeline.
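The discovery order described above (declared Sitemap: lines first, the conventional path as a fallback) can be sketched like this. The function name discover_sitemaps is hypothetical, and the example works on robots.txt text that has already been downloaded rather than making a network request.

```python
def discover_sitemaps(robots_txt, domain):
    """Return sitemap URLs declared in robots.txt, or the conventional path.

    `robots_txt` is the file's text content; `domain` is a bare host such as
    "example.com". Both parameter names are illustrative.
    """
    declared = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            declared.append(value.strip())
    # Fall back to the standard location when nothing is declared.
    return declared or [f"https://{domain}/sitemap.xml"]


robots = """User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap_index.xml
"""
print(discover_sitemaps(robots, "example.com"))
print(discover_sitemaps("User-agent: *\nDisallow:", "example.com"))
```

Because only the first colon splits the line, the "https://" in the URL is preserved intact.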

For large sites, a sitemap index file (which references multiple child sitemaps, each limited to 50,000 URLs and 50 MB uncompressed) allows you to segment your content by type, date, or section. This structure makes it easier to submit individual sitemaps to Search Console, identify which content clusters have health issues, and update only the relevant sitemap when a specific section of your site changes.
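As a rough sketch of that segmentation, the snippet below splits a URL list into chunks that respect the 50,000-URL limit and builds an index that points at the resulting child files. The helper names and the sitemap-N.xml naming scheme are assumptions for illustration, and the 50 MB size limit is not checked here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit from the Sitemaps protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls):
    """Return sitemap index XML that references each child sitemap URL."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child_url in child_sitemap_urls:
        node = ET.SubElement(root, "sitemap")
        ET.SubElement(node, "loc").text = child_url
    return ET.tostring(root, encoding="unicode")


# Hypothetical catalogue of 120,001 product URLs.
urls = [f"https://example.com/p/{n}" for n in range(120_001)]
chunks = chunk_urls(urls)
children = [f"https://example.com/sitemap-{i + 1}.xml" for i in range(len(chunks))]
index_xml = build_sitemap_index(children)
print(len(chunks), [len(c) for c in chunks])
```

Chunking by count is the simplest split; grouping by content type or date, as described above, would replace chunk_urls with whatever grouping matches the site's sections.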

How sitemaps affect AI search engine indexing

Generative Engine Optimization (GEO) depends on AI engines being able to read, evaluate, and cite your content in generated answers. That process begins with discovery — and sitemaps are the primary discovery mechanism for pages that aren't prominently linked from a homepage or category page. A healthy sitemap is therefore the foundation of any GEO strategy, not an afterthought.

AI crawlers such as GPTBot (OpenAI), PerplexityBot (Perplexity AI), and ClaudeBot (Anthropic) operate with limited crawl budgets. Google-Extended is often listed alongside them, but it is a robots.txt control token that governs whether content crawled by Google may be used for Gemini, not a separate crawler. Unlike traditional search crawlers, which have spent decades refining their heuristics for estimating a site's importance, many AI crawlers are newer and may rely more heavily on explicit sitemap signals to allocate their budget effectively. A sitemap with high lastmod coverage and accurate freshness dates tells these crawlers where to focus: your newest, most relevant content.

Broken URLs in a sitemap are a particularly damaging signal. When a crawler follows a sitemap entry and receives a 404 or 410 response, it learns that your sitemap metadata cannot be trusted. Over time, this degrades the crawler's confidence in your entire sitemap, reducing how much of it gets crawled on each visit. Keeping your sitemap clean — removing deleted pages promptly and updating redirected URLs to their final destinations — maintains the signal quality that keeps crawlers returning frequently and fully.
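The clean-up described above starts with sorting sampled URLs by their HTTP status. The sketch below shows one way to bucket results into the healthy, redirected, and broken groups; the exact status ranges are an assumption for illustration and may differ from how the checker classifies responses.

```python
from collections import Counter


def classify_status(status):
    """Map an HTTP status code to a health bucket (assumed ranges)."""
    if 200 <= status < 300:
        return "healthy"
    if 300 <= status < 400:
        return "redirected"
    return "broken"  # 4xx and 5xx, including 404 and 410


def tally(results):
    """Count buckets for a {url: status_code} mapping of sampled URLs."""
    counts = Counter(classify_status(code) for code in results.values())
    return {bucket: counts.get(bucket, 0) for bucket in ("healthy", "redirected", "broken")}


def urls_to_fix(results):
    """Return the URLs that should be removed or updated in the sitemap."""
    return sorted(url for url, code in results.items() if classify_status(code) != "healthy")


# Hypothetical results from a status check of four sitemap entries.
results = {
    "https://example.com/": 200,
    "https://example.com/old-page": 301,
    "https://example.com/deleted": 404,
    "https://example.com/gone": 410,
}
print(tally(results))
print(urls_to_fix(results))
```

Broken entries would be removed from the sitemap, while redirected entries would be replaced with their final destination URLs.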

For brands pursuing GEO, the practical priority is to ensure that every page you want cited in AI-generated answers is present in a healthy sitemap, carries an accurate lastmod date, and resolves with a 200 status. Once those conditions are met, the content quality, schema markup, and authority signals you invest in can be evaluated and acted upon by AI engines — none of that work matters if the page was never discovered in the first place.

Common sitemap mistakes that limit AI indexing

Most sitemap health problems are not the result of deliberate choices — they accumulate over time as content is published, updated, and deleted without a corresponding sitemap maintenance process. These are the four patterns we see most often:

  • Broken and deleted URLs still in the sitemap. When pages are removed or redirected, their old URLs often remain in the sitemap indefinitely. Crawlers that follow these entries waste budget on dead ends and, over time, treat your sitemap as an unreliable source — which reduces indexing frequency for your healthy pages too. Remove 404 and 410 URLs immediately and replace redirected entries with their final destination URLs.
  • Missing or static lastmod dates. Some CMS platforms omit lastmod entirely; others set it once at publication and never update it, even when the page content changes substantially. Either pattern removes a key signal that AI crawlers use to prioritise re-crawling. Configure your CMS to update lastmod automatically on every significant content change and avoid setting it to a fixed date at publish time.
  • Oversized sitemaps that exceed protocol limits. The Sitemaps protocol limits each file to 50,000 URLs and 50 MB uncompressed. Sites that exceed these limits — often e-commerce or publisher sites with large product or article catalogues — may find that crawlers stop parsing mid-file, leaving a significant portion of URLs undiscovered. Split large sitemaps into thematic or date-based child files under a sitemap index and submit each individually.
  • Sitemap not declared in robots.txt. Without a Sitemap: directive in your robots.txt, newer AI crawlers may never find your sitemap. Some crawlers probe the standard /sitemap.xml path by convention, but this behaviour is not guaranteed for every crawler. Add the full URL of your sitemap (or sitemap index) to robots.txt as a single line, for example: Sitemap: https://example.com/sitemap.xml

To diagnose and fix these issues systematically, use this tool alongside our AI Crawler Checker — which tells you whether the crawlers that power ChatGPT, Perplexity, and Google AI Overviews are even allowed through your robots.txt before they reach your sitemap. For a deeper technical assessment, our Technical SEO service covers sitemap architecture, crawl budget optimisation, and the full technical indexability stack.

Frequently asked questions

Go beyond diagnostics

These tools show you the gaps. We fix them.

Get a full AI visibility audit across ChatGPT, Perplexity, Gemini, and Google AI Overviews — or talk to our team about a hands-on engagement.