Technical SEO·April 27, 2026·10 min read

The XML sitemap playbook for SEO

An XML sitemap that lists the wrong URLs is worse than no sitemap at all - it teaches Google to distrust your sitemap as a quality signal. A practical playbook for the canonical-only inclusion rule, the four flavours of junk in default-generated sitemaps, and the 20-minute audit you can run before your next deploy.

An XML sitemap that lists the wrong URLs is worse than no sitemap at all. The wrong list teaches Google to distrust your sitemap as a quality signal, and the Search Console indexation report gets noisy enough that you stop reading it. Most sitemaps in the wild are auto-generated by a CMS plugin nobody has reviewed, and it shows: redirected URLs, noindexed URLs, parameter duplicates, and <lastmod> dates that bump on every deploy whether or not the content changed.

This guide skips the "what is a sitemap" basics (the tool below covers that) and focuses on what makes a sitemap actually useful for SEO: the canonical-only inclusion rule, the four flavours of junk that quietly hurt rankings, and the 20-minute audit you can run before your next deploy.

Validate any sitemap right now

Paste any sitemap URL below. The tool fetches it, counts URLs, parses every entry, and flags structural issues - useful both for auditing your own site and for spot-checking competitor sitemaps to see what they're prioritising.

Try it inline

XML Sitemap Checker

Validate any XML sitemap, count URLs, and flag errors. No login, works on any domain.

Open full tool

What belongs in a sitemap (and what doesn't)

The single rule that fixes most sitemap issues: only list canonical URLs that are indexable and return 200. Every URL in your sitemap is a request to Google to "please index this." If a URL in your sitemap is marked noindex, redirects, or has a canonical pointing somewhere else, you have sent Google two contradictory signals about the same URL. Search Console flags every contradiction in the Pages report, and Google's response over time is to discount the sitemap as a quality signal.

The decision rule, applied to every URL your sitemap generator suggests (a scripted version follows the list):

  • Returns 200, no noindex, canonical points at itself → include.
  • Returns 3xx (redirects somewhere) → exclude. List the redirect target instead.
  • Returns 4xx or 5xx → exclude. Fix the URL, or stop linking to it.
  • Has a noindex meta tag or X-Robots-Tag header → exclude. The sitemap says "index me," the page says "don't." Pick one.
  • Canonicalises to a different URL (e.g. ?ref=email variants, paginated archives, tag pages that point at category pages) → exclude. Only the canonical belongs.
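
That rule is mechanical enough to script. A minimal sketch in Python, assuming the third-party requests library; sitemap_verdict is a hypothetical helper name, and the regexes are deliberately crude - a production audit would parse the HTML properly and resolve relative canonical hrefs:

```python
import re
import requests

def sitemap_verdict(url: str) -> str:
    """Apply the decision rule above: return 'include' or 'exclude: <reason>'."""
    resp = requests.get(url, allow_redirects=False, timeout=10)

    # 3xx: list the redirect target instead; 4xx/5xx: fix or drop the URL.
    if 300 <= resp.status_code < 400:
        return f"exclude: redirects to {resp.headers.get('Location')}"
    if resp.status_code != 200:
        return f"exclude: returns {resp.status_code}"

    # noindex via the X-Robots-Tag header or the robots meta tag.
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return "exclude: X-Robots-Tag noindex"
    if re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', resp.text, re.I):
        return "exclude: meta robots noindex"

    # A canonical pointing at a different URL disqualifies this one.
    # (Crude: assumes rel comes before href and ignores relative hrefs.)
    m = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)',
                  resp.text, re.I)
    if m and m.group(1).rstrip("/") != url.rstrip("/"):
        return f"exclude: canonicalises to {m.group(1)}"

    return "include"
```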

Most CMS-generated sitemaps fail at least three of those tests on day one. WordPress sitemap plugins love to include tag and date archives; Next.js static-export generators frequently include parameterised URLs; Shopify pulls in collection-filter URLs. The fix is usually one config change in the plugin, not a rewrite.

The fields Google actually uses (and the ones it ignores)

The sitemap protocol defines four fields per URL. For SEO, only two of them matter:

  • <loc> - required. The URL itself.
  • <lastmod> - useful, when used honestly. Google treats it as a recrawl hint: if the value is recent, Googlebot may revisit the URL sooner. Bump it on real content changes and you get faster recrawls. Bump it every day on every URL (which most sitemap generators do by default) and Google starts ignoring the field entirely, because the signal is noise.
  • <changefreq> and <priority> - dead. Google has stated publicly it ignores both. Setting every page to priority: 1.0 doesn't help; setting changefreq: hourly doesn't change crawl behaviour. The fields exist for protocol compatibility, but for SEO they're noise. Most modern generators have stopped emitting them.

Audit move: open your sitemap in a browser and inspect 10 entries. If every URL has the same <lastmod> set to today's date, your generator is bumping on every deploy and you are training Google to ignore the signal. Switch the generator to use each page's actual modification timestamp, or drop the field if you can't get an honest value out of your CMS.
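
When you control the generator, an honest value is usually one file-stat away. A minimal sketch for a static site, assuming each page's source file mtime tracks real content edits (fresh CI checkouts reset mtimes, so a front-matter date is often the safer source); the content/pages layout, URL mapping, and output path are hypothetical:

```python
from datetime import datetime, timezone
from pathlib import Path

CONTENT_DIR = Path("content/pages")   # hypothetical source layout
BASE_URL = "https://yourdomain.com"

entries = []
for page in sorted(CONTENT_DIR.glob("**/*.md")):
    url = f"{BASE_URL}/{page.relative_to(CONTENT_DIR).with_suffix('').as_posix()}/"
    # File mtime, not the build time: the date only moves when the file does.
    modified = datetime.fromtimestamp(page.stat().st_mtime, tz=timezone.utc)
    entries.append(
        f"  <url>\n"
        f"    <loc>{url}</loc>\n"
        f"    <lastmod>{modified.strftime('%Y-%m-%d')}</lastmod>\n"
        f"  </url>"
    )

sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + "\n".join(entries)
    + "\n</urlset>\n"
)
Path("public/sitemap.xml").write_text(sitemap)
```

Note the output: only <loc> and <lastmod>, the two fields Google reads.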

The biggest mistake: stale, contradictory, or junk entries

Most sitemap problems aren't structural ("the XML doesn't validate") but editorial: the URLs in the file don't match what the site actually wants indexed. The pattern usually looks like this. A CMS upgrade or migration changes URL structures. The sitemap generator is auto-running on every deploy, so the new sitemap regenerates fine. But nobody re-checks the indexation report for two months. By then, the sitemap has been telling Google "please index these 47,000 URLs" while the canonical tags on those URLs say "actually index these other 47,000 URLs."

The four flavours of junk you'll find in a default-generated sitemap:

  1. Redirected URLs. Old slugs from before a migration, or URLs that 301 to a canonical version. The sitemap should contain only the destination, never the source.
  2. Noindex URLs. Login pages, thank-you pages, internal admin pages, on-site search result pages. Plugins frequently include these by default. Each one is a contradictory signal.
  3. Non-canonical variants. Tag pages, paginated archives, query-parameter URLs, AMP variants. If the canonical points elsewhere, only the canonical belongs in the sitemap.
  4. Dead URLs. 404s, 410s, or URLs that redirect into a 404. These usually creep in when a sitemap generator caches the URL list and a page is deleted without invalidating the cache.

Audit move: pull your sitemap, take a 50-URL sample, and run each through Search Console URL Inspection (or the embedded checker above for status codes). Anything returning 3xx, 4xx, 5xx, or carrying a noindex tag is a sitemap entry to remove or fix.
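
That sample is scriptable too. A sketch assuming a flat urlset (not a sitemap index - a walker for those appears later in this guide) and reusing the hypothetical sitemap_verdict helper from the decision-rule sketch above:

```python
import random
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sample_audit(sitemap_url: str, sample_size: int = 50) -> None:
    root = ET.fromstring(requests.get(sitemap_url, timeout=10).content)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    print(f"{len(urls)} URLs in sitemap; checking {min(sample_size, len(urls))}")
    for url in random.sample(urls, min(sample_size, len(urls))):
        verdict = sitemap_verdict(url)   # from the earlier sketch
        if verdict != "include":
            print(f"{url} -> {verdict}")

sample_audit("https://yourdomain.com/sitemap.xml")
```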

Checklist

XML sitemap DOs & DON'Ts

DO

  • List only canonical, indexable, 200-OK URLs

    A sitemap is a list of URLs you want indexed. Every entry should be the canonical version, return 200, and be free of noindex. Anything else is a mixed signal Google has to spend crawl budget reconciling.

  • Update <lastmod> when the page's content actually changes

    Google uses <lastmod> as a hint to recrawl. Bump it on real edits and Googlebot returns sooner; bump it every day on every URL and Google starts ignoring the field entirely.

  • Split into multiple sitemaps once you exceed 50k URLs (or 50MB)

    These are hard limits in the protocol. Use a sitemap index that points to per-section sitemaps so you can also see indexation rates per content type in Search Console (a chunking sketch follows this checklist).

  • Submit the sitemap in Search Console AND declare it in robots.txt

    Belt and braces. Search Console gives you the indexation report; the robots.txt declaration ensures any crawler (not just Google) finds it.

  • Reconcile your sitemap against indexation in Search Console

    Open the Pages report → filter by your sitemap. URLs in the sitemap that Google chose not to index are the highest-signal list of pages with quality, duplication, or canonical problems.

DON'T

  • Don't include URLs that are noindex, redirected, or 404

    Each one is a contradiction. The sitemap says "index this" while the page says "don't." Google flags these in Search Console as "Submitted URL has 'noindex'" or "Submitted URL not found."

  • Don't list non-canonical variants

    If /products/widget?ref=email canonicalises to /products/widget, only the canonical belongs in the sitemap. Listing parameterised duplicates wastes crawl budget on URLs you've already told Google to ignore.

  • Don't auto-generate <priority> or <changefreq> values

    Google has publicly said it ignores both fields. Setting every page to priority 1.0 doesn't help; setting them honestly takes effort that produces no SEO benefit. Skip them entirely.

  • Don't ship a sitemap that 404s, 5xxs, or returns text/html

    All three break sitemap discovery. A sitemap must return 200 with a valid XML content-type. Static-export frameworks frequently regress this after upgrades.

  • Don't let your sitemap fall stale after a migration

    Slug renames, URL-structure changes, and CMS migrations are when sitemaps go bad. The first deploy after migration is when you should regenerate and re-submit, not weeks later.
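
The 50k split from the DO list above is also worth scripting rather than trusting a plugin. A minimal sketch, assuming you already have a deduplicated list of canonical URLs; numeric chunks are shown for brevity, though per-section sitemaps give you better reporting in Search Console. The public/ output directory is a hypothetical example:

```python
from datetime import date
from pathlib import Path

MAX_URLS = 50_000   # hard protocol limit per sitemap file
BASE_URL = "https://yourdomain.com"
OUT = Path("public")            # hypothetical output directory
XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def write_sitemaps(urls: list[str]) -> None:
    children = []
    for i in range(0, len(urls), MAX_URLS):
        name = f"sitemap-{i // MAX_URLS + 1}.xml"
        body = "\n".join(f"  <url><loc>{u}</loc></url>"
                         for u in urls[i:i + MAX_URLS])
        (OUT / name).write_text(
            f'<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{XMLNS}">\n{body}\n</urlset>\n'
        )
        children.append(name)

    # The index file is what you submit and declare in robots.txt.
    # Today's date is honest here: the child files were just rewritten.
    index_body = "\n".join(
        f"  <sitemap><loc>{BASE_URL}/{name}</loc>"
        f"<lastmod>{date.today().isoformat()}</lastmod></sitemap>"
        for name in children
    )
    (OUT / "sitemap.xml").write_text(
        f'<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{XMLNS}">\n{index_body}\n</sitemapindex>\n'
    )
```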

What a clean sitemap audit looks like

Run this after any deploy that touches URL structure, after a CMS or plugin upgrade that regenerates the sitemap, after a migration, and on a quarterly cadence regardless. Takes about 20 minutes for a small site, 45 for a larger one.

  1. Fetch and validate the live sitemap. Use the embedded checker above on https://yourdomain.com/sitemap.xml. Confirm it returns 200, the content-type is XML, the file parses without errors, and the URL count matches what you expect for the site.
  2. Walk the sitemap index, if you have one. Each child sitemap should also return 200 with valid XML. A chain of broken or 5xx child sitemaps is a quiet way to lose 80% of your indexation surface without noticing (a walker sketch follows this list).
  3. Reconcile against Search Console. Open the Pages report and filter by your submitted sitemap. "Submitted and indexed" is your healthy URLs. Everything in "Submitted but not indexed" is your priority audit list. Work through the categories Google groups them into ("Crawled, currently not indexed", "Discovered, currently not indexed", "Submitted URL has 'noindex'", "Submitted URL not found", and so on). Each category has a different fix.
  4. Sample-check 50 URLs from the sitemap. Pull a random sample. For each: does it return 200? Is the canonical self-referencing? Is there a noindex tag? If any answer is wrong, that URL doesn't belong in the sitemap until the underlying issue is fixed.
  5. Confirm the sitemap is declared and submitted. Open https://yourdomain.com/robots.txt and verify the Sitemap: line points at the current sitemap URL (migrations regularly break this). Then open Search Console → Sitemaps and confirm the same URL is submitted there.
  6. Audit your <lastmod> behaviour. Open the raw XML and inspect 10 entries. If every URL has the same recent date, your generator is bumping on every deploy. Either fix the generator to use the page's real modification timestamp, or drop the field. Honest <lastmod> values give you faster recrawls; dishonest ones get the field ignored.
  7. Re-submit after URL changes. The first deploy after a slug rename, structural change, or migration is when sitemaps go bad. Regenerate and re-submit the same day, not weeks later. Search Console shows "Couldn't fetch" or "Couldn't read" when a sitemap breaks during a deploy - check that report after every release.
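
Step 2 is the most commonly skipped, and the easiest to script. A minimal walker, again assuming the requests library; it fetches each child sitemap named in the index and flags anything that isn't a 200 with parseable XML:

```python
import xml.etree.ElementTree as ET
import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def walk_index(index_url: str) -> None:
    root = ET.fromstring(requests.get(index_url, timeout=10).content)
    for loc in root.findall(".//sm:sitemap/sm:loc", NS):
        child_url = loc.text.strip()
        resp = requests.get(child_url, timeout=10)
        if resp.status_code != 200:
            print(f"BROKEN ({resp.status_code}): {child_url}")
            continue
        try:
            child = ET.fromstring(resp.content)
            count = len(child.findall(".//sm:loc", NS))
            print(f"OK: {child_url} ({count} URLs)")
        except ET.ParseError as err:
            print(f"INVALID XML: {child_url} ({err})")

walk_index("https://yourdomain.com/sitemap.xml")
```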

Grab the one-page checklist

A printable version of the 20-minute audit, plus copy-pasteable starter sitemap configurations for WordPress, Next.js, and static sites - all pre-configured to honour the canonical-only inclusion rule.

Free download

The XML Sitemap Audit Checklist

A printable one-pager with the 20-minute sitemap audit, the canonical-only inclusion rule, and copy-pasteable sitemap templates for WordPress, Next.js, and static sites.

Quick quiz: are you ready to audit your own sitemap?

Five questions, takes two minutes. We'll show you the right answer and a one-line explanation after each one.

Quick quiz · 5 questions

XML sitemaps - quick check

5 randomised questions drawn from a pool of 12. Different every time you take it. Takes about two minutes.

Next up in Technical SEO

You've covered status codes (the entry-point signal), robots.txt (what gets crawled), and sitemaps (what you've asked to be indexed). The rest of the Technical SEO pillar:

  • Crawlability vs indexability - the two checks Google makes before your page can rank, and why a page can pass one and fail the other.
  • Mixed content and HTTPS - the audit that takes 30 seconds and often reveals a year's worth of debt.

Keep learning

More in Technical SEO

How to audit HTTP status codes for SEO

9 min read

How to audit your robots.txt for SEO

10 min read

How to audit crawlability and indexability

11 min read

Skip the writing. Keep the SEO.

SEOGraphy drafts, illustrates, and publishes articles that follow the playbook above - automatically.

Try SEOGraphy free →