An XML sitemap that lists the wrong URLs is worse than no sitemap at all. The wrong list teaches Google to distrust your sitemap as a quality signal, and the Search Console indexation report gets noisy enough that you stop reading it. Most sitemaps in the wild are auto-generated by a CMS plugin nobody has reviewed, and it shows: redirected URLs, noindexed URLs, parameter duplicates, and stale <lastmod> dates that bump on every deploy.
This guide skips the "what is a sitemap" basics (the tool below covers that) and focuses on what makes a sitemap actually useful for SEO: the canonical-only inclusion rule, the five mistakes that quietly hurt rankings, and the 20-minute audit you can run before your next deploy.
Validate any sitemap right now
Paste any sitemap URL below. The tool fetches it, counts URLs, parses every entry, and flags structural issues - useful both for auditing your own site and for spot-checking competitor sitemaps to see what they're prioritising.
What belongs in a sitemap (and what doesn't)
The single rule that fixes most sitemap issues: only list canonical URLs that are indexable and return 200. Every URL in your sitemap is a request to Google to "please index this." If a URL in your sitemap is marked noindex, redirects, or has a canonical pointing somewhere else, you have sent Google two contradictory signals about the same URL. Search Console flags every contradiction in the Pages report, and Google's response over time is to discount the sitemap as a quality signal.
The decision rule, applied to every URL your sitemap generator suggests:
- Returns 200, no `noindex`, canonical points at itself → include.
- Returns 3xx (redirects somewhere) → exclude. List the redirect target instead.
- Returns 4xx or 5xx → exclude. Fix the URL, or stop linking to it.
- Has a `noindex` meta tag or `X-Robots-Tag` header → exclude. The sitemap says "index me," the page says "don't." Pick one.
- Canonicalises to a different URL (e.g. `?ref=email` variants, paginated archives, tag pages that point at category pages) → exclude. Only the canonical belongs.
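The decision rule is mechanical enough to script. A minimal sketch as a pure function (the function name is hypothetical; the status, noindex, and canonical values would come from fetching each URL with whatever HTTP client you already use):

```python
def classify_sitemap_entry(url, status, noindex, canonical):
    """Apply the canonical-only inclusion rule to one sitemap entry.

    url       -- the URL exactly as listed in the sitemap
    status    -- HTTP status code the URL returns
    noindex   -- True if the page has a noindex meta tag or X-Robots-Tag
    canonical -- URL named in the page's rel=canonical tag (None if absent)
    """
    if 300 <= status < 400:
        return "exclude: redirects - list the redirect target instead"
    if status >= 400:
        return "exclude: dead URL - fix it or stop linking to it"
    if noindex:
        return "exclude: noindex contradicts the sitemap's 'index me'"
    if canonical is not None and canonical != url:
        return "exclude: non-canonical variant - list the canonical instead"
    return "include"
```

Running every generated URL through a check like this before the sitemap ships is a cheap deploy-time guard against the contradictory-signal problem described above.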
Most CMS-generated sitemaps fail at least three of those tests on day one. WordPress sitemap plugins love to include tag and date archives; Next.js static-export generators frequently include parameterised URLs; Shopify pulls in collection-filter URLs. The fix is usually one config change in the plugin, not a rewrite.
The fields Google actually uses (and the ones it ignores)
The sitemap protocol defines four fields per URL. For SEO, only two of them matter:
- `<loc>` - required. The URL itself.
- `<lastmod>` - useful, when used honestly. Google treats it as a recrawl hint: if the value is recent, Googlebot may revisit the URL sooner. Bump it on real content changes and you get faster recrawls. Bump it every day on every URL (which most sitemap generators do by default) and Google starts ignoring the field entirely, because the signal is noise.
- `<changefreq>` and `<priority>` - dead. Google has stated publicly it ignores both. Setting every page to `priority: 1.0` doesn't help; setting `changefreq: hourly` doesn't change crawl behaviour. The fields exist for protocol compatibility, but for SEO they're noise. Most modern generators have stopped emitting them.
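For reference, a clean entry carrying only the two fields that matter looks like this (the URL and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/guides/xml-sitemaps</loc>
    <lastmod>2024-11-03</lastmod>
  </url>
</urlset>
```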
Audit move: open your sitemap in a browser and inspect 10 entries. If every URL has the same <lastmod> set to today's date, your generator is bumping on every deploy and you are training Google to ignore the signal. Switch the generator to use each page's actual modification timestamp, or drop the field if you can't get an honest value out of your CMS.
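The same check can be scripted against the raw XML. A sketch using Python's standard library (it assumes you've already fetched the sitemap into a string; the heuristic and function name are ours, not part of any tool):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def lastmod_is_noise(sitemap_xml):
    """Heuristic: if every <lastmod> in a multi-URL sitemap carries the
    exact same value, the generator is almost certainly bumping the field
    on every deploy rather than reporting real modification dates."""
    root = ET.fromstring(sitemap_xml)
    urls = root.findall(NS + "url")
    dates = {el.text for el in root.iter(NS + "lastmod")}
    return len(urls) > 1 and len(dates) == 1
```

A sitemap where this returns True is training Google to ignore your `<lastmod>` values entirely.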
The biggest mistake: stale, contradictory, or junk entries
Most sitemap problems aren't structural ("the XML doesn't validate") but editorial: the URLs in the file don't match what the site actually wants indexed. The pattern usually looks like this. A CMS upgrade or migration changes URL structures. The sitemap generator is auto-running on every deploy, so the new sitemap regenerates fine. But nobody re-checks the indexation report for two months. By then, the sitemap has been telling Google "please index these 47,000 URLs" while the canonical tags on those URLs say "actually index these other 47,000 URLs."
The four flavours of junk you'll find in a default-generated sitemap:
- Redirected URLs. Old slugs from before a migration, or URLs that 301 to a canonical version. The sitemap should contain only the destination, never the source.
- Noindex URLs. Login pages, thank-you pages, internal admin pages, on-site search result pages. Plugins frequently include these by default. Each one is a contradictory signal.
- Non-canonical variants. Tag pages, paginated archives, query-parameter URLs, AMP variants. If the canonical points elsewhere, only the canonical belongs in the sitemap.
- Dead URLs. 404s, 410s, or URLs that redirect into a 404. These usually creep in when a sitemap generator caches the URL list and a page is deleted without invalidating the cache.
Audit move: pull your sitemap, take a 50-URL sample, and run each through Search Console URL Inspection (or the embedded checker above for status codes). Anything returning 3xx, 4xx, 5xx, or carrying a noindex tag is a sitemap entry to remove or fix.
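Pulling the 50-URL sample can be scripted too. A sketch that parses the sitemap and draws a random sample of `<loc>` values (checking each sampled URL's status code and noindex state is left to whatever HTTP client or inspection tool you already use):

```python
import random
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sample_sitemap_urls(sitemap_xml, n=50, seed=None):
    """Parse a sitemap and return up to n randomly chosen <loc> URLs."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.iter(NS + "loc")]
    rng = random.Random(seed)          # seed only for reproducible audits
    return rng.sample(locs, min(n, len(locs)))
```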
What a clean sitemap audit looks like
Run this after any deploy that touches URL structure, after a CMS or plugin upgrade that regenerates the sitemap, after a migration, and on a quarterly cadence regardless. Takes about 20 minutes for a small site, 45 for a larger one.
- Fetch and validate the live sitemap. Use the embedded checker above on `https://yourdomain.com/sitemap.xml`. Confirm it returns 200, the content-type is XML, the file parses without errors, and the URL count matches what you expect for the site.
- Walk the sitemap index, if you have one. Each child sitemap should also return 200 with valid XML. A chain of broken or 5xx child sitemaps is a quiet way to lose 80% of your indexation surface without noticing.
- Reconcile against Search Console. Open the Pages report and filter by your submitted sitemap. "Submitted and indexed" is your healthy URLs. Everything in "Submitted but not indexed" is your priority audit list. Work through the categories Google groups them into ("Crawled, currently not indexed", "Discovered, currently not indexed", "Submitted URL has 'noindex'", "Submitted URL not found", and so on). Each category has a different fix.
- Sample-check 50 URLs from the sitemap. Pull a random sample. For each: does it return 200? Is the canonical self-referencing? Is there a `noindex` tag? If any answer is wrong, that URL doesn't belong in the sitemap until the underlying issue is fixed.
- Confirm the sitemap is declared and submitted. Open `https://yourdomain.com/robots.txt` and verify the `Sitemap:` line points at the current sitemap URL (migrations regularly break this). Then open Search Console → Sitemaps and confirm the same URL is submitted there.
- Audit your `<lastmod>` behaviour. Open the raw XML and inspect 10 entries. If every URL has the same recent date, your generator is bumping on every deploy. Either fix the generator to use the page's real modification timestamp, or drop the field. Honest `<lastmod>` values give you faster recrawls; dishonest ones get the field ignored.
- Re-submit after URL changes. The first deploy after a slug rename, structural change, or migration is when sitemaps go bad. Regenerate and re-submit the same day, not weeks later. Search Console shows "Couldn't fetch" or "Couldn't read" when a sitemap breaks during a deploy - check that report after every release.
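The sitemap-index walk in the checklist above can also be scripted. This sketch only extracts the child sitemap URLs from an index file; fetching each child and confirming a 200 with valid XML is left to your HTTP client:

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def child_sitemaps(index_xml):
    """Return the child sitemap URLs listed in a sitemap index file.

    A plain <urlset> sitemap has no children, so it yields an empty list."""
    root = ET.fromstring(index_xml)
    if not root.tag.endswith("sitemapindex"):
        return []
    return [el.text.strip() for el in root.iter(NS + "loc")]
```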
Grab the one-page checklist
A printable version of the 20-minute audit, plus copy-pasteable starter sitemap configurations for WordPress, Next.js, and static sites - all pre-configured to honour the canonical-only inclusion rule.
Quick quiz: are you ready to audit your own sitemap?
Five questions, takes two minutes. We'll show you the right answer and a one-line explanation after each one.
XML sitemaps - quick check
5 randomized questions drawn from a pool of 12. Different every time you take it. Takes about two minutes.
Next up in Technical SEO
You've covered status codes (the entry-point signal), robots.txt (what gets crawled), and sitemaps (what you've asked to be indexed). The rest of the Technical SEO pillar:
- Crawlability vs indexability - the two checks Google makes before your page can rank, and why a page can pass one and fail the other.
- Mixed content and HTTPS - the audit that takes 30 seconds and often reveals a year's worth of debt.