What we check
Every signal below is something Google's Helpful Content System has been observed weighting since 2022. We don't guess at expertise; we score the mechanical fingerprints of low-effort content.
Every check we run
Thin content
Word count below 400 (article) / 180 (listing page) flags as thin; under 150 / 60 flags as very thin. Heaviest signal Google uses to filter Helpful Content. Common offenders: tag archives, empty category pages, AI-fluff product pages.
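A minimal sketch of those thresholds in code; the word counts come from the check above, while the type and function names are illustrative rather than our actual internals:

```ts
type PageKind = "article" | "listing";
type Thinness = "ok" | "thin" | "very_thin";

// Thresholds from above: 400/180 words flags thin, 150/60 very thin.
const THIN: Record<PageKind, number> = { article: 400, listing: 180 };
const VERY_THIN: Record<PageKind, number> = { article: 150, listing: 60 };

function classifyThinness(wordCount: number, kind: PageKind): Thinness {
  if (wordCount < VERY_THIN[kind]) return "very_thin";
  if (wordCount < THIN[kind]) return "thin";
  return "ok";
}

// classifyThinness(120, "article") === "very_thin"
```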
Low unique-word ratio
How varied is the vocabulary on the page? Pages padded with repeated phrases or boilerplate score under 30% unique and get flagged. Pages with a self-canonical and sitemap inclusion get the flag downgraded to info — Google has likely already chosen to crawl them.
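In code, the ratio is just distinct tokens over total tokens. A sketch, with the 30% threshold from above and a deliberately simple tokeniser:

```ts
// Unique-word ratio: distinct lowercase tokens / total tokens.
// The tokeniser here is illustrative; the real one may differ.
function uniqueWordRatio(text: string): number {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  if (words.length === 0) return 0;
  return new Set(words).size / words.length;
}

// uniqueWordRatio("the cat sat on the mat") === 5 / 6, well above the 0.3 floor
```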
AI-tell phrases
A curated list of 40+ stock phrases that signal an AI draft was never edited. Includes filler transitions, vague intensifiers, and the kind of generic summary tags ChatGPT-style models reach for by default. Three hits flags the page; five or more flags it heavily. The detector matches case-insensitively on substring presence; we are conservative about which phrases qualify, so false positives are rare.
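A sketch of the detector; the four phrases shown are plausible examples of our own choosing, not the actual curated list:

```ts
// Hypothetical excerpt; the real list has 40+ entries.
const AI_TELLS = [
  "in today's fast-paced world",
  "it's important to note that",
  "delve into",
  "in conclusion",
];

// Case-insensitive substring matching, as described above.
function aiTellHits(body: string): number {
  const haystack = body.toLowerCase();
  return AI_TELLS.filter((phrase) => haystack.includes(phrase)).length;
}

// Three hits flags the page; five or more flags it heavily.
function aiTellSeverity(hits: number): "none" | "flag" | "heavy" {
  return hits >= 5 ? "heavy" : hits >= 3 ? "flag" : "none";
}
```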
Heading structure
We flag pages with no H1 and pages with multiple H1s (responsive mobile + desktop duplicates are a common cause). Google's structure heuristics depend on a clean H1 → H2 hierarchy.
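A deliberately simple sketch of the count using a regex over static HTML; a production checker would use a real HTML parser, and the finding codes are illustrative:

```ts
function h1Findings(html: string): string[] {
  const count = (html.match(/<h1[\s>]/gi) ?? []).length;
  if (count === 0) return ["NO_H1"];
  if (count > 1) return ["MULTIPLE_H1"]; // often a mobile + desktop duplicate
  return [];
}
```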
Title and meta description
Missing title, short (<15 chars), or long (>65 chars — Google truncates). Missing meta description or under 50 chars. Each gets a fix-suggestion with the exact char count and recommended range.
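The length checks in sketch form, using the thresholds above and illustrative finding codes:

```ts
function titleMetaFindings(title: string | null, metaDesc: string | null): string[] {
  const out: string[] = [];
  if (!title) out.push("MISSING_TITLE");
  else if (title.length < 15) out.push(`SHORT_TITLE (${title.length} chars; aim for 15-65)`);
  else if (title.length > 65) out.push(`LONG_TITLE (${title.length} chars; Google truncates past 65)`);
  if (!metaDesc) out.push("MISSING_META_DESCRIPTION");
  else if (metaDesc.length < 50) out.push(`SHORT_META_DESCRIPTION (${metaDesc.length} chars; aim for 50+)`);
  return out;
}
```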
Canonical correctness
Canonical missing, canonical pointing to a different URL (info), canonical pointing to a noindex page (critical — Google indexes neither), canonical chain (A→B→C). Each is detected and reported with the offending target URL.
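Chain detection amounts to walking declared canonicals until they stabilise. A sketch, assuming a canonicalOf map (page URL to declared canonical target) built during the crawl:

```ts
// Follow canonicals from a starting URL. A self-canonical ends the walk;
// a cycle guard prevents infinite loops on A -> B -> A.
function canonicalChain(start: string, canonicalOf: Map<string, string>): string[] {
  const chain = [start];
  const seen = new Set(chain);
  let next = canonicalOf.get(start);
  while (next && next !== chain[chain.length - 1]) {
    if (seen.has(next)) break; // cycle guard
    chain.push(next);
    seen.add(next);
    next = canonicalOf.get(next);
  }
  return chain; // length 2 = points elsewhere (info); length 3+ = chain (A -> B -> C)
}
```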
Soft 404
Page returns HTTP 200 but body or title says "not found" / "doesn't exist". Google treats these as soft-404s and excludes them from the index. Critical severity.
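The heuristic in sketch form; the two phrases are the ones quoted above, though the real list may be longer:

```ts
const SOFT_404_PHRASES = ["not found", "doesn't exist"];

// HTTP 200 plus a "page is gone" message in the title or body = soft 404.
function isSoft404(status: number, title: string, bodyText: string): boolean {
  if (status !== 200) return false;
  const text = (title + " " + bodyText).toLowerCase();
  return SOFT_404_PHRASES.some((phrase) => text.includes(phrase));
}
```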
JSON-LD validity
We parse every <script type="application/ld+json"> block. Invalid JSON = BROKEN_JSON_LD. Multiple blocks declaring the same @type on one page (common bug: page-level + layout-level both emit Organization) = DUPLICATE_JSON_LD.
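Both checks in sketch form, again using a regex where a production implementation would use an HTML parser; finding names match the codes above:

```ts
function jsonLdFindings(html: string): string[] {
  const out: string[] = [];
  const seenTypes = new Set<string>();
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const [, raw] of html.matchAll(re)) {
    try {
      const data = JSON.parse(raw);
      const type = data?.["@type"];
      if (typeof type === "string") {
        if (seenTypes.has(type)) out.push(`DUPLICATE_JSON_LD (${type})`); // e.g. two Organization blocks
        seenTypes.add(type);
      }
    } catch {
      out.push("BROKEN_JSON_LD");
    }
  }
  return out;
}
```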
og:image accessibility
We HEAD-check every unique og:image URL. 4xx = OG_IMAGE_404. Social shares and Google's image preview render blank when this breaks.
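The core of the check is one HEAD request per unique URL. A sketch using the fetch built into Node 18+, with batching and error handling left out:

```ts
async function checkOgImages(urls: string[]): Promise<string[]> {
  const findings: string[] = [];
  for (const url of new Set(urls)) {
    const res = await fetch(url, { method: "HEAD" });
    if (res.status >= 400 && res.status < 500) findings.push(`OG_IMAGE_404 (${url})`);
  }
  return findings;
}
```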
Duplicate titles (cross-page)
Same <title> on multiple pages, excluding pagination variants that share a canonical. Google may pick one and exclude the rest.
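A sketch of the grouping, with an illustrative Page shape; a group whose members all share one canonical is treated as pagination variants and skipped:

```ts
interface Page { url: string; title: string; canonical: string }

function duplicateTitleGroups(pages: Page[]): Page[][] {
  const byTitle = new Map<string, Page[]>();
  for (const page of pages) {
    byTitle.set(page.title, [...(byTitle.get(page.title) ?? []), page]);
  }
  // Keep groups with 2+ pages that don't all canonicalise to the same URL.
  return [...byTitle.values()].filter(
    (group) => group.length > 1 && new Set(group.map((p) => p.canonical)).size > 1
  );
}
```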
Hreflang reciprocity
We check that same-host hreflang alternates point back at each other. Cross-TLD alternates are skipped (we can't verify them without scanning the other site too). Google ignores asymmetric hreflang sets.
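A sketch of the reciprocity check, assuming an alternates map from each page URL to the hreflang target URLs it declares; the finding code is illustrative:

```ts
function hreflangFindings(alternates: Map<string, Set<string>>, host: string): string[] {
  const out: string[] = [];
  for (const [page, targets] of alternates) {
    for (const target of targets) {
      if (new URL(target).host !== host) continue; // cross-TLD: skipped, can't verify
      if (!alternates.get(target)?.has(page)) {
        out.push(`HREFLANG_NOT_RECIPROCAL (${page} -> ${target})`);
      }
    }
  }
  return out;
}
```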
Site-level: robots.txt + llms.txt
Missing robots.txt, missing Sitemap: declaration, "Disallow: /" footgun. Plus we check llms.txt for TLD-locale mismatches (e.g. a .de site mentioning Canadian content).
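The robots.txt portion in sketch form (the llms.txt locale comparison is omitted, and the finding codes are illustrative):

```ts
function robotsFindings(robotsTxt: string | null): string[] {
  if (robotsTxt === null) return ["MISSING_ROBOTS_TXT"];
  const lines = robotsTxt.split("\n").map((line) => line.trim().toLowerCase());
  const out: string[] = [];
  if (!lines.some((line) => line.startsWith("sitemap:"))) out.push("NO_SITEMAP_DECLARATION");
  if (lines.includes("disallow: /")) out.push("DISALLOW_ALL"); // the footgun
  return out;
}
```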
Broken SSR ("Loading…" gates)
Pages that bail out to client-side rendering — Googlebot sees only "Loading…" with no body content. We catch these because the static-HTML word count comes back near zero on a page that should have content.
Listing-page detection
Pages with ≥20 internal links, ≥5% link-to-text ratio, and ≥40% of links sharing a top URL segment are scored as listing pages with looser word-count thresholds — their value isn't in body copy, it's in navigation.
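A sketch of the heuristic, reading "link-to-text ratio" as internal links per word of body copy, which is an assumption about the exact definition:

```ts
function topSegment(path: string): string {
  return path.split("/").filter(Boolean)[0] ?? "";
}

function isListingPage(internalLinkPaths: string[], bodyWordCount: number): boolean {
  const links = internalLinkPaths.length;
  if (links < 20) return false;
  if (links / Math.max(bodyWordCount, 1) < 0.05) return false; // link-to-text ratio
  const counts = new Map<string, number>();
  for (const path of internalLinkPaths) {
    const seg = topSegment(path);
    counts.set(seg, (counts.get(seg) ?? 0) + 1);
  }
  return Math.max(...counts.values()) / links >= 0.4; // 40% share a top segment
}
```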
Cross-scan diff
Every scan auto-links to your most recent scan for the same site and shows "+N new findings, −N resolved." Use this to verify a fix actually moved the needle.
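Conceptually the diff is a set difference over finding keys (say, page URL plus finding type). A sketch:

```ts
function scanDiff(previous: Set<string>, current: Set<string>): { added: number; resolved: number } {
  let added = 0;
  for (const key of current) if (!previous.has(key)) added++;
  let resolved = 0;
  for (const key of previous) if (!current.has(key)) resolved++;
  return { added, resolved }; // rendered as "+N new findings, -N resolved"
}
```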
URL-pattern roll-up
We group pages by URL pattern (/blog/*, /products/*, /tag/*) and report flag distribution per group. Often shows that one section is dragging the whole site down.
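A sketch of the roll-up, bucketing by first path segment; the real pattern derivation is likely smarter than this:

```ts
function rollUpByPattern(findings: { url: string; type: string }[]): Map<string, number> {
  const perPattern = new Map<string, number>();
  for (const finding of findings) {
    const seg = new URL(finding.url).pathname.split("/").filter(Boolean)[0] ?? "";
    const pattern = `/${seg}/*`; // e.g. /blog/*, /products/*, /tag/*
    perPattern.set(pattern, (perPattern.get(pattern) ?? 0) + 1);
  }
  return perPattern;
}
```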
How a scan works
- Submit your site URL. Paste your domain (e.g. example.com) on the CrawlAudit homepage. No signup needed for the 100-URL preview.
- We fetch your sitemap. CrawlAudit pulls /sitemap.xml (or /sitemap_index.xml) to discover the URLs to scan. For sites without a sitemap, we crawl from the homepage.
- Chunked crawl. A serverless worker claims 200 URLs at a time, fetches them 20 at a time, scores each page, and fire-and-forgets the next chunk; a 5,000-URL site takes 10-15 minutes. (See the sketch after this list.)
- Cross-page checks at finalize. When the queue empties, we run duplicate-title detection, hreflang reciprocity, canonical-chain detection, og:image HEAD-check, and site-level robots.txt + llms.txt checks.
- Fix recommendations are computed. Findings are aggregated by type and ranked by total risk reduction across the site. The top 10 appear in the report header.
- Download the report. Subscribers download Markdown (human-readable) and CSV (spreadsheet-ready) reports with every flagged page, the why behind each finding, and the fix.
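For the curious, here is roughly what one worker pass from the "Chunked crawl" step looks like. claimChunk, scorePage, and enqueueNextChunk are stand-ins for internals that aren't public:

```ts
const CHUNK_SIZE = 200;  // URLs claimed per worker invocation
const CONCURRENCY = 20;  // simultaneous fetches

async function crawlChunk(
  claimChunk: (n: number) => Promise<string[]>,
  scorePage: (url: string) => Promise<void>,
  enqueueNextChunk: () => void
): Promise<void> {
  const urls = await claimChunk(CHUNK_SIZE);
  if (urls.length === 0) return; // queue empty: the finalize step runs cross-page checks
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    await Promise.all(urls.slice(i, i + CONCURRENCY).map(scorePage));
  }
  enqueueNextChunk(); // fire-and-forget the next chunk
}
```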
Frequently asked questions
Does CrawlAudit execute JavaScript like Googlebot does?
No. We fetch and score static HTML only: the same first-pass view Googlebot gets before its render queue. That's deliberate; it's how we catch the broken-SSR "Loading…" pages described above, whose static HTML comes back nearly empty.
Can I trust the LOW_UNIQUE flag?
Yes, with one built-in caveat: we downgrade it to info severity when a page has a self-canonical and is in the sitemap, since that combination is a strong signal Google has already chosen to crawl the page. The flag still appears, so you can prioritise enrichment if you want, but it stops dragging your overall risk score down.
How big a site can you scan?
The no-signup preview covers 100 URLs. Full scans use the chunked crawl described above, so larger sites are mainly a matter of time: 5,000 URLs takes 10-15 minutes.
Will CrawlAudit hammer my server?
No. A scan fetches at most 20 URLs concurrently (see "Chunked crawl" above), and every request sends the User-Agent CrawlAuditBot/1.0 so you can identify or block us if needed.