What we check
Every signal below is something Google's Helpful Content System has been observed weighting since 2022. We don't guess at expertise; we score the mechanical fingerprints of low-effort content.
Every check we run
Thin content
Word count below 400 (article) / 180 (listing page) flags as thin; under 150 / 60 flags as very thin. Heaviest signal Google uses to filter Helpful Content. Common offenders: tag archives, empty category pages, AI-fluff product pages.
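A minimal sketch of those thresholds in code; the word counts come from the check above, while the type and function names are illustrative rather than our actual internals:

```ts
type PageKind = "article" | "listing";
type Thinness = "ok" | "thin" | "very_thin";

// Thresholds from above: 400/180 words flags thin, 150/60 very thin.
const THIN: Record<PageKind, number> = { article: 400, listing: 180 };
const VERY_THIN: Record<PageKind, number> = { article: 150, listing: 60 };

function classifyThinness(wordCount: number, kind: PageKind): Thinness {
  if (wordCount < VERY_THIN[kind]) return "very_thin";
  if (wordCount < THIN[kind]) return "thin";
  return "ok";
}

// classifyThinness(120, "article") === "very_thin"
```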
Low unique-word ratio
How varied is the vocabulary on the page? Pages padded with repeated phrases or boilerplate score under 30% unique and get flagged. Pages with a self-canonical and sitemap inclusion get the flag downgraded to info — Google has likely already chosen to crawl them.
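In code, the ratio is just distinct tokens over total tokens. A sketch, with the 30% threshold from above and a deliberately simple tokeniser:

```ts
// Unique-word ratio: distinct lowercase tokens / total tokens.
// The tokeniser here is illustrative; the real one may differ.
function uniqueWordRatio(text: string): number {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  if (words.length === 0) return 0;
  return new Set(words).size / words.length;
}

// uniqueWordRatio("the cat sat on the mat") === 5 / 6, well above the 0.3 floor
```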
AI-tell phrases
A curated list of 40+ stock phrases that signal an AI draft was never edited. Includes filler transitions, vague intensifiers, and the kind of generic summary tags ChatGPT-style models reach for by default. Three hits flags the page; five or more flags it heavily. The detector matches case-insensitively on substring presence; we are conservative about which phrases qualify, so false positives are rare.
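A sketch of the detector; the four phrases shown are plausible examples of our own choosing, not the actual curated list:

```ts
// Hypothetical excerpt; the real list has 40+ entries.
const AI_TELLS = [
  "in today's fast-paced world",
  "it's important to note that",
  "delve into",
  "in conclusion",
];

// Case-insensitive substring matching, as described above.
function aiTellHits(body: string): number {
  const haystack = body.toLowerCase();
  return AI_TELLS.filter((phrase) => haystack.includes(phrase)).length;
}

// Three hits flags the page; five or more flags it heavily.
function aiTellSeverity(hits: number): "none" | "flag" | "heavy" {
  return hits >= 5 ? "heavy" : hits >= 3 ? "flag" : "none";
}
```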
Heading structure
We flag pages with no H1 and pages with multiple H1s (responsive mobile + desktop duplicates are a common cause). Google's structure heuristics depend on a clean H1 → H2 hierarchy.
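A deliberately simple sketch of the count using a regex over static HTML; a production checker would use a real HTML parser, and the finding codes are illustrative:

```ts
function h1Findings(html: string): string[] {
  const count = (html.match(/<h1[\s>]/gi) ?? []).length;
  if (count === 0) return ["NO_H1"];
  if (count > 1) return ["MULTIPLE_H1"]; // often a mobile + desktop duplicate
  return [];
}
```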
Title and meta description
Missing title, short (<15 chars), or long (>65 chars — Google truncates). Missing meta description or under 50 chars. Each gets a fix-suggestion with the exact char count and recommended range.
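The length checks in sketch form, using the thresholds above and illustrative finding codes:

```ts
function titleMetaFindings(title: string | null, metaDesc: string | null): string[] {
  const out: string[] = [];
  if (!title) out.push("MISSING_TITLE");
  else if (title.length < 15) out.push(`SHORT_TITLE (${title.length} chars; aim for 15-65)`);
  else if (title.length > 65) out.push(`LONG_TITLE (${title.length} chars; Google truncates past 65)`);
  if (!metaDesc) out.push("MISSING_META_DESCRIPTION");
  else if (metaDesc.length < 50) out.push(`SHORT_META_DESCRIPTION (${metaDesc.length} chars; aim for 50+)`);
  return out;
}
```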
Canonical correctness
Canonical missing, canonical pointing to a different URL (info), canonical pointing to a noindex page (critical — Google indexes neither), canonical chain (A→B→C). Each is detected and reported with the offending target URL.
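Chain detection amounts to walking declared canonicals until they stabilise. A sketch, assuming a canonicalOf map (page URL to declared canonical target) built during the crawl:

```ts
// Follow canonicals from a starting URL. A self-canonical ends the walk;
// a cycle guard prevents infinite loops on A -> B -> A.
function canonicalChain(start: string, canonicalOf: Map<string, string>): string[] {
  const chain = [start];
  const seen = new Set(chain);
  let next = canonicalOf.get(start);
  while (next && next !== chain[chain.length - 1]) {
    if (seen.has(next)) break; // cycle guard
    chain.push(next);
    seen.add(next);
    next = canonicalOf.get(next);
  }
  return chain; // length 2 = points elsewhere (info); length 3+ = chain (A -> B -> C)
}
```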
Soft 404
Page returns HTTP 200 but body or title says "not found" / "doesn't exist". Google treats these as soft-404s and excludes them from the index. Critical severity.
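The heuristic in sketch form; the two phrases are the ones quoted above, though the real list may be longer:

```ts
const SOFT_404_PHRASES = ["not found", "doesn't exist"];

// HTTP 200 plus a "page is gone" message in the title or body = soft 404.
function isSoft404(status: number, title: string, bodyText: string): boolean {
  if (status !== 200) return false;
  const text = (title + " " + bodyText).toLowerCase();
  return SOFT_404_PHRASES.some((phrase) => text.includes(phrase));
}
```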
JSON-LD validity
We parse every <script type="application/ld+json"> block. Invalid JSON = BROKEN_JSON_LD. Multiple blocks declaring the same @type on one page (common bug: page-level + layout-level both emit Organization) = DUPLICATE_JSON_LD.
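Both checks in sketch form, again using a regex where a production implementation would use an HTML parser; finding names match the codes above:

```ts
function jsonLdFindings(html: string): string[] {
  const out: string[] = [];
  const seenTypes = new Set<string>();
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const [, raw] of html.matchAll(re)) {
    try {
      const data = JSON.parse(raw);
      const type = data?.["@type"];
      if (typeof type === "string") {
        if (seenTypes.has(type)) out.push(`DUPLICATE_JSON_LD (${type})`); // e.g. two Organization blocks
        seenTypes.add(type);
      }
    } catch {
      out.push("BROKEN_JSON_LD");
    }
  }
  return out;
}
```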
og:image accessibility
We HEAD-check every unique og:image URL. 4xx = OG_IMAGE_404. Social shares and Google's image preview render blank when this breaks.
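The core of the check is one HEAD request per unique URL. A sketch using the fetch built into Node 18+, with batching and error handling left out:

```ts
async function checkOgImages(urls: string[]): Promise<string[]> {
  const findings: string[] = [];
  for (const url of new Set(urls)) {
    const res = await fetch(url, { method: "HEAD" });
    if (res.status >= 400 && res.status < 500) findings.push(`OG_IMAGE_404 (${url})`);
  }
  return findings;
}
```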
Duplicate titles (cross-page)
Same <title> on multiple pages, excluding pagination variants that share a canonical. Google may pick one and exclude the rest.
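A sketch of the grouping, with an illustrative Page shape; a group whose members all share one canonical is treated as pagination variants and skipped:

```ts
interface Page { url: string; title: string; canonical: string }

function duplicateTitleGroups(pages: Page[]): Page[][] {
  const byTitle = new Map<string, Page[]>();
  for (const page of pages) {
    byTitle.set(page.title, [...(byTitle.get(page.title) ?? []), page]);
  }
  // Keep groups with 2+ pages that don't all canonicalise to the same URL.
  return [...byTitle.values()].filter(
    (group) => group.length > 1 && new Set(group.map((p) => p.canonical)).size > 1
  );
}
```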
Hreflang reciprocity
We check that same-host hreflang alternates point back at each other. Cross-TLD alternates are skipped (we can't verify them without scanning the other site too). Google ignores asymmetric hreflang sets.
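A sketch of the reciprocity check, assuming an alternates map from each page URL to the hreflang target URLs it declares; the finding code is illustrative:

```ts
function hreflangFindings(alternates: Map<string, Set<string>>, host: string): string[] {
  const out: string[] = [];
  for (const [page, targets] of alternates) {
    for (const target of targets) {
      if (new URL(target).host !== host) continue; // cross-TLD: skipped, can't verify
      if (!alternates.get(target)?.has(page)) {
        out.push(`HREFLANG_NOT_RECIPROCAL (${page} -> ${target})`);
      }
    }
  }
  return out;
}
```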
Site-level: robots.txt + llms.txt
Missing robots.txt, missing Sitemap: declaration, "Disallow: /" footgun. Plus we check llms.txt for TLD-locale mismatches (e.g. a .de site mentioning Canadian content).
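The robots.txt portion in sketch form (the llms.txt locale comparison is omitted, and the finding codes are illustrative):

```ts
function robotsFindings(robotsTxt: string | null): string[] {
  if (robotsTxt === null) return ["MISSING_ROBOTS_TXT"];
  const lines = robotsTxt.split("\n").map((line) => line.trim().toLowerCase());
  const out: string[] = [];
  if (!lines.some((line) => line.startsWith("sitemap:"))) out.push("NO_SITEMAP_DECLARATION");
  if (lines.includes("disallow: /")) out.push("DISALLOW_ALL"); // the footgun
  return out;
}
```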
Broken SSR ("Loading…" gates)
Pages that bail out to client-side rendering — Googlebot sees only "Loading…" with no body content. We catch these because the static-HTML word count comes back near zero on a page that should have content.
Listing-page detection
Pages with ≥20 internal links, ≥5% link-to-text ratio, and ≥40% of links sharing a top URL segment are scored as listing pages with looser word-count thresholds — their value isn't in body copy, it's in navigation.
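A sketch of the heuristic, reading "link-to-text ratio" as internal links per word of body copy, which is an assumption about the exact definition:

```ts
function topSegment(path: string): string {
  return path.split("/").filter(Boolean)[0] ?? "";
}

function isListingPage(internalLinkPaths: string[], bodyWordCount: number): boolean {
  const links = internalLinkPaths.length;
  if (links < 20) return false;
  if (links / Math.max(bodyWordCount, 1) < 0.05) return false; // link-to-text ratio
  const counts = new Map<string, number>();
  for (const path of internalLinkPaths) {
    const seg = topSegment(path);
    counts.set(seg, (counts.get(seg) ?? 0) + 1);
  }
  return Math.max(...counts.values()) / links >= 0.4; // 40% share a top segment
}
```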
Cross-scan diff
Every scan auto-links to your most recent scan for the same site and shows "+N new findings, −N resolved." Use this to verify a fix actually moved the needle.
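Conceptually the diff is a set difference over finding keys (say, page URL plus finding type). A sketch:

```ts
function scanDiff(previous: Set<string>, current: Set<string>): { added: number; resolved: number } {
  let added = 0;
  for (const key of current) if (!previous.has(key)) added++;
  let resolved = 0;
  for (const key of previous) if (!current.has(key)) resolved++;
  return { added, resolved }; // rendered as "+N new findings, -N resolved"
}
```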
URL-pattern roll-up
We group pages by URL pattern (/blog/*, /products/*, /tag/*) and report flag distribution per group. Often shows that one section is dragging the whole site down.
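A sketch of the roll-up, bucketing by first path segment; the real pattern derivation is likely smarter than this:

```ts
function rollUpByPattern(findings: { url: string; type: string }[]): Map<string, number> {
  const perPattern = new Map<string, number>();
  for (const finding of findings) {
    const seg = new URL(finding.url).pathname.split("/").filter(Boolean)[0] ?? "";
    const pattern = `/${seg}/*`; // e.g. /blog/*, /products/*, /tag/*
    perPattern.set(pattern, (perPattern.get(pattern) ?? 0) + 1);
  }
  return perPattern;
}
```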
How a scan works
- Submit your site URL. Paste your domain (e.g. example.com) on the CrawlAudit homepage. No signup needed for the 100-URL preview.
- We fetch your sitemap. CrawlAudit pulls /sitemap.xml (or /sitemap_index.xml) to discover the URLs to scan. For sites without a sitemap, we crawl from the homepage.
- Chunked crawl. A serverless worker claims 200 URLs at a time, fetches them 20 at a time, scores each page, and fire-and-forgets the next chunk; a 5,000-URL site takes 10-15 minutes. (See the sketch after this list.)
- Cross-page checks at finalize. When the queue empties, we run duplicate-title detection, hreflang reciprocity, canonical-chain detection, og:image HEAD-check, and site-level robots.txt + llms.txt checks.
- Fix recommendations are computed. Findings are aggregated by type and ranked by total risk reduction across the site. The top 10 appear in the report header.
- Download the report. Subscribers download Markdown (human-readable) and CSV (spreadsheet-ready) reports with every flagged page, the why behind each finding, and the fix.
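For the curious, here is roughly what one worker pass from the "Chunked crawl" step looks like. claimChunk, scorePage, and enqueueNextChunk are stand-ins for internals that aren't public:

```ts
const CHUNK_SIZE = 200;  // URLs claimed per worker invocation
const CONCURRENCY = 20;  // simultaneous fetches

async function crawlChunk(
  claimChunk: (n: number) => Promise<string[]>,
  scorePage: (url: string) => Promise<void>,
  enqueueNextChunk: () => void
): Promise<void> {
  const urls = await claimChunk(CHUNK_SIZE);
  if (urls.length === 0) return; // queue empty: the finalize step runs cross-page checks
  for (let i = 0; i < urls.length; i += CONCURRENCY) {
    await Promise.all(urls.slice(i, i + CONCURRENCY).map(scorePage));
  }
  enqueueNextChunk(); // fire-and-forget the next chunk
}
```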
Frequently asked questions
Does CrawlAudit execute JavaScript like Googlebot does?
No. We fetch and score static HTML only: the same first-pass view Googlebot gets before its render queue. That's deliberate; it's how we catch the broken-SSR "Loading…" pages described above, whose static HTML comes back nearly empty.
Can I trust the LOW_UNIQUE flag?
Yes, with one built-in caveat: we downgrade it to info severity when a page has a self-canonical and is in the sitemap, since that combination is a strong signal Google has already chosen to crawl the page. The flag still appears, so you can prioritise enrichment if you want, but it stops dragging your overall risk score down.
How big a site can you scan?
The no-signup preview covers 100 URLs. Full scans use the chunked crawl described above, so larger sites are mainly a matter of time: 5,000 URLs takes 10-15 minutes.
Will CrawlAudit hammer my server?
No. A scan fetches at most 20 URLs concurrently (see "Chunked crawl" above), and every request sends the User-Agent CrawlAuditBot/1.0 so you can identify or block us if needed.