CrawlAudit

A bot that audits your site like Google does

See exactly which pages Google's Helpful Content System would flag.

Free preview scans the first 100 URLs. No signup required to preview.

35+
distinct finding types
5,000
URLs per scan
10–15 min
full-site scan time
$9.99
per month, cancel anytime

What a real scan finds — seennabis.com, 100-URL preview

A live preview scan against one cannabis marketplace surfaced this, in under a minute. Every finding links to the offending page, the rule it tripped, and a fix.

48

ORPHAN_PAGE — no other scanned page links to them. Google deprioritises pages with no internal support.

33

NO_H1 — every page should have exactly one. Mobile + desktop duplicates are the usual cause.

31

NEAR_DUPLICATE — SimHash distance ≤3. Strong cannibalisation signal between near-identical templated pages.

65

HEADING_SKIP — H1 → H3 with no intermediate H2. Common after blog template refactors.

26

IMG_NO_DIMENSIONS — missing width/height. Cumulative Layout Shift hits Core Web Vitals.

6

VERY_THIN — under 150 words. Two of these were SSR-bailout pages serving Googlebot only the literal text "Loading…".

The full report ranks these by total risk reduction so the highest-leverage fixes surface first. Subscribers download it as Markdown and CSV.
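
To make that ranking concrete, here is a minimal sketch, in Python, of what a finding record and the risk-reduction ranking could look like. The field names and the severity weights are illustrative assumptions, not CrawlAudit's actual schema or numbers.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Finding:
    url: str          # page the finding was raised on
    type: str         # e.g. "NO_H1", "NEAR_DUPLICATE"
    severity: str     # "critical" | "warn" | "info"
    value: str        # what was observed, e.g. "title is 47 chars"
    threshold: str    # what it was compared against, e.g. "60 chars"
    why: str          # one-line explanation of why it matters
    fix: str          # suggested remediation

# Illustrative severity weights, not CrawlAudit's real numbers.
RISK = {"critical": 10, "warn": 5, "info": 1}

def rank_by_risk_reduction(findings: list[Finding]) -> list[tuple[str, int]]:
    """Aggregate findings by type and rank by the total risk removed if fixed."""
    totals: dict[str, int] = defaultdict(int)
    for f in findings:
        totals[f.type] += RISK.get(f.severity, 1)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```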

Every check we run, grouped by intent

Each finding carries a severity (critical / warn / info), the value we observed, the threshold we compared against, and a fix suggestion. Below is the full set — no surprises hiding behind a paywall.

Content quality

  • VERY_THIN — Pages under 150 words (60 on listing pages)
  • THIN — Pages under 400 words (180 on listing pages)
  • LOW_UNIQUE — Distinct-word ratio under 30%
  • AI_TELLS / HEAVY_AI_TELLS — 40+ stock-AI phrases; three hits warn, five flag heavy
  • STALE_YEAR — Outdated year in title, e.g. a 2024 listicle still indexed in 2026
  • SOFT_404 — Returns 200 but body looks like a not-found page
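
A rough sketch of how the word-count and phrase thresholds above could be applied to extracted page text. The phrase list is a placeholder subset, and the listing-page flag and function name are assumptions made for illustration.

```python
import re

# Placeholder subset; the real list has 40+ stock phrases.
AI_TELL_PHRASES = ["in today's fast-paced world", "it's important to note", "delve into"]

def content_quality_findings(text: str, is_listing: bool = False) -> list[str]:
    words = re.findall(r"[\w'-]+", text.lower())
    findings = []

    # VERY_THIN / THIN word-count thresholds, lower on listing pages
    very_thin, thin = (60, 180) if is_listing else (150, 400)
    if len(words) < very_thin:
        findings.append("VERY_THIN")
    elif len(words) < thin:
        findings.append("THIN")

    # LOW_UNIQUE: distinct-word ratio under 30%
    if words and len(set(words)) / len(words) < 0.30:
        findings.append("LOW_UNIQUE")

    # AI_TELLS: three phrase hits warn, five or more flag heavy
    hits = sum(text.lower().count(p) for p in AI_TELL_PHRASES)
    if hits >= 5:
        findings.append("HEAVY_AI_TELLS")
    elif hits >= 3:
        findings.append("AI_TELLS")

    return findings
```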

Page structure

  • NO_H1 / MULTI_H1 — Missing or duplicate H1 (mobile + desktop dupes are common)
  • HEADING_SKIP — H1 → H3 with no H2 in between
  • REPEATED_HEADING — Multiple identical H2 sections on one page
  • NO_TITLE / SHORT_TITLE / LONG_TITLE — Missing, under 15 chars, or over 65 chars
  • NO_META_DESC / SHORT_META_DESC — Missing or under 50 chars; Google generates its own, usually poorly
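
For illustration, a minimal heading and title audit over raw HTML, using only the standard library. The thresholds mirror the list above; the class and function names and the parsing approach are assumptions, not CrawlAudit's implementation.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collects heading levels and the <title> text in document order."""
    def __init__(self):
        super().__init__()
        self.levels: list[int] = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.levels.append(int(tag[1]))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def structure_findings(html: str) -> list[str]:
    p = HeadingCollector()
    p.feed(html)
    findings = []

    h1_count = p.levels.count(1)
    if h1_count == 0:
        findings.append("NO_H1")
    elif h1_count > 1:
        findings.append("MULTI_H1")

    # HEADING_SKIP: a heading jumps more than one level deeper than the previous one
    for prev, cur in zip(p.levels, p.levels[1:]):
        if cur > prev + 1:
            findings.append("HEADING_SKIP")
            break

    title = p.title.strip()
    if not title:
        findings.append("NO_TITLE")
    elif len(title) < 15:
        findings.append("SHORT_TITLE")
    elif len(title) > 65:
        findings.append("LONG_TITLE")
    return findings
```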

Indexing & canonical

  • CANONICAL_MISMATCH — Canonical points to a different URL
  • CANONICAL_TO_NOINDEX — Canonical target is noindex, so Google indexes neither
  • CANONICAL_CHAIN — A→B→C chain; Google may stop following
  • NOINDEX_IN_SITEMAP — Page is in the sitemap but has noindex, a contradiction
  • META_ROBOTS_CONFLICT — index and noindex both present
  • META_NOFOLLOW / NOARCHIVE / UNAVAILABLE_AFTER — Less-known robots directives, reported when present
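
A simplified sketch of how the canonical and indexing contradictions above could be detected once each page's canonical target and robots data have been collected. The input shapes and function name are assumptions made for the example.

```python
def canonical_findings(url: str,
                       canonical_of: dict[str, str],
                       noindex: set[str],
                       sitemap_urls: set[str]) -> list[str]:
    """Canonical and indexing checks over already-crawled data.
    canonical_of maps each URL to its declared canonical (itself if self-canonical)."""
    findings = []
    target = canonical_of.get(url, url)

    if target != url:
        findings.append("CANONICAL_MISMATCH")
        if target in noindex:
            findings.append("CANONICAL_TO_NOINDEX")
        # CANONICAL_CHAIN: the canonical target itself canonicalises elsewhere (A -> B -> C)
        if canonical_of.get(target, target) != target:
            findings.append("CANONICAL_CHAIN")

    if url in sitemap_urls and url in noindex:
        findings.append("NOINDEX_IN_SITEMAP")

    return findings
```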

Schema & rich results

  • BROKEN_JSON_LD — JSON-LD block fails to parse, so Google extracts nothing
  • DUPLICATE_JSON_LD — Two Organization blocks (page + layout) is the classic example
  • OG_IMAGE_404 — og:image URL returns 4xx, so social shares render blank
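
A hedged sketch of these schema checks: parse each JSON-LD block, flag parse failures and repeated @type values, and HEAD-check the og:image URL. The regex-based extraction and helper names are illustrative, not the production parser.

```python
import json
import re
import urllib.error
import urllib.request

JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.S | re.I,
)

def schema_findings(html: str) -> list[str]:
    findings, types_seen = [], []
    for block in JSON_LD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            findings.append("BROKEN_JSON_LD")   # nothing can be extracted from this block
            continue
        for node in (data if isinstance(data, list) else [data]):
            if isinstance(node, dict) and node.get("@type"):
                types_seen.append(str(node["@type"]))
    if len(types_seen) != len(set(types_seen)):   # e.g. two Organization blocks
        findings.append("DUPLICATE_JSON_LD")
    return findings

def og_image_404(og_image_url: str) -> bool:
    """HEAD-check the og:image URL; True means it returned a 4xx."""
    req = urllib.request.Request(og_image_url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return 400 <= resp.status < 500
    except urllib.error.HTTPError as exc:
        return 400 <= exc.code < 500
```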

Cross-page architecture

  • ORPHAN_PAGE — No other scanned page links here; Google deprioritises such pages
  • DEEP_PAGE — More than 3 clicks from the homepage via internal links
  • NEAR_DUPLICATE — SimHash distance ≤3; a cannibalisation candidate
  • DUPLICATE_TITLE — Identical title on multiple pages (excluding pagination)
  • HREFLANG_ASYMMETRY — Same-host alternates that don't reciprocate
  • LINK_BLOAT / LINK_STARVED — Over 200 or under 3 internal links per page
  • GENERIC_ANCHOR — 'Click here' / 'read more' as a meaningful share of links
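
The near-duplicate check is easy to picture with a small SimHash sketch: fingerprint each page's text into 64 bits, then pair up URLs whose fingerprints differ by 3 bits or fewer. The hashing choices below are illustrative; only the distance threshold comes from the check above.

```python
import hashlib
import re

def simhash64(text: str) -> int:
    """64-bit SimHash over word tokens."""
    vector = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            vector[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if vector[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def near_duplicates(pages: dict[str, str], max_distance: int = 3) -> list[tuple[str, str]]:
    """Pairs of URLs whose fingerprints differ by <= max_distance bits.
    Brute-force O(n^2) comparison; fine for a sketch, not for huge sites."""
    fp = {url: simhash64(text) for url, text in pages.items()}
    urls = sorted(fp)
    return [(a, b) for i, a in enumerate(urls) for b in urls[i + 1:]
            if hamming(fp[a], fp[b]) <= max_distance]
```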

Site-wide hygiene

  • NO_ROBOTS_TXT / ROBOTS_NO_SITEMAP — robots.txt missing or doesn't declare a Sitemap:
  • ROBOTS_DISALLOW_ALL — 'Disallow: /' is blocking the whole site
  • STALE_LLMS_TXT — TLD-locale mismatch, e.g. a .de site mentions Canadian content
  • IMG_NO_ALT / IMG_DUPLICATE_ALT — Image alt audit, including cross-image duplicates
  • IMG_NO_DIMENSIONS — Images without width/height, a Cumulative Layout Shift risk
  • NO_IMAGES — Long-form page (>300 words) with no images at all
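
A minimal sketch of the robots.txt portion of these checks, operating on the raw file contents. It ignores user-agent grouping for brevity, so treat it as an approximation rather than the real parser.

```python
import re

def robots_findings(robots_txt: str | None) -> list[str]:
    """Site-wide robots.txt checks; None means the file was missing."""
    if robots_txt is None:
        return ["NO_ROBOTS_TXT"]
    findings = []
    lines = [line.strip() for line in robots_txt.splitlines()]

    # ROBOTS_NO_SITEMAP: no "Sitemap:" declaration anywhere in the file
    if not any(line.lower().startswith("sitemap:") for line in lines):
        findings.append("ROBOTS_NO_SITEMAP")

    # ROBOTS_DISALLOW_ALL: a bare "Disallow: /" blocks the whole site
    if any(re.fullmatch(r"disallow:\s*/", line, re.I) for line in lines):
        findings.append("ROBOTS_DISALLOW_ALL")

    return findings
```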

Composite: every page also gets an indexability score (0–100) rolling all signals into one number — useful for sorting big result sets.
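
One way such a composite could be computed is sketched below. The penalty values are invented for illustration; only the anchor points described elsewhere on this page (noindex scores 0, soft 404s score 5, thin pages 10–40, clean pages 95+) come from the product description, and the NOINDEX finding name is an assumption.

```python
def indexability_score(findings: set[str]) -> int:
    """Roll a page's findings into one sortable 0-100 number (illustrative weights)."""
    if "NOINDEX" in findings:        # hypothetical finding name for a noindex directive
        return 0
    if "SOFT_404" in findings:
        return 5
    score = 100
    penalties = {"VERY_THIN": 85, "THIN": 60, "NO_H1": 10, "NO_TITLE": 10,
                 "CANONICAL_MISMATCH": 15, "ORPHAN_PAGE": 10}
    for finding in findings:
        score -= penalties.get(finding, 2)   # small default penalty for anything else
    return max(0, min(100, score))
```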

How a scan works

  1. Submit your URL

    Paste a domain. CrawlAudit pulls /sitemap.xml — and falls back to /sitemap_index.xml or /sitemap-0.xml — to discover URLs. Same-host filtering keeps the scan focused on the property you submitted.

  2. Chunked crawl

    A worker claims 200 URLs at a time and fetches them at 20 concurrent requests. We use the user-agent CrawlAuditBot/1.0, follow up to 5 redirects, never execute JavaScript, and never accept cookies. A 5,000-URL scan finishes in 10–15 minutes.

  3. Per-page scoring

    Each page produces structured findings: type, severity, value, threshold, why, fix. Findings are severity-weighted, and because each carries the observed value and threshold, the report says 'Title 47 chars (Google truncates at 60)' rather than just labelling something SHORT_TITLE.

  4. Cross-page + site-level checks

    When the queue empties we run duplicate-title detection, hreflang reciprocity, canonical-chain detection, og:image HEAD-checks, SimHash near-duplicate clustering, BFS crawl-depth from your homepage, orphan-page detection, and robots.txt + llms.txt inspection. A sketch of the crawl-depth and orphan pass follows this list.

  5. Fix recommendations

    Findings aggregated by type and ranked by total risk reduction. The top 10 surface at the report header — 'Fix duplicate titles on 47 pages → risk drops 235' beats a 4,000-line CSV.

  6. Download

    Subscribers download the full Markdown report (human-readable, includes every finding's diagnosis + fix) and CSV (spreadsheet-ready). Re-run after fixes ship: baseline diff shows +new / −resolved.
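
As referenced in step 4, here is a compact sketch of the crawl-depth and orphan pass: a BFS over the internal-link graph from the homepage, then flags for pages more than 3 clicks deep or with no inbound internal links. The input shapes and names are assumptions made for the example.

```python
from collections import deque

def crawl_depths(homepage: str, links: dict[str, set[str]]) -> dict[str, int]:
    """BFS over the internal-link graph: clicks from the homepage to each page."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, set()):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def graph_findings(all_pages: set[str], homepage: str,
                   links: dict[str, set[str]]) -> dict[str, list[str]]:
    depths = crawl_depths(homepage, links)
    linked_to = {t for targets in links.values() for t in targets}
    out: dict[str, list[str]] = {}
    for page in all_pages:
        flags = []
        if page != homepage and page not in linked_to:
            flags.append("ORPHAN_PAGE")   # no other scanned page links here
        if depths.get(page, 99) > 3:      # unreachable pages are treated as maximally deep
            flags.append("DEEP_PAGE")     # more than 3 clicks from the homepage
        if flags:
            out[page] = flags
    return out
```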

Versus other crawlers

Generalist SEO tools cover technical signals (broken links, status codes) well. They don't score content quality against Helpful Content signals. We do.

| Static HTML scanning                      | CrawlAudit        | Screaming Frog | Sitebulb | Ahrefs Site Audit |
|-------------------------------------------|-------------------|----------------|----------|-------------------|
| Thin-content + listing-page-aware scoring | Yes               | No             | Partial  | No                |
| AI-tell phrase detection                  | Yes (40+ phrases) | No             | No       | No                |
| SimHash near-duplicate clustering         | Yes               | No             | No       | Partial           |
| BFS crawl depth + orphan detection        | Yes               | Yes            | Yes      | Yes               |
| Canonical-chain + canonical-to-noindex    | Yes               | Partial        | Yes      | Yes               |
| og:image 404 check via HEAD               | Yes               | No             | No       | No                |
| Stale-llms.txt detection                  | Yes               | No             | No       | No                |
| Per-finding why + suggested fix           | Yes               | No             | Yes      | Partial           |
| Baseline diff between scans               | Yes               | Partial        | Yes      | Yes               |
| Runs in the cloud (no laptop required)    | Yes               | No             | No       | Yes               |
| Price                                     | $9.99/mo          | $259/year      | $15+/mo  | $129+/mo          |

Frequently asked questions

Does CrawlAudit execute JavaScript like Googlebot does?
No — static HTML only, by design. Googlebot's first-pass index uses static HTML; the rendered DOM gets a second pass much later. Most ranking decisions happen on the first pass. If your content is invisible to us, it's invisible to Google's initial crawl too — and that's exactly what we want to surface.
How is this different from Screaming Frog or Sitebulb?
Those tools cover technical SEO — broken links, status codes, redirect chains. CrawlAudit specifically scores content quality against Helpful Content signals: thin pages, AI footprints, templated intros, near-duplicates, missing structured data. We score what they don't. We also run in the cloud, so a 5,000-URL scan doesn't tie up your laptop for 20 minutes.
Why should I trust the AI-tell detector?
It detects pattern-presence, not 'AI vs human.' We don't claim to identify which model wrote what. We flag 40+ stock phrases that real human editors strip out — filler transitions, generic summary tags, vague intensifiers. If your page hits five of them, it reads like an unedited draft, whether a person, a model, or both produced it. The fix is the same: edit it. The full pattern list is on /about.
Will the bot hammer my server?
No. 20 concurrent requests max, 15-second per-request timeout. A 5,000-URL scan is roughly equivalent to one user browsing for 15 minutes. The user-agent is CrawlAuditBot/1.0 — block or rate-limit if you want.
What about indexability — can I see which pages Google won't rank?
Every page gets an indexability score from 0 to 100. Pages with noindex score 0. Soft 404s score 5. Thin pages score 10-40. Pages with clean structure, sitemap inclusion, and a self-canonical score 95+. The composite rolls up signals into one number you can sort by.
Can I scan a site bigger than 5,000 URLs?
Rotate scans across sections by pointing CrawlAudit at sub-sitemaps separately (e.g. /sitemaps/blog.xml then /sitemaps/products.xml). Email us if you need a higher cap built into the plan.
Do unused scans carry over?
No — the 5-scan allotment resets monthly. Most sites scan once after a content push, then once more two weeks later to verify fixes.

Stop guessing. Get the list.

You don't need another opinion on whether your content is "good enough." You need the specific 47 pages Google would skip, ranked by which fixes move the needle most.

Free 100-URL preview, no card, no signup. $9.99/month unlocks the full site.