A bot that audits your site like Google does
See exactly which pages Google's Helpful Content System would flag.
Free preview scans the first 100 URLs. No signup required to preview.
What a real scan finds — seennabis.com, 100-URL preview
A live preview scan against one cannabis marketplace surfaced all of this in under a minute. Every finding links to the offending page, the rule it tripped, and a suggested fix.
ORPHAN_PAGE — no other scanned page links to it. Google deprioritises pages with no internal support.
NO_H1 — every page should have exactly one. Mobile + desktop duplicates are the usual cause.
NEAR_DUPLICATE — SimHash distance ≤3. Strong cannibalisation signal between near-identical templated pages.
HEADING_SKIP — H1 → H3 with no intermediate H2. Common after blog template refactors.
IMG_NO_DIMENSIONS — missing width/height. Cumulative Layout Shift hits Core Web Vitals.
VERY_THIN — under 150 words. Two of these were SSR-bailout pages serving Googlebot only the literal text "Loading…".
The full report ranks these by total risk reduction so the highest-leverage fixes surface first. Subscribers download it as Markdown and CSV.
Every check we run, grouped by intent
Each finding carries a severity (critical / warn / info), the value we observed, the threshold we compared against, and a fix suggestion. Below is the full set — no surprises hiding behind a paywall.
Content quality
- **VERY_THIN** — Pages under 150 words (60 on listing pages)
- **THIN** — Pages under 400 words (180 on listing pages)
- **LOW_UNIQUE** — Distinct-word ratio under 30%
- **AI_TELLS / HEAVY_AI_TELLS** — 40+ stock-AI phrases; three hits warn, five flag heavy
- **STALE_YEAR** — Outdated year in title, e.g. a 2024 listicle still indexed in 2026
- **SOFT_404** — Returns 200 but body looks like a not-found page
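As a sketch, the VERY_THIN / THIN thresholds above reduce to a small classifier. The function name is hypothetical; only the word-count thresholds come from the list.

```python
def classify_thinness(word_count, is_listing=False):
    """Return "VERY_THIN", "THIN", or None for a page's visible word count.

    Listing pages get lower bars (60 / 180) than regular pages (150 / 400),
    per the thresholds listed above.
    """
    very_thin = 60 if is_listing else 150
    thin = 180 if is_listing else 400
    if word_count < very_thin:
        return "VERY_THIN"
    if word_count < thin:
        return "THIN"
    return None
```

A listing page at 200 words passes, while a regular page at the same count is flagged THIN.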
Page structure
- **NO_H1 / MULTI_H1** — Missing or duplicate H1 (mobile + desktop dupes are common)
- **HEADING_SKIP** — H1 → H3 with no H2 in between
- **REPEATED_HEADING** — Multiple identical H2 sections on one page
- **NO_TITLE / SHORT_TITLE / LONG_TITLE** — Missing, under 15 chars, or over 65 chars
- **NO_META_DESC / SHORT_META_DESC** — Missing or under 50 chars — Google generates its own, usually poorly
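The HEADING_SKIP check boils down to walking the page's heading levels in document order and flagging any jump of more than one level downward. A minimal sketch (function name hypothetical):

```python
def heading_skips(levels):
    """Given heading levels in document order (e.g. [1, 3, 2] for H1, H3, H2),
    return (from_level, to_level) pairs where a level was skipped going down."""
    skips = []
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:  # e.g. H1 followed directly by H3
            skips.append((prev, cur))
    return skips
```

An H1 → H3 page yields `(1, 3)`; a clean H1 → H2 → H3 outline yields nothing.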
Indexing & canonical
- **CANONICAL_MISMATCH** — Canonical points to a different URL
- **CANONICAL_TO_NOINDEX** — Canonical target is noindex — Google indexes neither
- **CANONICAL_CHAIN** — A→B→C chain; Google may stop following
- **NOINDEX_IN_SITEMAP** — Page is in sitemap but has noindex — contradiction
- **META_ROBOTS_CONFLICT** — index and noindex both present
- **META_NOFOLLOW / NOARCHIVE / UNAVAILABLE_AFTER** — Less-known robots directives reported
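The CANONICAL_CHAIN check can be sketched as following rel=canonical hops through a url → canonical map until hitting a self-canonical page or a loop. Names here are illustrative, not CrawlAudit's internals:

```python
def canonical_chain(url, canonicals):
    """Follow rel=canonical hops from url; returns the chain including url.

    canonicals maps each scanned URL to its declared canonical. A result
    longer than two entries means an A->B->C chain, which Google may stop
    following partway through.
    """
    chain, seen = [url], {url}
    while chain[-1] in canonicals:
        nxt = canonicals[chain[-1]]
        if nxt == chain[-1] or nxt in seen:  # self-canonical or loop: stop
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain
```

Given `{"/a": "/b", "/b": "/c", "/c": "/c"}`, scanning `/a` surfaces the three-hop chain.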
Schema & rich results
- **BROKEN_JSON_LD** — JSON-LD block fails to parse — Google extracts nothing
- **DUPLICATE_JSON_LD** — Two Organization blocks (page + layout) is the classic example
- **OG_IMAGE_404** — og:image URL returns 4xx — social shares render blank
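The two JSON-LD checks amount to: does each `<script type="application/ld+json">` block parse, and does any `@type` appear twice? A simplified sketch (real handling of `@graph` and arrays is more involved):

```python
import json

def audit_json_ld(blocks):
    """Flag unparseable JSON-LD blocks and duplicate @type declarations.

    blocks is the raw text of each JSON-LD script tag on one page.
    """
    findings, seen_types = [], set()
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            findings.append("BROKEN_JSON_LD")
            continue
        t = data.get("@type") if isinstance(data, dict) else None
        if t:
            if t in seen_types:
                findings.append("DUPLICATE_JSON_LD")
            seen_types.add(t)
    return findings
```

Two Organization blocks plus one unparseable block would yield one DUPLICATE_JSON_LD and one BROKEN_JSON_LD finding.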
Cross-page architecture
- **ORPHAN_PAGE** — No other scanned page links here — Google deprioritises
- **DEEP_PAGE** — More than 3 clicks from the homepage via internal links
- **NEAR_DUPLICATE** — SimHash distance ≤3 — cannibalisation candidate
- **DUPLICATE_TITLE** — Identical title on multiple pages (excluding pagination)
- **HREFLANG_ASYMMETRY** — Same-host alternates that don't reciprocate
- **LINK_BLOAT / LINK_STARVED** — Over 200 or under 3 internal links per page
- **GENERIC_ANCHOR** — 'Click here' / 'read more' as a meaningful share of links
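For context on the NEAR_DUPLICATE threshold: SimHash fingerprints two pages into 64-bit integers such that similar token sets produce similar bits, and "distance ≤3" means the fingerprints differ in at most 3 bits. A textbook sketch, not CrawlAudit's exact tokenisation or weighting:

```python
import hashlib

def simhash(tokens, bits=64):
    """Classic SimHash: each token's hash votes per bit; the sign wins."""
    v = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicate(a, b, threshold=3):
    """Fingerprints within 3 bits are flagged NEAR_DUPLICATE."""
    return hamming(a, b) <= threshold
```

Templated pages that differ only in a few tokens land within a few bits of each other, which is why near-identical listing templates cluster together.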
Site-wide hygiene
- **NO_ROBOTS_TXT / ROBOTS_NO_SITEMAP** — robots.txt missing or doesn't declare a Sitemap:
- **ROBOTS_DISALLOW_ALL** — 'Disallow: /' is blocking the whole site
- **STALE_LLMS_TXT** — TLD-locale mismatch in llms.txt — e.g. a .de site mentioning Canadian content
- **IMG_NO_ALT / IMG_DUPLICATE_ALT** — Image alt audit, including cross-image duplicates
- **IMG_NO_DIMENSIONS** — Images without width/height — Cumulative Layout Shift risk
- **NO_IMAGES** — Long-form page (>300 words) with no images at all
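The robots.txt checks are simple line scans. A sketch, assuming the file has already been fetched (the real check also scopes Disallow rules to their user-agent group, which this ignores):

```python
def robots_findings(robots_txt):
    """Return robots.txt hygiene findings; robots_txt is the raw file
    text, or None if the fetch 404'd."""
    if robots_txt is None:
        return ["NO_ROBOTS_TXT"]
    findings = []
    lines = [line.strip() for line in robots_txt.splitlines()]
    if not any(line.lower().startswith("sitemap:") for line in lines):
        findings.append("ROBOTS_NO_SITEMAP")
    if any(line.replace(" ", "").lower() == "disallow:/" for line in lines):
        findings.append("ROBOTS_DISALLOW_ALL")
    return findings
```

Note that a bare `Disallow:` (allow everything) is fine; only `Disallow: /` blocks the site.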
Composite: every page also gets an indexability score (0–100) rolling all signals into one number — useful for sorting big result sets.
How a scan works
1. Submit your URL
Paste a domain. CrawlAudit pulls /sitemap.xml — and falls back to /sitemap_index.xml or /sitemap-0.xml — to discover URLs. Same-host filtering keeps the scan focused on the property you submitted.
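The discovery fallback and same-host filter can be sketched in a few lines. The constant and function names are illustrative; only the candidate paths and fallback order come from the description above:

```python
from urllib.parse import urljoin, urlparse

# Tried in this order; first one that returns a valid sitemap wins.
SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-0.xml"]

def sitemap_candidates(domain):
    """Candidate sitemap URLs to try, in order, for a submitted domain."""
    base = domain if domain.startswith("http") else "https://" + domain
    return [urljoin(base, path) for path in SITEMAP_PATHS]

def same_host(url, submitted_host):
    """Same-host filter: drop discovered URLs that leave the property."""
    return urlparse(url).netloc == submitted_host
```

The HTTP fetch itself is omitted; this only shows the candidate order and the filter that keeps CDN and third-party URLs out of the scan.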
2. Chunked crawl
A worker claims 200 URLs at a time and fetches them with 20 concurrent connections. We use the user-agent CrawlAuditBot/1.0, follow up to 5 redirects, never execute JavaScript, and never accept cookies. A 5,000-URL scan finishes in 10–15 minutes.
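The chunking step is straightforward to sketch (the 20-way fetch inside each chunk would typically be bounded with something like `asyncio.Semaphore(20)`, omitted here):

```python
def claim_chunks(urls, chunk_size=200):
    """Split the discovered URL list into worker-sized chunks of 200.

    Each chunk is claimed atomically by one worker, so a crashed worker
    only loses its own chunk, not the whole scan.
    """
    return [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
```

A 450-URL site yields three chunks of 200, 200, and 50.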
3. Per-page scoring
Each page produces structured findings: type, severity, value, threshold, why, and fix. Each finding is severity-weighted and carries its context, so the report says 'Title 47 chars (Google truncates at 60)' rather than just labelling something SHORT_TITLE.
4. Cross-page + site-level checks
When the queue empties we run duplicate-title detection, hreflang reciprocity, canonical-chain detection, og:image HEAD-checks, SimHash near-duplicate clustering, BFS crawl-depth from your homepage, orphan-page detection, robots.txt + llms.txt inspection.
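The crawl-depth and orphan checks from that list can be sketched over an internal-link graph. A minimal version, assuming links have already been extracted per page:

```python
from collections import deque

def crawl_depths(home, links):
    """BFS over the internal-link graph from the homepage.

    links maps each page to the internal URLs it links to. Returns
    clicks-from-home per reachable page; pages more than 3 deep are
    DEEP_PAGE candidates.
    """
    depths, queue = {home: 0}, deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

def orphans(scanned, links):
    """Scanned pages that no other scanned page links to."""
    linked_to = {t for targets in links.values() for t in targets}
    return [p for p in scanned if p not in linked_to]
```

A page present in the sitemap but absent from every page's link set shows up in `orphans` even though it was crawled, which is exactly the ORPHAN_PAGE case.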
5. Fix recommendations
Findings are aggregated by type and ranked by total risk reduction. The top 10 surface in the report header — 'Fix duplicate titles on 47 pages → risk drops 235' beats a 4,000-line CSV.
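That ranking is an aggregation over severity weights. The weights below are hypothetical (CrawlAudit's actual weighting isn't published here), but they reproduce the arithmetic in the example: 47 warn-level duplicate titles at 5 points each is a 235-point risk drop.

```python
from collections import defaultdict

# Hypothetical severity weights for illustration only.
WEIGHTS = {"critical": 10, "warn": 5, "info": 1}

def top_fixes(findings, n=10):
    """Aggregate (finding_type, severity) pairs and rank finding types
    by the total risk removed if every instance were fixed."""
    risk = defaultdict(int)
    for ftype, severity in findings:
        risk[ftype] += WEIGHTS[severity]
    return sorted(risk.items(), key=lambda kv: -kv[1])[:n]
```

Sorting by total rather than per-page severity is what pushes one widespread warn-level issue above a handful of isolated criticals.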
6. Download
Subscribers download the full Markdown report (human-readable, with every finding's diagnosis and fix) and a CSV (spreadsheet-ready). Re-run after fixes ship: the baseline diff shows +new / −resolved findings.
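The baseline diff is a set difference over (url, finding_type) pairs from two scans. A sketch of that comparison:

```python
def baseline_diff(prev, cur):
    """Compare two scans' finding sets.

    prev and cur are sets of (url, finding_type) tuples; the result
    lists findings that appeared (+new) and disappeared (-resolved).
    """
    return {"new": sorted(cur - prev), "resolved": sorted(prev - cur)}
```

A finding present in both scans is neither new nor resolved, so unchanged pages drop out of the diff entirely.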
Versus other crawlers
Generalist SEO tools cover technical signals (broken links, status codes) well, but they don't score content quality against Helpful Content signals. We do.
| Static HTML scanning | CrawlAudit | Screaming Frog | Sitebulb | Ahrefs Site Audit |
|---|---|---|---|---|
| Thin-content + listing-page-aware scoring | Yes | No | Partial | No |
| AI-tell phrase detection | Yes (40+ phrases) | No | No | No |
| SimHash near-duplicate clustering | Yes | No | No | Partial |
| BFS crawl depth + orphan detection | Yes | Yes | Yes | Yes |
| Canonical-chain + canonical-to-noindex | Yes | Partial | Yes | Yes |
| og:image 404 check via HEAD | Yes | No | No | No |
| Stale-llms.txt detection | Yes | No | No | No |
| Per-finding why + suggested fix | Yes | No | Yes | Partial |
| Baseline diff between scans | Yes | Partial | Yes | Yes |
| Runs in the cloud (no laptop required) | Yes | No | No | Yes |
| Price | $9.99/mo | $259/year | $15+/mo | $129+/mo |
Frequently asked questions
Does CrawlAudit execute JavaScript like Googlebot does?
How is this different from Screaming Frog or Sitebulb?
Why should I trust the AI-tell detector?
Will the bot hammer my server?
What about indexability — can I see which pages Google won't rank?
Can I scan a site bigger than 5,000 URLs?
Do unused scans carry over?
Stop guessing. Get the list.
You don't need another opinion on whether your content is "good enough." You need the specific 47 pages Google would skip, ranked by which fixes move the needle most.
Free 100-URL preview, no card, no signup. $9.99/month unlocks the full site.