AI Crawler Signal Reliability: From Raw UA Strings to Verified Attribution
5.7% of requests bearing AI crawler user-agents are fake; for ChatGPT-User the rate is 1-in-6. Meanwhile client-side analytics captures none of it. Here is the four-layer verification stack.
5.7% of all requests carrying an AI crawler user-agent string in 2025 were fake — sent by scraping tools impersonating legitimate bots to bypass access policies. For ChatGPT-User specifically, the spoof ratio is 1-in-6, the highest of any AI crawler token tracked. At the same time, the client-side analytics stack that most engineering teams rely on captures exactly zero of those requests, real or fake, because AI crawlers never execute JavaScript. Your current AI traffic figures are likely wrong on two fronts simultaneously.
Method
Spoof-rate figures are from HUMAN Security's Satori Threat Intelligence team, which analysed inbound traffic across 16 AI crawler user-agent tokens during 2025. Bot-composition figures are from the Imperva 2026 Bad Bot Report, published April 2026, covering full-year 2025 traffic data across hundreds of billions of web requests. AI-traffic growth figures are also from the Imperva dataset.
1. The JavaScript Execution Gap
Client-side analytics — GA4, Plausible, Fathom — fire only when a tracking script executes inside a browser. AI crawlers fetch raw HTML over HTTP and terminate immediately. The script never loads. This is structural, not a configuration problem; no settings change fixes it.
GA4's built-in IAB/ABC International Spiders and Bots List filtering is irrelevant here. That list catches traditional crawlers (search engine bots, monitoring agents) that fire GA4's beacon — but AI crawlers never fire it. Teams reviewing clean GA4 dashboards are not seeing AI traffic filtered away cleanly. They are seeing a structural gap where no data was ever collected.
The data sources that do record AI crawler visits are:
- Server access logs (nginx, Apache, or CDN edge logs): capture every HTTP request regardless of whether JavaScript runs. Log parsing infrastructure — at minimum a log management platform with regex — is required to extract AI crawler lines from the full request firehose.
- CDN-level analytics: some edge network operators expose per-user-agent request counts in their analytics portals. Coverage and granularity vary by vendor.
- Server-log attribution tools: purpose-built products that ingest access logs and attribute AI crawler activity at the page and session level.
For any site that currently relies only on GA4 or a client-side equivalent, measured AI crawler traffic is literally zero — not low, not approximate, zero.
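As a minimal sketch of the log-parsing step, assuming the nginx/Apache "combined" log format and a hand-picked token list (both are assumptions; adjust them to your own log format and the crawlers you care about), the following counts requests per claimed AI crawler token. The names AI_TOKENS, match_token, and count_ai_requests are illustrative.

```python
import re
from collections import Counter

# Illustrative token list (an assumption, not exhaustive).
AI_TOKENS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot",
             "ClaudeBot", "PerplexityBot", "CCBot")

# Combined log format:
#   ip ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def match_token(user_agent):
    """Return the first known AI crawler token found in the UA string, else None."""
    ua = user_agent.lower()
    for token in AI_TOKENS:
        if token.lower() in ua:
            return token
    return None

def count_ai_requests(lines):
    """Count log lines per claimed AI crawler token.

    These are unverified counts: they include any request that merely
    claims to be the crawler.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        token = match_token(m.group("ua"))
        if token:
            counts[token] += 1
    return counts
```

Feeding it an open log file, for example count_ai_requests(open("access.log")), gives the raw per-token line counts that the rest of this article treats as upper bounds.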
2. The Spoofing Rate
Switching to server log analysis recovers visibility but introduces a second problem: user-agent headers are trivially writable by any HTTP client. A scraping tool can set its UA string to GPTBot/1.0 or ChatGPT-User with a single line of configuration.
HUMAN Security's Satori team analysed 16 AI crawler user-agent tokens across a broad inbound traffic sample in 2025. Across all 16 tokens, 5.7% of requests (roughly 1 in 18) did not originate from the claimed crawler: their source IPs resolved to unrelated infrastructure. ChatGPT-User was the most impersonated: 1 in every 6 requests bearing that string did not come from the legitimate crawler. The incentive is structural: sites that allowlist known AI crawlers create an arbitrage opportunity for any scraper willing to impersonate one.
If your analytics pipeline accepts user-agent strings at face value, roughly 1 in 18 AI crawler log lines you are counting is fabricated. For ChatGPT-User lines specifically, that ratio is 1 in 6.
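To make that arithmetic concrete, here is a small sketch that splits a raw UA-matched count into expected fabricated and expected genuine lines. The two rates are the sample-wide figures quoted above; applying them to a single site is an assumption, since your own spoof rate may be higher or lower. The function name estimate_fabricated is illustrative.

```python
# Sample-wide rates quoted in this article; your own property may differ.
OVERALL_SPOOF_RATE = 0.057        # 5.7% across all 16 tokens
CHATGPT_USER_SPOOF_RATE = 1 / 6   # 1 in 6 for ChatGPT-User

def estimate_fabricated(raw_count, spoof_rate):
    """Return (expected_fabricated, expected_genuine) for a raw UA-matched count."""
    fabricated = round(raw_count * spoof_rate)
    return fabricated, raw_count - fabricated
```

For example, calling estimate_fabricated(6000, CHATGPT_USER_SPOOF_RATE) on 6,000 raw ChatGPT-User lines suggests about 1,000 of them did not come from the real crawler. This is only a planning estimate; the verification layers described below replace it with a measured figure.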
3. Which Crawlers Offer Verifiable Signals
Not all AI crawlers present the same verification surface. They fall into three classes, two with a machine-checkable signal and one without:
Published CIDR range files. Several major crawler operators that want site owners to grant access publish machine-readable JSON files listing the IP ranges their crawlers use. You fetch these files, build a local allowlist, and match the source IP of each incoming request against it. OpenAI publishes three separate files: one for training crawls (GPTBot), one for search (OAI-SearchBot), and one for user-triggered requests (ChatGPT-User). PerplexityBot, Bingbot, and Common Crawl's CCBot each publish at least one range file. Polling frequency matters: ranges update without notice, so a daily fetch with a local cache is the minimum viable pattern.
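The matching step can be sketched with Python's standard ipaddress module. The schema assumed here (a top-level "prefixes" array whose entries carry an "ipv4Prefix" or "ipv6Prefix" key) is an assumption, so check each operator's actual file before relying on it. Fetching and daily caching are left out, and parse_range_file and ip_in_ranges are illustrative names.

```python
import ipaddress
import json

def parse_range_file(text):
    """Parse a published range file into a list of ip_network objects.

    Assumed schema (verify against the real file):
      {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
    """
    doc = json.loads(text)
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks

def ip_in_ranges(ip, networks):
    """True if the source IP falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)
```

A request whose UA string claims a crawler but whose source IP fails ip_in_ranges against that crawler's list is counted as spoofed. The linear scan is fine for lists of a few hundred prefixes; larger lists may justify a prefix tree.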
Reverse-DNS PTR confirmation. For crawlers that do not publish CIDR files, the canonical verification method is: reverse-DNS the source IP to get a hostname, verify that the hostname falls within the crawler operator's registered domain, then forward-DNS that hostname back to confirm it resolves to the same IP. This round-trip check (often called forward-confirmed reverse DNS) eliminates the vast majority of spoofed traffic without requiring a continuously refreshed allowlist. ClaudeBot is verified this way.
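A minimal sketch of that round trip, using only the standard socket module. The allowed domain suffixes are something you supply from each operator's documentation; none are assumed here, and the resolver functions are injectable so the logic can be tested without live DNS. The name verify_fcrdns is illustrative.

```python
import ipaddress
import socket

def _reverse(ip):
    """PTR lookup: return the primary hostname for an IP."""
    return socket.gethostbyaddr(ip)[0]

def _forward(hostname):
    """Forward lookup: return the set of IP strings a hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def _same_ip(a, b):
    """Compare two IP strings after normalisation; unparsable input never matches."""
    try:
        return ipaddress.ip_address(a) == ipaddress.ip_address(b)
    except ValueError:
        return False

def verify_fcrdns(ip, allowed_suffixes, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check for one source IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    suffixes = [s.lower().lstrip(".") for s in allowed_suffixes]
    # Require an exact match or a dot-delimited suffix, so that
    # "bot.example.evil.net" does not pass as "bot.example".
    if not any(hostname == s or hostname.endswith("." + s) for s in suffixes):
        return False
    try:
        return any(_same_ip(ip, candidate) for candidate in forward(hostname))
    except OSError:
        return False
```

The forward step matters: whoever controls an IP block also controls its PTR records, so a PTR hostname alone proves nothing until the operator's own zone confirms it maps back to the same address.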
No machine-verifiable signal. Some crawlers — particularly newer AI entrants and scrapers impersonating AI crawlers — neither publish CIDR files nor resolve to an identifiable operator domain. These can only be classified as unverified UA-string matches, which puts them in the same reliability class as raw log-line UA headers generally.
4. Building the Measurement Stack
A reliable AI crawler measurement pipeline stacks four layers, with server logs as the foundation:
- Layer 1: server access logs as the data substrate. Route all page requests through infrastructure that captures access logs. Client-side analytics cannot substitute at this layer.
- Layer 2: UA-string extraction as the first filter. Parse logs for known AI crawler tokens. Treat these counts as upper bounds before verification, because they include spoofed traffic.
- Layer 3: IP range file verification. For crawlers that publish JSON range files, check source IPs against daily-fetched allowlists. Treat requests outside the listed ranges as spoofed regardless of UA string.
- Layer 4: reverse-DNS confirmation for crawlers without range files. Run PTR and forward-DNS lookups on source IPs. Cache results locally, because live PTR lookups on every request add latency at volume.
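The local caching mentioned for DNS lookups can be as simple as a per-IP verdict cache with a time-to-live. This is a sketch under stated assumptions: the 24-hour default TTL is an arbitrary choice, the cache is in-memory and single-process, and TTLCache is an illustrative name rather than a library class.

```python
import time

class TTLCache:
    """Tiny in-memory cache for per-IP verification verdicts."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._store = {}        # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        """Return a fresh cached value for key, else call compute(key) and store it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = compute(key)
        self._store[key] = (now, value)
        return value
```

Because crawler traffic tends to come from a limited pool of addresses, caching the verdict per source IP means most log lines never trigger a DNS query at all.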
Applying layers 3 and 4 to raw log data will typically trim your apparent AI crawler count by 5–20%, depending on how much scraping activity is impersonating AI crawlers on your property. That trimmed figure is your verified AI crawler request count.
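How the verified count and the trim percentage fall out of the layers can be sketched as a simple tally. It assumes you have already reduced each log line to a (token, ip) pair and built one verification callable per token, either a range check or a DNS check; tokens with no callable are counted as unverified. The names tally and trim_percent are illustrative.

```python
from collections import Counter

def tally(records, verifiers):
    """Classify UA-matched requests.

    records:   iterable of (token, ip) pairs extracted from the logs
    verifiers: {token: callable(ip) -> bool}; tokens without an entry
               have no machine-verifiable signal
    """
    counts = Counter()
    for token, ip in records:
        check = verifiers.get(token)
        if check is None:
            counts["unverified"] += 1
        elif check(ip):
            counts["verified"] += 1
        else:
            counts["spoofed"] += 1
    return counts

def trim_percent(counts):
    """Share of checkable requests that failed verification, as a percentage."""
    checked = counts["verified"] + counts["spoofed"]
    return 100.0 * counts["spoofed"] / checked if checked else 0.0
```

Keeping the unverified bucket separate, instead of folding it into either total, makes it clear how much of the reported figure rests on UA strings alone.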
What This Means for Site Owners
The measurement gap is not symmetric. GA4's blind spot affects all AI crawler traffic equally — it is a hard structural zero that no configuration change fixes. The spoofing problem affects server log counts in proportion to how attractive your content is to unauthorized scrapers: high-value content, paywalled properties, and e-commerce catalogues see higher impersonation rates.
The practical sequence: confirm server log access, verify you can grep for known UA tokens and get non-zero line counts, then build the CIDR polling pipeline for crawlers that publish range files. Reverse-DNS confirmation adds a second verification layer on top. Running all four layers produces a verified count. Running only layer 2 (raw UA strings) inflates your figure by 5–20% above the actual volume, and still misses everything GA4 was supposed to measure.
AI-driven web traffic grew 187% in 2025. Getting the measurement right matters more as the scale increases.