Measurement · June 21, 2026

1 in 18 AI Crawler Requests Is Fake: What Your Server Logs Can and Cannot Tell You

5.7% of requests presenting an AI crawler user-agent string are spoofed, and ChatGPT-User runs at 1:5, or roughly one request in six. Here is what each server-side signal reliably detects, where each fails, and how to combine them.

One in every 18 HTTP requests presenting an AI crawler user-agent string does not come from the operator that declared it. HUMAN Security's Satori threat intelligence team analyzed traffic from 16 well-known AI crawlers and found 5.7% of those requests to be spoofed — fraudulent declarations from clients exploiting whatever preferential treatment operators extend to recognized bots. That baseline error rate sits in your server logs before any measurement begins. It is the noise floor in the portion of AI crawler traffic that announces itself.

Method

This analysis draws on HUMAN Security's 2026 State of AI Traffic & Cyberthreat Benchmark Report, covering traffic from 16 well-known AI crawlers observed across their detection network; Ahrefs' May 2026 analysis of llms.txt files across 137,000 domains measuring which bot categories actually fetch the file; and DataDome's AI traffic report covering request volumes from major AI crawler operators.

The Per-Crawler Spoof Rate Spread Is Wide

Spoof Rate by AI Crawler User-Agent String (2026)
Percentage of requests presenting each user-agent string that are fraudulent. Overall field average across 16 crawlers is 5.7%. Source: HUMAN Security 2026 State of AI Traffic & Cyberthreat Benchmark Report.

The 5.7% overall spoof rate conceals a wide range. ChatGPT-User, the user-agent string associated with live AI assistant browsing sessions rather than batch training crawls, carries a 1:5 spoof ratio: one fraudulent request for every five legitimate ones, or approximately one in six requests (16.7%). MistralAI-User runs at 1:37 (2.6% fake) and Perplexity-User at 1:88 (1.1%).
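The ratios and percentages above line up if each ratio is read as fraudulent-to-legitimate requests. That reading is an inference from the article's own figures, not something the report is quoted as stating. A minimal sketch of the conversion:

```python
def spoof_share(fake: int, legit: int) -> float:
    """Fraction of all requests that are fake, given a fake:legit ratio."""
    return fake / (fake + legit)

# Ratios as quoted in the text, read as fake:legitimate.
for name, legit in [("ChatGPT-User", 5), ("MistralAI-User", 37), ("Perplexity-User", 88)]:
    print(f"{name}: 1:{legit} -> {spoof_share(1, legit):.1%}")
# ChatGPT-User: 1:5 -> 16.7%
# MistralAI-User: 1:37 -> 2.6%
# Perplexity-User: 1:88 -> 1.1%
```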

The difference matters because operators routinely grant preferential treatment based on declared user-agent. If your rate limiter exempts ChatGPT-User traffic, or your access controls let ChatGPT-User bypass authentication on pages with public content, approximately 1 in 6 of those exempted requests comes from a client that is neither the declared operator's infrastructure nor a legitimate search crawler. The exemption was built for one purpose and is being exploited for another. By raw volume, Meta-ExternalAgent was the most impersonated user-agent in the first two months of 2026, with 16.4 million spoofed requests; ChatGPT-User accounted for 7.9 million.

IP Range and Reverse DNS Verification

The four crawlers with the largest combined AI crawler request share — GPTBot, ClaudeBot, PerplexityBot, and Google-Extended — all publish IP range JSON files at documented endpoints. Checking an incoming request's source IP against the vendor's published range is significantly more reliable than accepting the user-agent declaration alone.
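A range check of this kind can be sketched with Python's standard ipaddress module. The JSON shape below (a "prefixes" list with "ipv4Prefix" / "ipv6Prefix" keys) is an assumption for illustration, and the addresses are documentation ranges rather than real crawler IPs; check each vendor's actual file before relying on the key names.

```python
import ipaddress
import json

# Hypothetical sample in the assumed shape of a vendor range file.
# 192.0.2.0/24 and 2001:db8::/32 are reserved documentation ranges.
SAMPLE_RANGE_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}
"""

def load_networks(raw_json: str) -> list:
    """Parse a range file into ip_network objects (assumed schema)."""
    doc = json.loads(raw_json)
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def ip_in_ranges(ip: str, networks) -> bool:
    """True if the address falls inside any of the published networks."""
    addr = ipaddress.ip_address(ip)
    # Membership is False when the address and network IP versions differ.
    return any(addr in net for net in networks)

networks = load_networks(SAMPLE_RANGE_JSON)
print(ip_in_ranges("192.0.2.44", networks))    # True
print(ip_in_ranges("198.51.100.7", networks))  # False
```

In practice the range file would be fetched on a schedule and cached, so the per-request cost stays at an in-memory lookup.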

For crawlers that publish IP ranges, a single range-file check eliminates the bulk of spoofed requests. For crawlers without published ranges, forward-confirmed reverse DNS (FCrDNS) is the standard fallback: run a reverse lookup on the request IP to get a hostname, then verify that the hostname forward-resolves to the same IP. The identity is confirmed only when both conditions hold: the hostname belongs to the declared operator's registered domain, and the forward lookup returns the original IP. Bytespider, which ranked as the third-largest AI crawler by HTTP request volume in May 2026, publishes neither a current IP range file nor consistent reverse DNS infrastructure. Requests presenting the Bytespider user-agent cannot be confirmed beyond the string itself.
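The FCrDNS procedure can be sketched with the standard socket module. The function name, the allowed_suffixes parameter, and the injectable reverse / forward resolvers are illustrative choices, not part of any vendor's documentation; the injection exists so the logic can be exercised without live DNS.

```python
import ipaddress
import socket

def _system_reverse(ip):
    # PTR lookup via the system resolver; returns the primary hostname.
    return socket.gethostbyaddr(ip)[0]

def _system_forward(hostname):
    # A/AAAA lookup; collect every address the hostname resolves to.
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def fcrdns_verified(ip, allowed_suffixes,
                    reverse=_system_reverse, forward=_system_forward):
    """Forward-confirmed reverse DNS check for a single request IP."""
    try:
        target = ipaddress.ip_address(ip)
        hostname = reverse(ip).rstrip(".").lower()
        # The hostname must sit under a domain the declared operator controls.
        if not any(hostname == s or hostname.endswith("." + s)
                   for s in allowed_suffixes):
            return False
        # The forward lookup must return the original IP.
        resolved = {ipaddress.ip_address(a) for a in forward(hostname)}
        return target in resolved
    except (OSError, ValueError):
        # Lookup failures and unparsable addresses count as unverified.
        return False
```

Matching on "." plus the suffix, rather than a bare substring, keeps a hostname such as bot.example.com.attacker.test from passing a check for bot.example.com. DNS lookups are slow relative to a range check, so results are usually cached per IP.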

A three-tier identification stack covers the field: user-agent string match for initial log segmentation, IP range check against the vendor's published JSON where available, and FCrDNS for the remaining crawlers. The user-agent match only sorts requests into buckets; it verifies nothing. A request that passes neither the IP range check nor FCrDNS should be treated as unverified regardless of its declared identity.
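One way to wire the three tiers together is sketched below. The registry layout, the status labels, and the bot names ExampleBot and OtherBot are hypothetical; the FCrDNS check is passed in as a callable so the decision order stays visible and testable.

```python
import ipaddress

# Hypothetical registry: user-agent token -> published networks and
# reverse-DNS suffixes. Values are documentation placeholders.
CRAWLERS = {
    "ExampleBot": {
        "networks": [ipaddress.ip_network("192.0.2.0/24")],
        "dns_suffixes": [],
    },
    "OtherBot": {
        "networks": [],
        "dns_suffixes": ["crawl.example.net"],
    },
}

def classify_request(user_agent, ip, registry, fcrdns_check):
    """Return (crawler_name, status) for one request."""
    # Tier 1: user-agent token match, used only to segment the log.
    ua = user_agent.lower()
    name = next((n for n in registry if n.lower() in ua), None)
    if name is None:
        return None, "not-declared"
    entry = registry[name]
    addr = ipaddress.ip_address(ip)
    # Tier 2: IP range check where the vendor publishes ranges.
    if entry["networks"]:
        in_range = any(addr in net for net in entry["networks"])
        return name, ("verified-ip-range" if in_range else "unverified")
    # Tier 3: FCrDNS for crawlers without published ranges.
    if entry["dns_suffixes"] and fcrdns_check(ip, entry["dns_suffixes"]):
        return name, "verified-fcrdns"
    return name, "unverified"
```

A crawler with neither ranges nor usable reverse DNS always lands in "unverified", which is the intended outcome for a user-agent that cannot be confirmed beyond the string itself.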

llms.txt Measures the Wrong Traffic

Requesters to llms.txt Files by Category (May 2026)
Of llms.txt files that received any traffic in May 2026, AI search bots — the crawlers most likely to generate citations — made up just 1% of requests. 97% of all llms.txt files received zero bot requests. Source: Ahrefs analysis of 137,000 domains.

Ahrefs' May 2026 analysis of 137,000 domains found 97% of llms.txt files received zero requests from AI retrieval bots in the measurement period. Of the 3% that received any traffic: SEO audit tools sent 21% of total requests to those files, unidentified bots 14%, traditional web crawlers 13%, tech profiling tools 11%, AI coding agents 10%, AI training crawlers 5%, AI assistants 2%, and AI search bots 1%.

The practical consequence: instrumenting llms.txt as a traffic counter primarily measures SEO tooling and generic crawlers, not AI search bot presence. A hit from OAI-SearchBot or PerplexityBot on your llms.txt file is a meaningful signal, because it indicates that the crawler is performing structured discovery. Its absence does not indicate that those crawlers have not indexed your content. Among the AI categories that fetch llms.txt, coding agents are the most active at 10% of requests, still less than half the share of SEO audit tools (21%). For sites targeting developers, a coding-agent hit pattern on the file is a reliable readership signal; for measuring general AI search reach, it is not.
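To see which categories actually request the file on your own site, an access log can be tallied by declared user-agent. The sketch below assumes Combined Log Format, and the token-to-category map is illustrative rather than taken from the Ahrefs methodology. The counts reflect declared identity only; nothing here verifies the requester.

```python
import re
from collections import Counter

# Illustrative token -> category map (an assumption); extend for your traffic.
CATEGORY_BY_TOKEN = {
    "OAI-SearchBot": "ai_search",
    "PerplexityBot": "ai_search",
    "ChatGPT-User": "ai_assistant",
    "GPTBot": "ai_training",
    "ClaudeBot": "ai_training",
    "AhrefsBot": "seo_tool",
    "SemrushBot": "seo_tool",
}

# Combined Log Format tail: "METHOD path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def llms_txt_requesters(lines):
    """Count llms.txt requests per declared user-agent category."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m.group("path").split("?")[0] != "/llms.txt":
            continue
        ua = m.group("ua")
        category = next(
            (cat for token, cat in CATEGORY_BY_TOKEN.items() if token in ua),
            "unidentified",
        )
        counts[category] += 1
    return counts
```

Usage is `llms_txt_requesters(open("access.log"))`, where the path is a placeholder for your own log file.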

What This Means for Site Owners

Build the three-tier verification stack before taking any access-control action based on declared AI crawler identity. User-agent strings are the cheapest signal to capture and the easiest to fake. IP range verification against vendor-published JSON costs one lookup per request at the edge and eliminates the bulk of spoofed traffic for the four major crawlers that publish ranges. Add it to any routing or rate-limiting logic that grants preferential treatment to AI bots.

Do not use llms.txt hit rate as a proxy for AI search engine coverage. The Ahrefs data shows the bots most active on llms.txt are SEO audit tools, unidentified bots, and traditional web crawlers, not the retrieval bots that generate citations. Treating llms.txt request volume as an AI indexing signal produces a measurement inflated by scanners and deflated by the search crawlers that largely ignore the file.

The risk from the ChatGPT-User spoof rate is highest for operators that trigger policy changes based on declared bot identity: cache bypass, relaxed rate limits, content unlocking. DataDome logged 1.7 billion requests from the dominant AI assistant platform's crawlers in a single month on their network. At a 1:5 spoof ratio on ChatGPT-User, a material fraction of that declared traffic originates elsewhere. Server-side verification is the only mechanism that distinguishes the two.

Sources

  1. HUMAN Security 2026 State of AI Traffic & Cyberthreat Benchmark Report
  2. AI Crawler Spoofing: Attackers Impersonate ChatGPT & Perplexity
  3. We Analyzed 137K Sites: 97% of llms.txt Files Never Get Read
  4. The AI Traffic Report: High Volume, Low Visibility, and a Growing Risk