robots.txt Controls 60% of Sites Tried. The Compliance Gap Is Real.
79% of top news sites now block AI training crawlers via robots.txt. Publishers that blocked lost 23% of monthly traffic. A network security report caught one AI search provider deploying stealth crawlers to evade the blocks.
In January 2026, 79% of the top 100 news sites by traffic were blocking at least one AI training crawler via robots.txt — up from 23% of reputable sites in September 2023. Within six weeks of adding those rules, sites in one cohort lost an average of 7% of weekly traffic. Measured over a longer window — October 2022 through June 2025 — publishers that added AI bot restrictions experienced a 23.1% decline in total monthly visits and a 13.9% drop in human-only sessions. The mechanism intended to protect content from AI data collection is producing documented traffic costs while the crawl-prevention effect has significant gaps.
Method
This post draws on four sources. The BuzzStream study (January 2026) analyzed robots.txt configurations across the top 100 US and UK news sites by traffic, recording which AI user-agents each site disallowed and the training-vs-retrieval distinction in their configurations. The Zhao–Berman working paper "The Impact of LLMs on Online News Consumption and Production" (arXiv:2512.24968, updated April 2026) used a difference-in-differences design with web panel data from October 2022 through June 2025 to measure traffic changes at publishers that added AI bot blocks relative to unblocked peers. A network security report published in August 2025 documented undeclared crawler behavior from a major AI search provider, with request volume counts and evasion techniques measured across tens of thousands of domains. A May 2025 arXiv preprint (arXiv:2505.21733) examined 130 self-declared bots across 3.9 million requests from 36 websites over 40 days, testing whether bots fetched and respected robots.txt directives under controlled conditions.
Blocking Rates by Bot and Purpose
CCBot was blocked by 75% of the 100 sites studied, ClaudeBot by 69%, PerplexityBot by 67%, and GPTBot by 62%. These are all crawlers that declare a training or indexing purpose. The retrieval counterpart — Perplexity-User, which handles live user queries rather than training-data collection — was blocked by only 17% of sites.
The 50-percentage-point gap between PerplexityBot (67% blocked) and Perplexity-User (17% blocked) is not explained by any technical documentation: both are bots from the same operator, crawling the same pages. The gap reflects site-level configuration decisions, many of which were made in 2023 and 2024 when the training-vs-retrieval distinction was not yet widely understood. A configuration that disallows the training bot while allowing the retrieval bot limits the operator's ability to add new training data from that site; a configuration that blocks both cuts off both data collection and live-search citation eligibility simultaneously. Most existing block rules predate the retrieval distinction and were not designed with it in mind.
The BuzzStream data also showed 71% of sites blocking at least one retrieval or live-search bot — meaning the majority of sites that blocked anything applied their rules to both pipeline types. The operational effect is that most sites running AI bot restrictions have foreclosed AI citation eligibility along with training data access, whether or not that was the intent.
The Traffic Cost of Blocking
The Zhao–Berman paper's central finding is that blocking AI crawlers is not traffic-neutral. Publishers that added robots.txt restrictions experienced a 23.1% decline in total monthly visits and a 13.9% drop in human-only sessions relative to unblocked peers, measured through June 2025.
The 13.9% decline in human sessions is the operationally significant figure: it counts visits from organic search, direct, and referral channels, with bot traffic excluded. The proposed mechanism runs through AI citation: if a retrieval crawler cannot access a page, AI-assisted search products cannot serve that page as a source when a user asks about a topic the site covers. As AI-assisted discovery grows as a share of referral traffic, the citation effect scales with it.
The 7% weekly traffic loss visible within six weeks of adding blocks is the near-term signal; the 23.1% monthly decline is the longer-run accumulation as citation pipelines deplete and unblocked competitors build comparative visibility in AI search results. The two figures describe the same phenomenon over different time horizons, not two independent effects.
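The "relative to unblocked peers" comparison is a difference-in-differences estimate. A minimal sketch of that arithmetic follows, assuming a simple two-group, two-period setup on log visits. The visit counts are invented for illustration and do not come from the Zhao–Berman paper, whose actual specification is not reproduced here.

```python
import math


def did_log_effect(treated_pre: float, treated_post: float,
                   control_pre: float, control_post: float) -> float:
    """Difference-in-differences on log visits.

    Subtracts the change seen at unblocked (control) sites from the change
    seen at blocking (treated) sites, so a market-wide trend cancels out.
    """
    treated_change = math.log(treated_post) - math.log(treated_pre)
    control_change = math.log(control_post) - math.log(control_pre)
    return treated_change - control_change


def as_percent_change(log_effect: float) -> float:
    """Convert a log-point effect into a percent change."""
    return math.expm1(log_effect) * 100


# Hypothetical monthly visits (illustrative only, not study data):
# blocking sites fall from 1000 to 720, unblocked peers fall from 1000 to 900.
effect = did_log_effect(1000, 720, 1000, 900)
percent = as_percent_change(effect)
print(f"Estimated effect of blocking: {percent:.1f}%")
```

In this invented example the blocking sites lose 28% in raw terms, but 10 points of that decline also appear at unblocked peers, so the estimate attributed to blocking is smaller than the raw drop.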
Compliance Is Not Uniform
The May 2025 arXiv compliance study (arXiv:2505.21733) measured 130 self-declared bots across 3.9 million web requests from 36 sites over 40 days. The study found that certain categories of bots — including AI search crawlers operating in real-time retrieval mode — rarely checked robots.txt at all. Among crawlers that did fetch the file, compliance was inversely correlated with rule strictness: bots were less likely to follow a strict Disallow directive than a Crawl-delay directive. The stronger the restriction, the weaker the compliance.
The network security report from August 2025 documented a more active form of non-compliance. After one major AI search provider's declared user-agent encountered blocks across domains, the provider deployed an undeclared crawler using generic browser user-agent strings and IP address ranges that did not appear in any published verification list, rotating source ASNs to evade detection. The declared user-agent had been generating approximately 20–25 million requests per day across the network; the undeclared crawler added 3–6 million daily requests from off-list infrastructure. The dual-stream activity was observed across tens of thousands of domains simultaneously.
These two compliance failure modes are structurally different. Bots that do not fetch robots.txt represent passive non-compliance — the file's instructions are never received. A bot that fetches robots.txt, encounters a Disallow, and then routes traffic through an undeclared crawler to retrieve the content anyway is active evasion. Passive non-compliance can be partially addressed by publishing robots.txt and monitoring for new user-agents. Active evasion is not addressable via robots.txt alone and requires network-layer controls.
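Passive non-compliance can be spotted in access logs as user-agents that request content but never request /robots.txt. The sketch below assumes logs have already been parsed into (user_agent, path) records; the `Request` type, the function name, and the sample agents are hypothetical. A crawler may also have cached the file before the log window began, so a hit here is a prompt for review, not proof.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    """One parsed access-log record (hypothetical minimal shape)."""
    user_agent: str
    path: str


def agents_never_fetching_robots(requests: Iterable[Request]) -> set[str]:
    """Return user-agents that fetched pages but never fetched /robots.txt."""
    paths_by_agent: dict[str, set[str]] = defaultdict(set)
    for req in requests:
        paths_by_agent[req.user_agent].add(req.path)
    return {
        agent
        for agent, paths in paths_by_agent.items()
        if "/robots.txt" not in paths
    }


# Illustrative records with made-up agent names.
SAMPLE_LOG = [
    Request("PoliteBot/1.0", "/robots.txt"),
    Request("PoliteBot/1.0", "/news/story-1"),
    Request("LiveFetcher/2.3", "/news/story-1"),
    Request("LiveFetcher/2.3", "/news/story-2"),
]

suspects = agents_never_fetching_robots(SAMPLE_LOG)
print(sorted(suspects))
```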
What This Means for Site Owners
The training/retrieval configuration split is the most actionable variable in most existing robots.txt setups. Most sites that added AI bot blocks in 2023 applied a broad Disallow to every known AI user-agent string. That configuration treats training and retrieval as a single category. Separating them — disallowing the training crawler while allowing the retrieval equivalent — preserves citation upside while restricting training data access. Implementing the split requires identifying the specific retrieval user-agent string each major AI platform uses for live query fulfillment; most operators publish user-agent documentation. The configuration change is per-agent Disallow rules, not a structural rewrite.
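A split configuration can be checked before deployment with Python's standard-library robots.txt parser. The sketch below is illustrative: the rules are a hypothetical example using the PerplexityBot and Perplexity-User names from the BuzzStream data, and example.com is a placeholder. It shows how a standards-following parser reads the rules, not how any particular crawler behaves in practice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical split configuration: the crawler this post groups under
# training/indexing is disallowed, the retrieval agent is allowed.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Allow: /

User-agent: *
Allow: /
"""


def access_by_agent(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report whether each user-agent may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = access_by_agent(
    ROBOTS_TXT,
    ["PerplexityBot", "Perplexity-User", "GPTBot"],
    "https://example.com/news/story",  # placeholder URL
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

GPTBot has no group of its own in this example, so it falls through to the wildcard group. A site that wants to restrict it needs an explicit per-agent rule.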
For operators that use active evasion techniques, robots.txt cannot serve as the primary control. IP-range blocking against officially published CIDR lists operates at the network layer, independently of user-agent declarations, and does not depend on the crawler fetching robots.txt before deciding whether to comply. The limitations are that published CIDR lists need maintenance as infrastructure changes, that not all operators publish comprehensive ranges, and that the undeclared crawler in the August 2025 report used addresses outside any published list. A practical supplement is flagging requests in access logs that carry unrecognized user-agent strings but originate from IP ranges attributed to known AI operators. That check catches undeclared agents on attributed infrastructure; traffic routed through off-list ranges will not match it and needs other signals.
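The log-flagging check can be sketched with Python's standard `ipaddress` module. Everything named here is an assumption for illustration: the operator name, the agent token, and the CIDR blocks (reserved documentation ranges, not real operator ranges). A real deployment would load ranges from each operator's published list.

```python
import ipaddress
from typing import Optional

# Placeholder CIDR blocks from reserved documentation space.
# These are NOT real operator ranges.
ATTRIBUTED_RANGES = {
    "example-ai-operator": ["192.0.2.0/24", "198.51.100.0/25"],
}

# Hypothetical declared user-agent tokens for the operators above.
DECLARED_AGENT_TOKENS = ("ExampleBot",)


def operator_for_ip(ip: str) -> Optional[str]:
    """Return the operator whose attributed ranges contain `ip`, if any."""
    addr = ipaddress.ip_address(ip)
    for operator, cidrs in ATTRIBUTED_RANGES.items():
        if any(addr in ipaddress.ip_network(cidr) for cidr in cidrs):
            return operator
    return None


def is_undeclared_on_attributed_ip(ip: str, user_agent: str) -> bool:
    """Flag requests from attributed ranges that lack a declared agent token."""
    declared = any(
        token.lower() in user_agent.lower() for token in DECLARED_AGENT_TOKENS
    )
    return operator_for_ip(ip) is not None and not declared


# A generic browser string arriving from an attributed range is flagged.
print(is_undeclared_on_attributed_ip("192.0.2.44", "Mozilla/5.0 (X11; Linux x86_64)"))
# The declared crawler from the same range is not.
print(is_undeclared_on_attributed_ip("192.0.2.44", "ExampleBot/1.0"))
```

Requests from addresses outside the attributed ranges return False regardless of user-agent, which is why this check is a supplement to other monitoring, not a replacement for it.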
The Zhao–Berman 23.1% traffic decline figure comes from major news publishers, a cohort with high AI-search exposure and significant organic search dependence. That effect size should not be applied uniformly. Sites operating outside the top-traffic news tier, or in topics with lower AI-search saturation, are likely to see a smaller penalty from blocking. The decision framework is the same in form: measure the AI-search citation contribution to your current referral traffic, model the cost of reducing it, and compare that against the value of restricting training data access. The 23.1% is an average effect for a high-exposure cohort, best treated as a high-end reference point rather than a universal cost estimate. Block configurations set in 2023, before AI-search referral traffic was measurable at scale, are worth re-evaluating against current traffic data.
Sources
- Which News Sites Block AI Crawlers? (BuzzStream, January 2026)
- The Impact of LLMs on Online News Consumption and Production (Zhao and Berman, arXiv:2512.24968, 2025; updated April 2026)
- Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study (arXiv:2505.21733, May 2025)
- Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives (August 2025)