Bot Traffic · June 29, 2026

What Happens When You Block AI Crawlers? News Publishers Just Found Out the Hard Way

82% of news publishers now block at least one AI crawler — and a peer-reviewed Wharton study found those that did lost 7% of weekly traffic within six weeks. Here is what the data actually shows.

Some 82.4% of major news publishers now block at least one AI crawler in their robots.txt. That is the highest blocking rate of any content category on the web, about 37.5 percentage points above the all-sector average of 44.9%, according to a June 2026 audit of 122 websites across ten content categories.

Then a peer-reviewed study quietly dropped a number that complicated the whole picture.

Researchers at the Wharton School (University of Pennsylvania) and Rutgers Business School tracked news publishers who had implemented those blocks and found they lost roughly 7% of their weekly website traffic within six weeks of blocking — and that decline showed up in Comscore household browsing data, not just server-side log estimates. A separate analysis of the largest publishers found traffic losses closer to 23%. Whether the mechanism is direct (AI assistants reducing citations to blocked sites) or indirect (something structural about how AI indexing and content recommendation interact), the numbers point in an uncomfortable direction for publishers who assumed robots.txt was a clean opt-out.

Where does the data come from?

The 82.4% blocking-rate figure comes from a June 13, 2026 audit that used standard HTTP GET requests to fetch and parse the public robots.txt files of 122 websites across ten content categories, 107 of which returned a parseable file. The academic study is "Strategic Response of News Publishers to Generative AI" by Hangcheng Zhao (Rutgers Business School) and Ron Berman (Wharton School), last revised April 21, 2026 on SSRN, drawing on SimilarWeb, Semrush, and Comscore data.

The per-crawler blocking mention rates come from a March 2026 analysis of approximately 4,000 public robots.txt files. The Bytespider growth trajectory is from the WebSearchAPI.ai monthly AI crawler report series, tracking CDN-level bot traffic through May 2026. The ChatGPT-User vs Googlebot comparison is from a 55-day proxy request study (January–March 2026) spanning 24.4 million requests across 78,000 pages on 69 production sites.
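The parse step of an audit like this can be sketched in a few lines of Python. This is a minimal sketch, not the audit's actual code: it assumes the robots.txt text has already been fetched, and it treats a crawler as blocked when it may not fetch the site root. The crawler list, sample files, and function names are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Assumed watch list: only crawler tokens named in this article.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Bytespider"]


def blocked_ai_crawlers(robots_txt: str) -> list[str]:
    """Return the AI crawler tokens that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, "/")]


def blocking_rate(files: list[str]) -> float:
    """Share of parseable files that block at least one AI crawler."""
    blocking = sum(1 for text in files if blocked_ai_crawlers(text))
    return blocking / len(files)


# Hypothetical robots.txt bodies standing in for fetched files.
SAMPLE_FILES = [
    "User-agent: GPTBot\nDisallow: /\n",
    "User-agent: *\nDisallow: /admin/\n",
    "User-agent: ClaudeBot\nUser-agent: CCBot\nDisallow: /\n",
    "User-agent: *\nAllow: /\n",
]

if __name__ == "__main__":
    for text in SAMPLE_FILES:
        print(blocked_ai_crawlers(text))
    print(f"blocking rate: {blocking_rate(SAMPLE_FILES):.1%}")
```

Under this rule a wildcard group with `Disallow: /` would count every crawler as blocked. Whether the audit treated such files the same way is not stated, so the threshold here is a judgment call.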

Which crawlers are publishers actually blocking?

Among the roughly 4,000 robots.txt files parsed in March 2026, 13.8% named GPTBot, the training data crawler operated by the company behind ChatGPT, in a DISALLOW rule. ClaudeBot sat at 11.5%, CCBot at 11.2%, and Google-Extended at 10.7%.

That last figure is worth pausing on. Google-Extended is the robots.txt token that governs AI training use, kept separate from standard search crawling: blocking it opts content out of AI model training without touching the crawler that drives organic search traffic. Publishers are blocking it at nearly the same rate as the major AI assistant crawlers. Standard search crawlers from the same companies almost never appear in DISALLOW rules on the same sites.

AI Crawlers Named in robots.txt DISALLOW Rules (% of 4,000 Files, March 2026)

GPTBot and ClaudeBot appear in roughly 1-in-8 public robots.txt files. Google-Extended is blocked at nearly the same rate despite being separate from the traffic-driving standard crawler.

Source: robots.txt AI Crawler Blocking Report, March 2026 — TechnologyChecker.io

The logic seems clear: publishers are trying to draw a line between crawlers that send referral traffic and crawlers that take content for training without compensation. Whether that line is actually working is the question the Wharton/Rutgers study pokes at.

What is Bytespider, and why does it matter for your blocking decisions?

Bytespider — operated by ByteDance — grew from 3.6% of AI crawler traffic in February 2026 to 10.5% in May, three consecutive months of accelerating growth. It is now the #4 AI crawler globally, overtaking ClaudeBot in market share.

Bytespider Share of AI Crawler Traffic (Feb-May 2026)

Three consecutive months of growth make Bytespider the fastest-rising AI crawler of 2026, now ranked #4 globally ahead of ClaudeBot.

Source: WebSearchAPI.ai Monthly AI Crawler Report, May 2026

Why does that matter for publishers making blocking choices right now? Because Bytespider's referral mechanism is less transparent than that of the major AI assistant crawlers. Its crawl volume is large and growing fast, but the path from "Bytespider indexed your content" to "a user gets directed to your site" is not as clearly documented. If you are framing blocking decisions around a training-cost-versus-referral-upside trade-off, Bytespider fits a different risk profile than the crawlers tied to AI assistants with large, well-documented user bases in your market.

At its current pace, Bytespider could match GPTBot's share within months, which would make ByteDance's crawler the third-largest AI bot on the web.
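One way to make "at its current pace" concrete is to turn the two reported endpoints into an average monthly growth factor and extrapolate. The sketch below is a naive projection that assumes a constant multiplicative rate, which a traffic share cannot sustain for long. The target share is a hypothetical placeholder, because the GPTBot figure is not given here.

```python
import math

FEB_SHARE = 3.6    # percent of AI crawler traffic, February 2026 (reported)
MAY_SHARE = 10.5   # percent of AI crawler traffic, May 2026 (reported)
MONTHS_ELAPSED = 3

# Hypothetical target share; GPTBot's actual share is not stated in the article.
HYPOTHETICAL_TARGET_SHARE = 20.0


def monthly_growth_factor(start: float, end: float, months: int) -> float:
    """Average multiplicative growth per month between two observations."""
    return (end / start) ** (1 / months)


def months_to_reach(current: float, target: float, factor: float) -> float:
    """Months until `current` reaches `target` at a constant monthly factor."""
    return math.log(target / current) / math.log(factor)


FACTOR = monthly_growth_factor(FEB_SHARE, MAY_SHARE, MONTHS_ELAPSED)

if __name__ == "__main__":
    remaining = months_to_reach(MAY_SHARE, HYPOTHETICAL_TARGET_SHARE, FACTOR)
    print(f"average monthly growth factor: {FACTOR:.2f}")
    print(f"months to reach {HYPOTHETICAL_TARGET_SHARE}%: {remaining:.1f}")
```

The reported endpoints work out to an average factor of roughly 1.43 per month. Swapping in a measured GPTBot share for the placeholder gives a rough time-to-parity under the same assumption.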

The training crawler and the retrieval agent are not the same thing

This is probably the most underappreciated nuance in the AI crawler blocking conversation: the company behind ChatGPT operates two entirely separate crawlers. GPTBot is the batch training crawler — it harvests content for model training runs. ChatGPT-User is a real-time retrieval agent that fetches pages when a user is actively asking a question that requires up-to-date web content.

The 55-day proxy request study found ChatGPT-User made 3.6 times more requests than Googlebot over the study period. In absolute terms, it outpaced Googlebot, Amazonbot, and Bingbot combined.

Blocking GPTBot affects training data collection only. Blocking ChatGPT-User affects whether your content can appear in AI-generated answers to live queries, the mechanism that actually puts humans on your page. Most robots.txt DISALLOW rules do not distinguish between them. A broad rule, whether a wildcard User-agent group in robots.txt or a firewall filter that matches the parent company's bots by domain or partial user-agent string, often catches both crawlers in a single directive.
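The partial-match pitfall is easiest to see in a server-side or CDN-side filter. The sketch below compares three hypothetical filter patterns against illustrative user-agent strings. The strings are simplified stand-ins, with assumed version numbers, not verbatim copies of what each bot sends.

```python
import re

# Illustrative user-agent strings; version numbers and layout are assumptions.
SAMPLE_USER_AGENTS = {
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "ChatGPT-User": "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)",
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

# Hypothetical filter patterns a site might deploy.
PARTIAL_RULE = re.compile(r"gpt", re.IGNORECASE)      # partial user-agent match
DOMAIN_RULE = re.compile(r"openai\.com")              # match on the operator's domain
EXACT_RULE = re.compile(r"\bGPTBot/", re.IGNORECASE)  # training crawler token only


def blocked_by(rule: re.Pattern, agents: dict[str, str]) -> list[str]:
    """Return the names of the agents whose user-agent string matches the rule."""
    return sorted(name for name, ua in agents.items() if rule.search(ua))


if __name__ == "__main__":
    for label, rule in [("partial", PARTIAL_RULE), ("domain", DOMAIN_RULE), ("exact", EXACT_RULE)]:
        print(label, blocked_by(rule, SAMPLE_USER_AGENTS))
```

With these sample strings, the partial and domain patterns catch both the training crawler and the retrieval agent, while the exact token pattern catches only GPTBot.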

Training crawls accounted for 52.3% of all AI crawler requests in the 28-day period to June 22, 2026. Real-time user-action requests sat at 2.6%. By raw request count, batch training dominates. But in terms of which requests have any path to driving a human visit, the retrieval slice carries most of the weight — and blanket blocking catches both.

What should you actually do with this?

Write bot-specific DISALLOW rules, not company-level blocks. GPTBot and ChatGPT-User have distinct user-agent strings. So do ClaudeBot and the retrieval agent that operates alongside it. If your goal is to opt out of training data collection specifically, target the training crawlers by their exact user-agent string and leave the retrieval agents accessible. This takes about ten minutes to update in robots.txt and avoids cutting off the referral-generating requests along with the training ones.
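A bot-specific rule set can be generated and checked before it goes live. The sketch below builds one robots.txt group for an assumed list of training crawlers, then uses Python's standard-library parser to confirm that the retrieval agent is still allowed. The crawler lists, function names, and sample path are illustrative assumptions, so adjust them to the bots you actually want to exclude.

```python
from urllib.robotparser import RobotFileParser

# Assumed split between training crawlers and retrieval agents.
TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot"]
RETRIEVAL_AGENTS = ["ChatGPT-User"]


def build_training_block(blocked: list[str]) -> str:
    """Build one robots.txt group that disallows only the named crawlers."""
    lines = [f"User-agent: {bot}" for bot in blocked]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt: str, bot: str, path: str) -> bool:
    """Check whether `bot` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, path)


ROBOTS_TXT = build_training_block(TRAINING_CRAWLERS)

if __name__ == "__main__":
    print(ROBOTS_TXT)
    for bot in TRAINING_CRAWLERS + RETRIEVAL_AGENTS:
        print(bot, "allowed" if is_allowed(ROBOTS_TXT, bot, "/news/story") else "blocked")
```

The generated group is meant to be appended to an existing robots.txt rather than to replace it. Python's parser matches agent names loosely, so treat the check as a sanity test, not a guarantee of how each crawler interprets the file.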

Audit your server logs before committing. The 7% traffic decline is an aggregate across many publishers — your site's specific picture could be very different. Pull your access logs for the crawlers you are considering blocking. Check whether you are seeing any correlated AI-attributed referral traffic in your analytics. The Zhao/Berman figure is a good reason to model the downside before blocking rather than after.
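A first pass at the log audit can be a simple count of requests per AI crawler. The sketch below assumes access logs in the common "combined" format, where the user agent is the last quoted field. The token list, sample lines, and IP addresses are hypothetical.

```python
import re
from collections import Counter

# Assumed watch list of user-agent tokens; extend it for your own audit.
AI_AGENT_TOKENS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "CCBot", "Bytespider"]

# Combined log format tail: "request" status bytes "referer" "user agent"
LOG_TAIL = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')


def count_ai_requests(lines: list[str]) -> Counter:
    """Count requests per AI crawler token found in the user-agent field."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        for token in AI_AGENT_TOKENS:
            if token.lower() in agent:
                counts[token] += 1
                break
    return counts


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [22/Jun/2026:10:00:00 +0000] "GET /news/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [22/Jun/2026:10:00:02 +0000] "GET /news/b HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [22/Jun/2026:10:01:00 +0000] "GET /news/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.9 - - [22/Jun/2026:10:02:00 +0000] "GET /news/c HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '192.0.2.44 - - [22/Jun/2026:10:03:00 +0000] "GET /news/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/127.0"',
]

if __name__ == "__main__":
    for token, hits in count_ai_requests(SAMPLE_LOG).most_common():
        print(f"{token}: {hits}")
```

Counts like these only show crawl volume. Matching them against AI-attributed referrals in your analytics is a separate step, and user-agent strings can be spoofed, so treat the totals as an estimate.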

Hold off on blocking Bytespider for now. Its growth is real and the crawl volume is already substantial. What remains unclear is whether blocking it has material cost, since referral data for Bytespider is thinner than for AI assistant crawlers with documented user bases in your market. The conservative approach is to monitor what you see in your access logs and attribution data before making a call.

The news publishing sector has framed this almost entirely as a content rights question. The Wharton/Rutgers study adds a data dimension that is harder to dismiss: publishers that blocked AI crawlers measurably lost human traffic. That does not settle the question, but it does mean the decision is more complicated than a simple opt-out from an ecosystem you assume is taking without giving back.