Are AI Crawlers Actually Reading Your robots.txt?
60% of reputable websites now block at least one AI crawler — up from 23% in 2023. But documented cases of undeclared browser-impersonating bots raise a harder question: how much of that control is real?
Somewhere in the past 20 months, the web quietly staged a mass lockout. In September 2023, 23% of reputable websites had any robots.txt directive targeting an AI crawler. By January 2024 that number had jumped to 50%. By May 2025 it was sitting at 60% — and it's probably higher today. So how is it that AI assistants are still summarising content from sites that have told them, in plain text, to stay away?
Where does this data come from?
Two peer-reviewed studies underpin this post. The first — Somesite I Used To Crawl, presented at ACM IMC 2025 (arxiv.org/abs/2411.15091) — measured robots.txt behaviour across 51,605 sites drawn from the Tranco Top 100k, tracking crawler directives monthly from October 2022 through October 2024. The second, published in October 2025 (arxiv.org/abs/2510.10315), compared blocking patterns between 3,369 reputable news sites and 710 misinformation sites to understand who is actually deploying access controls — and what kind. On the traffic side, HUMAN Security's 2026 AI Traffic Benchmarks classifies crawler requests across a large web sample and breaks out declared versus undeclared bot volumes.
Just how fast did the blocking wave grow?
Going from 23% to 60% is a 37-percentage-point rise in 20 months, an unusually steep adoption curve for a robots.txt directive, and it was driven almost entirely by sites that had never needed to think about crawler policy before large language model training became mainstream. The October 2025 study found that reputable websites now block an average of 15.5 distinct AI user agents: not a single generic Disallow rule, but a maintained list with individual entries per crawler. Misinformation sites, by contrast, block fewer than one AI agent on average, and only 9.1% of them block any AI crawler at all.
The contrast isn't surprising when you think about who has the most at stake. Reputable publishers — news organisations, reference sites, high-quality content producers — are exactly the sources that AI training pipelines target most heavily. They've noticed, and they've responded. By August 2025, more than 2.5 million sites had fully opted out of AI training access. That's not a fringe reaction anymore — it's become a mainstream part of content strategy.
Which bots are actually getting blocked?
The BuzzStream 2025 analysis of 100 major US and UK news publishers shows a clear pecking order among training crawlers. CCBot — which powers Common Crawl, one of the most widely used open training datasets — is blocked by 75% of publishers. ClaudeBot comes in at 69%, GPTBot at 62%. Then there's a notable gap: Google-Extended, which trains a major AI assistant's knowledge base, is blocked by only 46% of publishers — almost 30 points below CCBot.
Why the gap? Most likely because publishers are reluctant to antagonise Google in organic search rankings, even when they're comfortable blocking other AI training crawlers. Google-Extended is, functionally, Google — and breaking that relationship carries costs that blocking CCBot doesn't.
The picture for retrieval bots — the ones doing real-time fetches when a user asks an AI assistant a question — looks different again. OAI-SearchBot (used for live results by one large AI assistant) is blocked by 49% of publishers. The retrieval counterpart to ClaudeBot is blocked by 66%. PerplexityBot's retrieval agent, Perplexity-User, is blocked by just 17%. So if you've blocked the training crawlers but left retrieval bots alone, your content is probably still being served in real-time AI responses.
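The gap between training and retrieval rules can be checked mechanically. Below is a minimal sketch using Python's standard urllib.robotparser; the robots.txt contents, the example.com URL, and the blocked_agents helper are illustrative assumptions, not any publisher's real policy. Note that Python's parser matches user agents by case-insensitive substring, which may differ in detail from how a given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks three training crawlers and says
# nothing at all about retrieval agents.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

# Agent tokens named in this post (training first, then retrieval).
AGENTS = [
    "CCBot", "GPTBot", "ClaudeBot", "Google-Extended",
    "OAI-SearchBot", "Perplexity-User",
]


def blocked_agents(robots_txt, agents, url="https://example.com/article"):
    """Map each agent token to True if the rules disallow `url` for it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


report = blocked_agents(ROBOTS_TXT, AGENTS)
for agent, blocked in report.items():
    print(f"{agent:16} {'blocked' if blocked else 'allowed'}")
```

With this sample file, the three listed training crawlers come back blocked while Google-Extended, OAI-SearchBot, and Perplexity-User remain allowed, because no rule group names them and there is no wildcard group.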
Does the block actually work?
Here's the uncomfortable part. robots.txt is a social norm, not a technical control. Complying with it is entirely voluntary. The good news: the operators behind GPTBot, ClaudeBot, PerplexityBot, and Google-Extended all document the robots.txt tokens their crawlers answer to, and in normal operation those crawlers honour Disallow rules. The bad news: "normal operation" comes with an asterisk.
Documented cases show at least one major AI search engine deploying an undeclared secondary crawler — sending requests using a generic Chrome-on-macOS browser string when its declared crawler encountered a block. The stealth crawler rotated through IP ranges not listed in the operator's official IP documentation, making IP-based blocking useless against it. In some periods, it was observed skipping robots.txt fetches entirely rather than reading and respecting them.
The scale was significant: 3–6 million daily requests from the undeclared crawler, running alongside a declared crawler making 20–25 million daily requests. At the midpoints of those ranges, that's roughly one in six requests from that operator coming from a crawler that had actively chosen to hide what it was; across the full ranges the share runs from about 11% to 23%.
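For readers who want to check the proportion, here is a small sketch that bounds the undeclared share using only the request ranges quoted above. The function and variable names are illustrative, and the calculation assumes the two ranges can vary independently.

```python
def undeclared_share(undeclared, declared):
    """Fraction of an operator's requests that came from the undeclared crawler."""
    return undeclared / (undeclared + declared)


# Daily request ranges quoted above, in millions.
UNDECLARED = (3, 6)
DECLARED = (20, 25)

# Widest bounds: fewest undeclared with most declared, and vice versa.
lowest = undeclared_share(UNDECLARED[0], DECLARED[1])   # 3 / 28
highest = undeclared_share(UNDECLARED[1], DECLARED[0])  # 6 / 26

# Midpoint estimate: 4.5 / (4.5 + 22.5) = 1/6.
midpoint = undeclared_share(sum(UNDECLARED) / 2, sum(DECLARED) / 2)

print(f"undeclared share: {lowest:.1%} to {highest:.1%} (midpoint {midpoint:.1%})")
# prints "undeclared share: 10.7% to 23.1% (midpoint 16.7%)"
```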
The ACM IMC 2025 study was frank about the limits of text-file-based controls: "limited efficacy against unresponsive crawlers." The researchers found that network-level crawler blocking — though much less commonly deployed — offered meaningfully stronger protection than robots.txt alone.
How much undeclared traffic is actually out there?
In aggregate, not a lot. HUMAN Security's 2026 benchmarks put undeclared AI crawler traffic at roughly 0.5% of all AI crawler requests measured. Training crawlers dominate at 52.3% of total AI crawler requests; real-time retrieval triggered by actual user queries accounts for just 2.6%.
That 0.5% sounds reassuring until you think about the denominator. AI crawlers were generating over 50 billion daily requests across major CDN networks as of early 2025. If the same share held at that volume, half a percent would be 250 million requests per day from crawlers that didn't identify themselves. For any individual publisher being targeted, though, the global aggregate is irrelevant: what matters is whether the specific request hitting their origin server respected their access directives.
What should you actually do with this?
If you're trying to block AI training crawlers, robots.txt will stop the compliant majority. The operators behind CCBot, GPTBot, ClaudeBot, and Google-Extended document how their crawlers identify themselves, and those crawlers generally honour the directives. For more reliable enforcement, cross-check incoming requests against the CIDR blocks an operator publishes, where it publishes them. Anything claiming to be a known bot but arriving from an address outside those ranges is either misconfigured or actively misrepresenting itself. Treating those requests as unverified traffic, rather than extending them the trust reserved for declared bots, is a reasonable and low-overhead default.
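The cross-check against published CIDR blocks could look something like the sketch below, which uses Python's standard ipaddress module. The ranges shown are placeholders taken from the reserved documentation address blocks, not any operator's real ranges, and PUBLISHED_RANGES and classify_request are hypothetical names. In practice the table would be populated from each operator's own published list.

```python
import ipaddress

# PLACEHOLDER ranges from reserved documentation blocks (RFC 5737).
# These are NOT real operator ranges; load the real ones from each
# operator's published documentation.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/25"],
}


def classify_request(user_agent, remote_ip, published=PUBLISHED_RANGES):
    """Classify a request as 'verified', 'unverified', or 'unclaimed'.

    verified   - claims a known bot and the IP is inside its published ranges
    unverified - claims a known bot but the IP is outside those ranges
    unclaimed  - does not claim to be any bot in the table
    """
    ua = user_agent.lower()
    address = ipaddress.ip_address(remote_ip)
    for bot, cidrs in published.items():
        if bot.lower() in ua:
            networks = (ipaddress.ip_network(cidr) for cidr in cidrs)
            if any(address in network for network in networks):
                return "verified"
            return "unverified"
    return "unclaimed"


print(classify_request("Mozilla/5.0 (compatible; GPTBot/1.1)", "192.0.2.17"))
print(classify_request("ClaudeBot/1.0", "198.51.100.200"))
print(classify_request("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)", "203.0.113.5"))
```

The three calls return verified, unverified, and unclaimed respectively. The last case is the important limitation: a crawler sending a generic browser string claims no bot identity, so this check only catches impersonation of declared bots and does nothing against the stealth pattern described earlier.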
If you want AI crawlers to reach your content, the same logic applies in the other direction: clear Allow rules, accurate structured data, and no conflicting Disallow entries that accidentally catch bots you'd actually like to let in. Given that training crawls represent 52% of all AI crawler traffic, how your content sits in future training pipelines matters as much as whether today's real-time retrieval bots can fetch it.
Either way, robots.txt is no longer a passive configuration file you set up once and forget about. It's a live access policy, and in 2026 it's worth treating it that way.
Sources
- Is Misinformation More Open? A Study of robots.txt Gatekeeping on the Web
- Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers
- Which News Sites Block AI Crawlers in 2025? [New Data]
- The 2026 State of AI Traffic & Cyberthreat Benchmark Report
- Most Major News Publishers Block AI Training & Retrieval Bots