Bot Traffic · June 5, 2026

Training Crawlers vs. Live Agents: Anatomy of the AI Bot Layer

80% of AI crawler traffic is model training, not live user queries. But live-agent crawling grew more than 15x in 2025, with ChatGPT-User alone up over 21x. Blocking training bots without understanding the split is the wrong tradeoff.

Eighty percent of the AI crawler requests hitting your servers today are batch training jobs—automated sweeps collecting content for model pipelines that may not ship for months. The remaining 20% splits into search indexers (18%) and, at just 2%, live agents actively answering real user queries. Those categories behave differently, fail differently, comply with robots.txt differently, and produce completely different downstream effects on your content's reach. Most logging dashboards treat them identically.

Two independent CDN-level analyses provided the primary dataset: one covering roughly 50 billion AI crawler requests per day across millions of customer domains; a second covering 6.5 trillion monthly requests through a WAF and bot-management layer. Both were cross-referenced against access log analysis from a major hosting network and three peer-reviewed studies tracking robots.txt adoption across the top 1 million domains (arXiv:2510.09031, arXiv:2505.21733, arXiv:2512.24968).

The Market Rearranged Itself

In May 2024, one crawler, ByteDance's Bytespider, held 42% of AI crawler request share on a major CDN network. By May 2025 it had collapsed to 7%. GPTBot moved in the opposite direction: its share rose from 5% to 30%, a sixfold increase, on a 305% jump in raw request volume over twelve months. Meta-ExternalAgent arrived at 19%. ClaudeBot dropped from 11.7% to 5.4%.

[Figure: AI Crawler Request Share by User-Agent (May 2025). Share of AI crawler requests on a major CDN network. Googlebot and traditional search bots excluded; percentages are within the AI crawler segment. Source: CDN provider network analysis, May 2025.]

A second network's Q2 2025 measurement tells the same story from a different vantage point: Meta AI crawlers accounted for 52% of AI bot volume on that network, one major search and AI platform held 23%, and GPTBot's operator held 20%. Different customer mixes, same structural point: three operators control roughly 95% of AI crawler traffic. The market rearranges itself in response to business strategy, not gradual technical drift—Bytespider's plunge and GPTBot's surge both track documented shifts in training data acquisition approach.

The 80/18/2 Split—and Why the 2% Is the Story

Purpose-classifying those 50 billion daily AI crawler requests produces a figure most site owners have not internalized: 80% are training crawlers making batch sweeps, 18% are search indexers, and 2% are live user-driven agents actively answering a question right now.

[Figure: AI Crawler Requests by Purpose (trailing 12 months, 2025). 80% of AI crawler volume is model training. Live-query agents are just 2% but grew over 15x during 2025. Source: CDN provider network analysis, 2025.]

The 2% live-agent share looks negligible until you examine growth rates and the causal chain to referral traffic. ChatGPT-User, the user-agent sent when a ChatGPT browsing session triggers a live web fetch, grew 2,825% between May 2024 and May 2025, and over 21x across full-year 2025. Live-agent crawling in aggregate grew over 15x during 2025. On the 80/18/2 split, training crawlers generate roughly 4.4x the volume of search indexers and 40x the volume of live agents, but training crawlers do not send referral visitors back to your site. One operator's crawl-to-refer ratio peaked at 500,000 crawls per single referred visit in early 2025, before that operator launched its own live web search product. After the launch, the ratio dropped to roughly 38,000 crawls per referral.
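
The arithmetic behind these comparisons is easy to reproduce. The sketch below is a minimal illustration, assuming only the 80/18/2 purpose split and the two crawl-to-refer figures quoted in this section; the function names are hypothetical, and ratios derived from a split depend on the measurement window, so they need not match figures taken from a different period.

```python
# Purpose split quoted in this section (percent of AI crawler requests).
SPLIT = {"training": 80.0, "search": 18.0, "live_agent": 2.0}


def volume_ratio(split, numerator, denominator):
    """How many times larger one category's request volume is than another's."""
    return split[numerator] / split[denominator]


def pct_growth_to_multiple(pct_increase):
    """Convert a percentage increase into a multiple (a 100% rise is 2x)."""
    return 1 + pct_increase / 100


def crawls_per_referral(crawls, referrals):
    """Crawl-to-refer ratio: pages fetched per human visit sent back."""
    if referrals <= 0:
        raise ValueError("need at least one referral to form a ratio")
    return crawls / referrals


training_vs_search = volume_ratio(SPLIT, "training", "search")
training_vs_live = volume_ratio(SPLIT, "training", "live_agent")
chatgpt_user_multiple = pct_growth_to_multiple(2825)

# Improvement in the quoted operator's ratio after its search launch.
ratio_improvement = crawls_per_referral(500_000, 1) / crawls_per_referral(38_000, 1)

print(f"training vs search:     {training_vs_search:.1f}x")
print(f"training vs live agent: {training_vs_live:.1f}x")
print(f"ChatGPT-User, May 2024 to May 2025: {chatgpt_user_multiple:.2f}x")
print(f"crawl-to-refer improvement: {ratio_improvement:.1f}x")
```

Running the same functions against your own log counts gives a per-site crawl-to-refer ratio, which is usually more informative than a network-wide average.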

The robots.txt Situation Is Messier Than It Looks

Publisher blocking of AI crawlers has escalated steeply. Among top news publishers, 79% now block at least one AI training bot via robots.txt. Across the top 1,000 domains broadly, roughly 25% restrict AI crawlers. GPTBot is disallowed in 11.7% of all analyzed domains as of early 2026—the highest blocking rate of any single AI crawler.

[Figure: AI Crawler Block Rates by Site Category (robots.txt, 2025-2026). News publishers block at 7x the rate of the broader web. A 2025 economic analysis found blocking correlates with 23% lower total monthly traffic. Source: BuzzStream/Press Gazette publisher analysis; arXiv:2510.09031; CDN provider network data.]

The compliance picture is bleaker than the blocking rate implies. Research presented at ACM IMC 2025, covering 130 declared bots tested against 36 controlled websites over 40 days, found that AI search crawlers rarely check robots.txt at all. Some bots never fetched the robots file during the study window. Others fetched it and ignored directives, or deployed undeclared secondary crawlers operating under different user-agent identities to circumvent blocks.

The most counterintuitive finding is the traffic impact. An economic analysis of 500 top news publishers spanning two and a half years of data found that publishers blocking AI crawlers via robots.txt saw a 23.1% decline in total monthly visits and a 13.9% decline in human-only browsing. Blocking appeared to reduce overall search visibility—while having no measurable effect on whether AI-generated responses cited those publishers.

Training vs. Live: Different Log Signatures

Training crawlers operate on cycles measured in days. The log pattern: sequential fetches of full HTML documents at measured intervals, often starting from the sitemap. No session correlation—each URL is an independent request. A 2025 site log study covering 12 production sites over 30 days found GPTBot's median revisit interval at 2.4 days for high-traffic pages, compressing to 1.6 days when a fresh Last-Modified header signals updated content.
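
A revisit interval like the one above can be estimated from your own logs. The following is a minimal sketch, assuming you have already extracted (URL, timestamp) pairs for a single crawler; it pools the gaps between consecutive fetches of each URL and takes the median. The cited study's exact method is not described here, so treat this as one plausible way to compute the number, not a reproduction.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_revisit_days(hits):
    """Median gap, in days, between consecutive fetches of the same URL.

    hits: iterable of (url, datetime) pairs for one crawler.
    Returns None when no URL was fetched more than once.
    """
    by_url = defaultdict(list)
    for url, fetched_at in hits:
        by_url[url].append(fetched_at)

    gaps = []
    for stamps in by_url.values():
        stamps.sort()
        # Pair each fetch with the next fetch of the same URL.
        for earlier, later in zip(stamps, stamps[1:]):
            gaps.append((later - earlier).total_seconds() / 86400)

    return median(gaps) if gaps else None


# Illustrative data only, not taken from the study.
sample_hits = [
    ("/pricing", datetime(2025, 1, 1)),
    ("/pricing", datetime(2025, 1, 3)),
    ("/pricing", datetime(2025, 1, 6)),
    ("/blog/post", datetime(2025, 1, 1)),
    ("/blog/post", datetime(2025, 1, 2)),
]
print(median_revisit_days(sample_hits))
```

Comparing the result for pages with and without a recently changed Last-Modified header is a simple way to check whether the compression effect shows up on your own site.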

Consistent across CDN-level analyses: all major AI crawlers operate exclusively from data center IP ranges, unlike Googlebot, which distributes crawls geographically. They also waste a significant share of capacity: both GPTBot and ClaudeBot spend over 34% of requests on URLs that return 404. A current sitemap should eliminate much of that overhead.
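
To see how this plays out on a single site, access logs can be bucketed by crawler purpose and checked for wasted 404 requests. The sketch below is illustrative: it assumes logs in the common "combined" format, and the purpose mapping is an assumption based on how operators publicly describe these bots, so verify it against each operator's current documentation. User-agent strings can also be spoofed, so a production version would confirm source IP ranges as well.

```python
import re
from collections import Counter

# Assumed mapping from user-agent token to crawl purpose (not authoritative).
PURPOSE_BY_TOKEN = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "live_agent",
}

# Request line, status, size, referer, user-agent of a combined-format entry.
LOG_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def classify(user_agent):
    """Return (token, purpose) for a known AI crawler, else (None, 'other')."""
    lowered = user_agent.lower()
    for token, purpose in PURPOSE_BY_TOKEN.items():
        if token.lower() in lowered:
            return token, purpose
    return None, "other"


def summarize(log_lines):
    """Count requests per purpose and the 404 share per known crawler."""
    by_purpose, total, not_found = Counter(), Counter(), Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        token, purpose = classify(match["ua"])
        by_purpose[purpose] += 1
        if token:
            total[token] += 1
            if match["status"] == "404":
                not_found[token] += 1
    share_404 = {token: not_found[token] / total[token] for token in total}
    return by_purpose, share_404
```

A crawler whose 404 share sits far above your site-wide baseline is probably working from stale URLs, which is the case a refreshed sitemap is meant to address.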

Live agents fire in response to user queries, meaning they need rendered content from a specific URL within seconds. They hit dynamic paths and content linked from within the query context more often than sitemapped URLs. They respond to content signals: an empty JavaScript container or a paywall stub stops a live agent at the parse step; a training crawler stores the DOM state regardless. The practical requirement for being surfaced in live AI search responses is not robots.txt configuration. It is delivering a usable HTML response in under roughly 500 milliseconds.
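
One way to test for the empty-container failure is to measure how much readable text the raw HTML carries before any JavaScript runs. The sketch below is a rough heuristic, not a description of how any particular agent parses pages: the 200-character threshold and the function names are assumptions chosen for illustration, and it does not measure response time.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts characters of text outside script, style, and template tags."""

    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())


def visible_text_chars(html):
    """Approximate count of readable characters in the unrendered HTML."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.chars


def looks_like_empty_shell(html, min_chars=200):
    """True when the page has too little static text to be useful.

    min_chars is an arbitrary illustrative threshold; tune it per site.
    """
    return visible_text_chars(html) < min_chars


spa_shell = (
    "<html><head><title>App</title></head>"
    '<body><div id="root"></div><script>window.boot = true;</script></body></html>'
)
print(looks_like_empty_shell(spa_shell))
```

Feeding this the response body your server returns to a request without JavaScript execution shows roughly what a non-rendering agent has to work with.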

What This Means for Site Owners

Segment your robots.txt by crawler purpose, not just user-agent name. A blanket Disallow: / applied to all AI bot identities blocks training scrapers and live search agents together. If your content is non-paywalled and you want presence in AI-generated answers, the live-agent and search crawlers (ChatGPT-User, OAI-SearchBot, PerplexityBot) should stay allowed; batch training crawlers (GPTBot, Bytespider, CCBot) are the ones you can disallow without cutting off the referral channel.
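
A purpose-segmented robots.txt can be checked before deployment with Python's standard urllib.robotparser module. The sketch below is one possible layout using the user-agent names from the paragraph above; the example.com URL is a placeholder, and the grouping reflects this article's recommendation, not any operator's requirement.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that separates crawlers by purpose.
ROBOTS_TXT = """\
# Batch training crawlers: disallowed.
User-agent: GPTBot
User-agent: Bytespider
User-agent: CCBot
Disallow: /

# Live-agent and search crawlers: allowed.
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Allow: /

# Everyone else: allowed by default.
User-agent: *
Allow: /
"""


def access_table(robots_text, agents, url="https://example.com/article"):
    """Map each user-agent name to whether these rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


table = access_table(
    ROBOTS_TXT,
    ["GPTBot", "Bytespider", "CCBot", "ChatGPT-User", "OAI-SearchBot", "PerplexityBot"],
)
for agent, allowed in table.items():
    print(f"{agent:15s} {'allowed' if allowed else 'blocked'}")
```

This only confirms what the file asks for. As the compliance research described earlier shows, some crawlers never fetch robots.txt or ignore it, so enforcement still has to happen at the server or CDN layer.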

Treat the 2% live-agent share as a leading indicator, not a current measure. Growth of 15x or more a year, compounding for two more years, would put live-agent traffic in a materially different position from where it sits today. Sites currently seeing no AI referral traffic may well be reached by live agents and failing at the render step, not the crawl step. Pre-rendered HTML that loads in under 500ms is the gating requirement, not robots.txt settings.

Reconsider blanket training blocks in light of the traffic data. The 23% total-visits decline represents a concrete cost that most robots.txt blocking decisions have not accounted for. The working hypothesis is that crawl coverage and traditional search ranking are correlated, so removing training crawl access affects the entire distribution chain, not just AI attribution.