Bot Traffic · June 7, 2026

Training Crawlers Take 80%, Return Almost Nothing: The AI Bot Split Hiding in Your Logs

In 2025, training crawlers consumed 80% of all AI bot traffic while some returned fewer than one referral per 38,000 pages crawled. Here is who is crawling, why, and what the mix means for your server.

One major AI crawler ran at a crawl-to-referral ratio of 73,000:1 in January 2025: for every visitor it sent to a site, it had already fetched 73,000 of that site's pages. By July the ratio had declined to 38,000:1. Meanwhile, a competing retrieval bot stayed between 1,276:1 and 1,700:1 across the same period. Most analytics dashboards file both bots under the same "AI traffic" category. They are not the same problem.

Method

Data in this post draws on network-level radar telemetry covering tens of billions of daily requests, cross-referenced with publicly documented user-agent registrations and crawl-purpose declarations from major AI operators. Crawler share figures compare July 2024 to July 2025 as anchor points, with May 2026 snapshots for the most volatile bots. Crawl-to-referral ratios use the same network data, measuring inbound crawler requests against outbound referral clicks tracked at the CDN layer.

Bot Mix: A Year of Rapid Rotation

The AI crawler landscape rewrote itself between July 2024 and July 2025. GPTBot more than doubled its share of AI crawler traffic, rising from 4.7% to 11.7%. ClaudeBot grew from 6% to approximately 10%. In the same window, Amazonbot fell from 10.2% to 5.9%, and Bytespider — formerly one of the largest AI crawlers — collapsed from 14.1% to 2.4% after facing blocks from multiple publishing platforms and scrutiny over robots.txt compliance violations.

AI Crawler Share: July 2024 vs July 2025 (%)

GPTBot more than doubled its share while Bytespider lost 83% of its share in 12 months.

Source: CDN network radar data (tens of billions of daily requests)

The Bytespider contraction did not hold. By May 2026, its share had recovered to 10.25%, more than quadrupling its July 2025 level in under a year. Applebot surged 140% in a single month during Q1 2026, jumping from 2.97% to 7.15% of AI crawler traffic with no public explanation for the acceleration. The four largest crawlers (GPTBot, Meta-ExternalAgent, ClaudeBot, and Bytespider) now represent approximately 74.3% of all identified AI bot requests. No individual crawler's footprint should be treated as stable across quarters.
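The share shifts quoted above are easier to compare as relative changes. The snippet below is a minimal sketch that recomputes them from the percentages in this post; the helper name `relative_change` is ours, not part of any tool.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Share of AI crawler traffic (%), July 2024 -> July 2025, as quoted in this post.
shares = {
    "GPTBot": (4.7, 11.7),
    "Amazonbot": (10.2, 5.9),
    "Bytespider": (14.1, 2.4),
}

changes = {
    bot: round(relative_change(before, after), 1)
    for bot, (before, after) in shares.items()
}

for bot, change in changes.items():
    print(f"{bot}: {change:+.1f}%")

# Later snapshots quoted in this post.
bytespider_rebound = 10.25 / 2.4              # multiple of the July 2025 share
applebot_jump = relative_change(2.97, 7.15)   # single-month change in Q1 2026

print(f"Bytespider rebound: {bytespider_rebound:.2f}x")
print(f"Applebot jump: {applebot_jump:+.1f}%")
```

Relative change is the right lens here because the starting shares differ so much: a loss of 11.7 percentage points means something very different for a bot that started at 14.1% than for one that started at 50%.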

The 80/18/2 Split: What Crawlers Are Actually Doing

Across a rolling 12-month window ending mid-2025, 80% of all AI crawler traffic served training-data collection, 18% served search retrieval — pulling content to answer user queries — and just 2% was user-triggered browsing, where a live user queries an AI assistant and the crawler fetches real-time context. In the six months prior to mid-2025, the training share grew further to 82%.

AI Crawler Traffic by Purpose (12-month rolling, mid-2025)

80% of AI crawler requests served training data collection; only 2% were user-triggered.

Source: CDN network radar — AI crawler purpose classification

Training crawlers and retrieval crawlers leave distinct signatures. Training bots typically hit a URL once, work through wide crawl queues without revisiting, and pull full page HTML. Retrieval bots operate in real-time response to user queries: they revisit recently modified content, cluster around high-velocity pages, and recrawl more frequently than training bots do. A bot that hits the same article page twice in a single day is almost certainly retrieval — your content may be about to appear in an AI-generated answer. A bot that crawls your full archive once and never returns is almost certainly training.
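The same-day revisit signal can be turned into a rough first-pass check. The sketch below assumes you have already parsed your access log into (bot, URL, timestamp) tuples; the bot names are hypothetical, the two-hits-per-day threshold is simply the rule of thumb from the paragraph above, and the labels are a heuristic rather than a definitive classification.

```python
from collections import defaultdict
from datetime import datetime


def classify_by_revisits(hits):
    """Label each bot from an iterable of (bot, url, iso_timestamp) tuples.

    A bot that fetches the same URL two or more times on one calendar day
    is labeled "likely retrieval"; every other bot is "likely training".
    """
    per_day = defaultdict(int)  # (bot, url, date) -> number of fetches
    bots = set()
    for bot, url, timestamp in hits:
        day = datetime.fromisoformat(timestamp).date()
        per_day[(bot, url, day)] += 1
        bots.add(bot)

    labels = {bot: "likely training" for bot in bots}
    for (bot, _url, _day), count in per_day.items():
        if count >= 2:
            labels[bot] = "likely retrieval"
    return labels


# Hypothetical sample data, not taken from a real log.
sample_hits = [
    ("ExampleTrainBot", "/archive/2019/a", "2025-07-01T02:00:00"),
    ("ExampleTrainBot", "/archive/2019/b", "2025-07-01T02:00:05"),
    ("ExampleAnswerBot", "/news/today", "2025-07-01T09:00:00"),
    ("ExampleAnswerBot", "/news/today", "2025-07-01T15:30:00"),
]

labels = classify_by_revisits(sample_hits)
print(labels)
```

A real deployment would want a longer observation window and a cross-check against the operator's declared purpose, since a bot can run both kinds of crawl under one user-agent.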

AI crawler traffic overall grew 187% from January to December 2025 across observed networks. User-driven crawling — where a real user's query triggers the crawl — grew 15x across the same calendar year. As of Q1 2026, AI crawlers represent 22% of all bot traffic, ahead of SEO tool bots and advertising crawlers, surpassed only by traditional search engine spiders.

Crawl-to-Referral: The Reciprocity Gap

The ratio of pages a crawler consumes to referral visits it sends back is the sharpest measure of value exchange. Variation across bots spans orders of magnitude.

Crawl-to-Referral Ratio by AI Crawler (mid-2025)

One training crawler consumed 38,000 pages per referral visit; PerplexityBot stayed under 200:1.

Source: CDN network radar — crawl vs referral traffic comparison

One training-heavy crawler peaked at 73,000:1 in January 2025, declining to 38,000:1 by July. A retrieval-focused competitor held between 1,276:1 and 1,700:1 across the same period. PerplexityBot, which delivers live query results that link directly to source pages, maintained a ratio under 200:1 through most of the second half of 2025. For news and publications specifically, the ratios are mostly lower: one AI search bot ran at 152:1, another at 2,500:1, and a third at 32.7:1.
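The ratio itself is simple arithmetic: crawler requests divided by referral visits over the same window. A minimal sketch follows, with hypothetical function names and made-up counts chosen only so the results land on ratios quoted above.

```python
def crawl_to_referral(crawl_requests: int, referral_visits: int):
    """Return pages crawled per referral visit, or None if there were no referrals."""
    if referral_visits == 0:
        return None
    return crawl_requests / referral_visits


def format_ratio(ratio) -> str:
    """Render a ratio in the N:1 form used in this post."""
    return "no referrals" if ratio is None else f"{ratio:,.0f}:1"


# Hypothetical monthly counts for three bots on one site.
training_bot = format_ratio(crawl_to_referral(380_000, 10))
retrieval_bot = format_ratio(crawl_to_referral(17_000, 10))
silent_bot = format_ratio(crawl_to_referral(50_000, 0))

print(training_bot)
print(retrieval_bot)
print(silent_bot)
```

The zero-referral case is handled separately because a bot that sends no visitors at all has no finite ratio, and that is worth reporting rather than hiding behind a division error.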

These ratios invert the assumption that more AI crawler traffic is better. High crawl volume from a training-only bot is a bandwidth cost with no measurable attribution upside. High volume from a retrieval bot may indicate your content is actively ranking in AI-powered results. Bytespider has been documented crawling at approximately 25 times the request rate of GPTBot — making it cheap to block and expensive to ignore if it converts to a retrieval use case.

Industry Concentration

Retail and e-commerce drew 28.89% of all AI crawler traffic in 2025, making it the most-crawled sector by a significant margin, followed by streaming, media, and travel. For news and publications, GPTBot accounted for 17.4% of inbound AI crawler requests in that sector, while the real-time browsing user-agent from the same operator accounted for 14.9%. The near-equal split between training and live retrieval in news content suggests articles feed both pipelines simultaneously, and that a single piece of content can serve a model training corpus and a live answer engine within the same week.

What This Means for Site Owners

A training crawler at 38,000:1 and a retrieval crawler at 200:1 both appear as "AI bot traffic" in most access logs. They call for opposite responses. For retrieval crawlers, structured data, clear entity definitions, and schema markup improve the quality of content pulled into AI answers. For training crawlers, the question is whether the brand exposure justifies the bandwidth cost, and whether robots.txt exclusions are worth adding given documented compliance gaps from some operators.

The first practical step is log segmentation. Identify which crawlers in your access logs declare themselves as training-only versus retrieval-enabled. Most major AI operators publish user-agent documentation and crawl-purpose declarations. Map your top ten bots by request volume to their stated purpose. The distribution is often lopsided: a site receiving 500,000 AI crawler hits per month with a 90% training split sits in a fundamentally different position than one with the inverse — even when the raw log counts look identical.
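Log segmentation can start as a lookup table plus a counter. The sketch below assumes you have extracted the user-agent string for each AI crawler request. The token-to-purpose mapping is illustrative only and should be checked against each operator's current documentation, and the sample user-agent strings are simplified stand-ins, not the full strings the bots send.

```python
from collections import Counter

# Illustrative mapping (an assumption): verify against operator documentation.
PURPOSE_BY_TOKEN = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "retrieval",
    "PerplexityBot": "retrieval",
    "ChatGPT-User": "user-triggered",
}


def purpose_of(user_agent: str) -> str:
    """Return the declared purpose for a user-agent string, if a known token matches."""
    lowered = user_agent.lower()
    for token, purpose in PURPOSE_BY_TOKEN.items():
        if token.lower() in lowered:
            return purpose
    return "unclassified"


def purpose_split(user_agents) -> dict:
    """Return the percentage of requests per purpose, rounded to one decimal."""
    counts = Counter(purpose_of(ua) for ua in user_agents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {purpose: round(100 * n / total, 1) for purpose, n in counts.items()}


# Simplified, hypothetical sample: nine training hits and one retrieval hit.
sample_agents = ["compatible; GPTBot/1.0"] * 9 + ["compatible; PerplexityBot/1.0"]

split = purpose_split(sample_agents)
print(split)
```

Anything that lands in "unclassified" is the list to research next: it is either a crawler whose purpose you have not yet mapped or one that does not declare a purpose at all.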

Bot-mix volatility adds a second constraint. Bytespider's 83% collapse and subsequent rebound of more than 300%, all in under two years, plus Applebot's unexplained single-month surge, show that static bot policies age quickly. Effective management means treating purpose and user-agent as separate axes, blocking or rate-limiting by declared purpose rather than by name, with quarterly review cycles to catch new crawlers and behavioral shifts in existing ones.
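One way to keep purpose and user-agent on separate axes is to store two small tables, one mapping crawlers to purposes and one mapping purposes to policies, and generate robots.txt from them. The sketch below is an assumption-laden illustration: the crawler list, the purposes assigned to it, and the policy choices are all hypothetical, and it covers only allow and block, since rate limiting has to be enforced at the server or CDN rather than in robots.txt.

```python
# Hypothetical tables: review both each quarter.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "PerplexityBot": "retrieval",
    "ChatGPT-User": "user-triggered",
}

POLICY_BY_PURPOSE = {
    "training": "block",
    "retrieval": "allow",
    "user-triggered": "allow",
}


def build_robots_txt(crawler_purpose: dict, policy_by_purpose: dict) -> str:
    """Render one robots.txt group per crawler from the purpose-level policy."""
    lines = []
    for agent in sorted(crawler_purpose):
        # Purposes without an explicit policy default to "allow".
        policy = policy_by_purpose.get(crawler_purpose[agent], "allow")
        rule = "Disallow: /" if policy == "block" else "Allow: /"
        lines += [f"User-agent: {agent}", rule, ""]
    return "\n".join(lines).rstrip() + "\n"


robots = build_robots_txt(CRAWLER_PURPOSE, POLICY_BY_PURPOSE)
print(robots)
```

With this layout, a crawler that changes purpose needs a one-line edit in the first table, and a policy change for all training bots needs a one-line edit in the second. Because robots.txt is advisory, operators with documented compliance gaps still need enforcement at the network layer.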