Bot Traffic · June 21, 2026

Nine Out of Ten AI Crawler Requests Will Never Return a Citation

Only 9.3% of AI crawler traffic in May 2026 was search-purpose, the kind that can generate a referral. The other 90.7% was extraction. Here is what that split looks like in your log files.

Nine in ten requests from AI crawlers to your server in May 2026 were extraction events: content taken to train or augment a model, with no citation, no referral, and no ranking signal going back to your site. One global CDN operator's network data puts search-purpose AI crawling, the fraction capable of producing a citation, at 9.3% of all AI bot requests for the month. Automated requests now account for more than half of all HTML traffic on the global web, and the breakdown within that bot traffic determines which crawls are worth serving well and which are pure cost.

Method

Data for this analysis comes from three sources: monthly aggregate reports compiled from network-edge telemetry covering billions of HTTP requests per day; a 30-day server log study across 12 sites spanning SaaS, e-commerce, and documentation verticals; and blocking-rate analysis across millions of robots.txt files. Crawler classifications follow published user-agent strings. Purpose attribution (training, mixed, or search) derives from operator documentation and observed crawl behavior patterns. All figures are from Q1–Q2 2026 unless noted.

Who is showing up in your logs

Share of AI-Adjacent Bot HTTP Requests by Crawler — May 2026

Percentage of AI-adjacent HTTP requests by crawler user agent, May 2026. Googlebot included as reference baseline.

Source: CDN operator network radar data, May 2026

By volume, Googlebot remains the largest single bot on the web at 27.3% of AI-adjacent HTTP requests in May 2026. GPTBot holds 11.5%, Bytespider (ByteDance's training crawler) 10.5%, and ClaudeBot 9.7%. The leading search-retrieval crawlers together account for under 5% of total AI bot requests, despite being the only bots that reliably produce a citation or referral back to the crawled page.

Bytespider warrants a separate note. Its share grew from 3.6% in March to 6.5% in April and 10.5% in May, a 192% rise from March to May and the most sustained surge by any single crawler in 2026. It has now passed ClaudeBot to become the third-largest crawler by HTTP request volume, behind Googlebot and GPTBot. Its traffic skews heavily toward product listing pages on e-commerce sites, not blog or documentation content, and no consumer-facing search product is tied to it. For e-commerce operators, Bytespider is the clearest example of high-volume crawling with no referral return.
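
The growth figure can be checked directly from the three monthly shares quoted above. The snippet below is a minimal sketch of that arithmetic; the variable and function names are illustrative and do not come from the underlying report.

```python
# Bytespider's share of AI-adjacent requests, in percent, as quoted above.
BYTESPIDER_SHARE = {"March": 3.6, "April": 6.5, "May": 10.5}


def pct_change(start: float, end: float) -> float:
    """Relative change from start to end, expressed in percent."""
    return (end - start) / start * 100


# Overall change across the whole period.
total_change = pct_change(BYTESPIDER_SHARE["March"], BYTESPIDER_SHARE["May"])
print(f"March to May: {total_change:.0f}%")

# Month-over-month changes, to show how the rise was distributed.
months = list(BYTESPIDER_SHARE)
for previous, current in zip(months, months[1:]):
    change = pct_change(BYTESPIDER_SHARE[previous], BYTESPIDER_SHARE[current])
    print(f"{previous} to {current}: {change:.0f}%")
```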

The purpose split that changes everything

AI Crawler Requests by Purpose — May 2026

Training, mixed-purpose, and search-retrieval shares of all AI bot HTTP requests. Source: web-edge telemetry aggregate, May 2026.

Source: Monthly AI Crawler Report, May 2026

Training-only crawls accounted for 51.8% of AI bot requests in May 2026. Another 35.7% were classified as mixed-purpose, meaning they came from operators that use the same crawler for both training corpus ingestion and model retrieval. Actual search-purpose crawling, the kind tied to a live user query that can return a citation, reached only 9.3%. The three classes sum to 96.8%; the remaining 3.2% is not attributed to any of them in the figures cited here. The search share has grown from under 7% in Q4 2025, led by the scaling of real-time retrieval infrastructure among the dominant AI assistants. But the fundamental ratio remains stark: for every request that might eventually surface your URL in an AI answer, roughly nine others are silent extractions.

The practical consequence is that evaluating AI traffic by aggregate request count means measuring content expenditure, not content return. At the May 2026 search share of 9.3%, a site receiving 50,000 AI crawler hits per day would see only about 4,650 requests from crawlers capable of citing it. The other 45,350 or so are extraction pulls: bandwidth consumed and origin load generated with no referral in exchange.
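
To apply the same split to your own traffic, multiply your daily AI crawler hit count by the published purpose shares. The sketch below does that for the 50,000-hit example. The function name and the "unclassified" bucket are assumptions made here for illustration: the three published shares sum to 96.8%, and the remainder is kept visible instead of being folded into another class.

```python
# Purpose shares of AI bot requests for May 2026, as quoted in this article.
PURPOSE_SHARE = {
    "training": 0.518,
    "mixed": 0.357,
    "search": 0.093,
}


def split_by_purpose(daily_hits: int) -> dict:
    """Estimate daily requests per purpose class from a total hit count."""
    estimate = {
        purpose: round(daily_hits * share)
        for purpose, share in PURPOSE_SHARE.items()
    }
    # Assumption: whatever the three shares do not cover is reported as
    # "unclassified" so that the parts still add up to the total.
    estimate["unclassified"] = daily_hits - sum(estimate.values())
    return estimate


estimate = split_by_purpose(50_000)
for purpose, hits in estimate.items():
    print(f"{purpose:>12}: {hits:>6}")
```

These are estimates from network-wide averages, not measurements. The mix on an individual site can differ substantially, which is why counting by user agent in your own logs is the better signal.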

Crawl intensity is not evenly distributed

Median Daily Crawler Hits per Site by Bot

Median hits per site per day across a 12-site sample spanning SaaS, e-commerce, and documentation. 30-day study period, Q2 2026.

Source: Agentic Crawler Behavior: 30-Day Site Log Study 2026

Across the 12-site log study, GPTBot had a median of 4,200 hits per site per day, the highest of the crawlers measured across the full sample. ClaudeBot came in at 1,800, PerplexityBot at 980, and Google-Extended (the training-specific Googlebot variant) at 540. These figures vary substantially by site type: documentation sites see 2–3× higher GPTBot activity than pure e-commerce sites, while Bytespider runs at medians closer to 6,500 daily hits on product-heavy stores.

Crawl depth also differs systematically by bot. GPTBot operates breadth-first and favors /blog/, /docs/, and /about/ paths. Bytespider prioritizes paginated product listings. PerplexityBot concentrates on recently updated pages, with recrawl intervals that correlate with Last-Modified response headers. Google-Extended largely mirrors Googlebot path selection but at much lower frequency.

Blocking adoption is rising but unevenly distributed

In Q1 2026, GPTBot appeared in 5.52% of DISALLOW rules across analyzed robots.txt files — the highest of any AI crawler — followed by CCBot at 5.08% and ClaudeBot at 4.88%. Among the top 1,000 websites, 25% now block GPTBot, up from 5% in early 2023. The jump is concentrated in news, publishing, and e-commerce; SaaS and documentation sites largely leave all crawlers unrestricted. Among news publishers specifically, 79% block at least one AI training bot — the strongest blocking concentration of any industry.

The blocking data reveals something less obvious: operators are blocking training-only crawlers far more aggressively than search-retrieval crawlers. Search-purpose bots carry materially lower block rates than their training counterparts, suggesting that site owners are beginning to distinguish between traffic that can generate a citation and traffic that cannot.
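
A selective policy of this kind can be expressed in robots.txt and checked before deployment. The sketch below uses Python's standard `urllib.robotparser` to confirm that an example policy blocks three training crawlers while leaving search-retrieval crawlers unrestricted. The policy text, the choice of which crawlers to block, and the test path are illustrative assumptions, not a recommendation drawn from the blocking data.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: disallow selected training crawlers, allow the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
for agent in ("GPTBot", "CCBot", "Bytespider", "OAI-SearchBot", "PerplexityBot"):
    allowed = parser.can_fetch(agent, "/docs/getting-started")
    print(f"{agent:<14} {'allowed' if allowed else 'blocked'}")
```

robots.txt is advisory: it only affects crawlers that choose to honor it, so log analysis is still needed to confirm that a rule has the intended effect.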

What this means for site owners

The 9.3% search-crawl fraction is not fixed — it has climbed from under 7% over the past two quarters as the dominant AI assistants have scaled real-time retrieval infrastructure. Sites that serve well-structured, fast-loading content to search-retrieval bots are positioned to capture value as that fraction grows.

First, know which crawlers are actually hitting you and how each is classified. Check your logs for OAI-SearchBot, PerplexityBot, and other user agents classified as search-purpose. Their hit count is the leading indicator of citation potential; the aggregate AI bot hit count is not.
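
A log check along these lines can be sketched in a few lines of Python. The example assumes access logs in the common "combined" format, where the user agent is the last quoted field. The classification table is a hypothetical starting point to be extended from operator documentation, and the sample log lines use simplified placeholder user-agent strings, not captured traffic.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Hypothetical classification table: user-agent token -> purpose class.
CRAWLER_CLASS = {
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "GPTBot": "training",
    "Bytespider": "training",
    "CCBot": "training",
}

# In the combined log format, the user agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def classify(user_agent: str) -> Optional[str]:
    """Return the purpose class for a user agent, or None if not listed."""
    ua = user_agent.lower()
    for token, purpose in CRAWLER_CLASS.items():
        if token.lower() in ua:
            return purpose
    return None


def count_by_class(lines: Iterable[str]) -> Counter:
    """Count log lines per purpose class, skipping unrecognized agents."""
    counts: Counter = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if match is None:
            continue
        purpose = classify(match.group(1))
        if purpose is not None:
            counts[purpose] += 1
    return counts


# Placeholder log lines with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.7 - - [14/May/2026:10:01:22 +0000] "GET /docs/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [14/May/2026:10:01:25 +0000] "GET /blog/ HTTP/1.1" 200 7340 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.9 - - [14/May/2026:10:01:31 +0000] "GET /products?page=2 HTTP/1.1" 200 9120 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.10 - - [14/May/2026:10:01:40 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]

counts = count_by_class(SAMPLE_LOG)
print(dict(counts))
```

In practice, `count_by_class` would be passed an open log file instead of the sample list. Because user-agent strings can be spoofed, counts derived this way are an upper bound until requests are verified against the IP ranges that operators publish.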

Second, the high-volume training crawlers represent a server cost without a referral return. Serving a training crawler 4,000 requests per day generates meaningful origin load and bandwidth expenditure in exchange for zero referral traffic. Whether that tradeoff is acceptable is a business decision, but it should be a deliberate one and not a passive default.

Third, differentiate your content response by crawler class. A fully rendered, semantically rich document served to a search-retrieval bot improves citation probability. A minimal static response to a training-only crawler reduces server cost without materially affecting your visibility in AI search results. The crawlers use distinct user agents; routing them to different content handlers is a solved engineering problem.
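
One way to picture that routing is a small WSGI application that picks a handler from the request's user agent. This is a sketch under stated assumptions, not a production setup: the token lists, handler names, and page bodies are hypothetical, and in many deployments the same decision would be made at the CDN or reverse proxy instead of in application code.

```python
# Hypothetical token lists; extend them from operator documentation.
SEARCH_TOKENS = ("oai-searchbot", "perplexitybot")
TRAINING_TOKENS = ("gptbot", "bytespider", "ccbot")


def crawler_class(user_agent: str) -> str:
    """Map a user-agent string to 'search', 'training', or 'default'."""
    ua = user_agent.lower()
    if any(token in ua for token in SEARCH_TOKENS):
        return "search"
    if any(token in ua for token in TRAINING_TOKENS):
        return "training"
    return "default"


def full_page(path: str) -> str:
    """Stand-in for the fully rendered, semantically rich document."""
    return f"<html><body><article>Full content for {path}</article></body></html>"


def minimal_page(path: str) -> str:
    """Stand-in for a cheap static response."""
    return f"<html><body><p>Summary for {path}</p></body></html>"


HANDLERS = {"search": full_page, "training": minimal_page, "default": full_page}


def app(environ, start_response):
    """WSGI entry point: choose a handler by crawler class."""
    handler = HANDLERS[crawler_class(environ.get("HTTP_USER_AGENT", ""))]
    body = handler(environ.get("PATH_INFO", "/")).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # The response depends on the user agent, so tell caches to key on it.
        ("Vary", "User-Agent"),
    ])
    return [body]


def demo(user_agent: str):
    """Call the app in-process and return (status, body) for inspection."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"HTTP_USER_AGENT": user_agent, "PATH_INFO": "/docs/intro"}
    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


print(demo("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))
print(demo("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

Human visitors and unrecognized agents fall through to the default handler, so a gap in the token lists results in serving the full page and never in withholding it.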