Which AI bots are actually sending you traffic — and which are just hoovering up your content?
GPTBot crawls 1,700 pages for every referral it sends back. The leading training bot's ratio is 73,000 to one. Here's what the real data says about which AI crawlers are worth optimising for.
In March 2025, AI crawlers were sending more than 50 billion HTTP requests a day across one major CDN network, and that number has kept climbing through mid-2026. But another figure deserves equal attention: GPTBot crawls approximately 1,700 pages for every single referral it sends back to your site. For the leading training-only bot, that ratio sits around 73,000 to one. So before you sink engineering time into making your content more AI-discoverable, it is worth asking: discoverable to which bot, and to what actual end?
Where does this data come from?
The figures here draw on two independently published analyses covering 2025 and the first half of 2026. Akamai's AI Pulse series monitors bot activity across a large edge network and has been publishing quarterly breakdowns since early 2024. A major CDN operator's Radar platform cross-publishes bot traffic share data monthly, and their engineering team has matched crawl events to downstream referral signals in GA4 to produce crawl-to-referral ratios. The purpose-breakdown figures (training vs search vs live user fetch) come from a 28-day window to late June 2026. Because the numbers do not all come from a single source, there is some natural cross-validation baked in.
So who is actually crawling your site right now?
As of May 2026, Googlebot still leads, taking 27.26% of all AI-adjacent bot HTTP requests across one major CDN network. GPTBot sits second at 11.48%. Bytespider (ByteDance's crawler) is third at 10.25%, and ClaudeBot fourth at 9.73%.
What stands out here is not just the rankings but how quickly they shift. In April 2026, ClaudeBot led GPTBot 11.69% to 9.84%. One month later the positions had swapped. Bytespider nearly doubled its share over those same four weeks. If you have been treating bot mix as a stable planning input, stop: it is not one. The leaderboard at the top of your server logs looks meaningfully different every quarter.
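To see how the mix moves on your own site, you can tally user-agent tokens from your access logs. The following is a minimal sketch, assuming you have already extracted the user-agent strings from each log line; the function name `bot_share` is hypothetical, and the shares it returns are relative to matched bot requests only, not to all traffic. User-agent strings can be spoofed, so treat the output as indicative rather than verified.

```python
from collections import Counter

# User-agent tokens for the crawlers named in this article.
BOT_TOKENS = ["Googlebot", "GPTBot", "Bytespider", "ClaudeBot", "PerplexityBot"]


def bot_share(user_agents):
    """Return each known bot's percentage share of the matched bot requests."""
    counts = Counter()
    for ua in user_agents:
        lowered = ua.lower()
        for token in BOT_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
                break  # count each request at most once
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bot: round(100 * n / total, 2) for bot, n in counts.items()}


# Illustrative input only; real user-agent strings are longer.
sample = ["GPTBot/1.0", "GPTBot/1.0", "ClaudeBot/1.0", "Mozilla/5.0 Chrome"]
print(bot_share(sample))
```

Running this per month, rather than once, is what surfaces the kind of position swaps described above.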
Worth separating out: user-initiated requests from major AI assistants — where a real person asks a question and the assistant fetches a live page to respond — now represent roughly 11.3% of all individual bot traffic across the web. That puts them second only to Googlebot at 14.2%, and nearly double GPTBot at 5.8%. These are a fundamentally different category from batch training crawlers. They fire in response to real user intent, they expect a fast and coherent response, and they have a much more direct path to referral traffic.
What are these bots actually doing when they crawl you?
Mostly building someone else's model, not sending you readers.
In the 28 days to late June 2026, 52.3% of all AI crawler requests were explicitly tagged as training-purpose traffic. Mixed-purpose crawling (training plus retrieval) added another 33.0%. Live search indexing — the variety that can generate actual referral clicks — accounted for just 9.3% of requests.
The split was even more skewed earlier in 2025, when training-purpose crawling ran at roughly 80% of all AI bot traffic, with search at 18% and user-driven fetches at around 2%. The ratio has shifted somewhat as agentic AI has scaled up (agentic AI traffic grew over 7,800% across 2025), but the core pattern has not changed. The vast majority of AI bot requests hitting your origin are content going into someone else's model, not someone clicking through to your site.
Is there actually a crawl-to-referral gap?
Yes, and it is larger than most people realise.
When you look at which bots drive measurable referral traffic per crawl event, the spread is extreme:
- PerplexityBot: roughly 194 crawls per referral sent (mid-2025 figures)
- GPTBot: around 1,700 crawls per referral
- Leading training-only bot: approximately 73,000 crawls per referral
Perplexity links to a source in every answer it generates, so PerplexityBot crawls translate into clickable citations, which show up in GA4 as referral traffic you can actually measure. The major training crawlers, by contrast, are building a model or a retrieval index: your content may never surface as a hyperlink, and even when it is cited in a response, the citation is often not clickable.
Does this mean you should block training crawlers entirely? That is a harder call than the ratio alone suggests. There is a reasonable case that presence in training data improves the odds of an AI assistant recommending your site when someone asks a relevant question. That effect is plausible but unproven, and the bandwidth cost is very real. If your site is absorbing 10,000 training crawl requests per day at a 73,000:1 ratio, the expected return is about 0.14 referrals per day, or roughly one referral a week from that specific bot.
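The arithmetic behind that expectation is a single division, sketched below for the three ratios quoted earlier. This assumes the network-wide ratios apply uniformly to your site, which they may not; the function name `expected_referrals` and the 10,000 crawls-per-day figure are illustrative.

```python
def expected_referrals(crawls_per_day, crawls_per_referral):
    """Expected referrals per day, assuming the aggregate ratio holds for this site."""
    return crawls_per_day / crawls_per_referral


# Crawl-to-referral ratios as quoted in the article.
RATIOS = {
    "PerplexityBot": 194,
    "GPTBot": 1_700,
    "Leading training-only bot": 73_000,
}

for bot, ratio in RATIOS.items():
    per_day = expected_referrals(10_000, ratio)
    print(f"{bot}: {per_day:.2f} referrals/day")
```

At the same crawl volume, that works out to about 52 referrals a day for PerplexityBot, about 6 for GPTBot, and about 0.14 for the leading training-only bot.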
Does the type of content you publish change anything?
Quite a lot, actually.
Shopping sites absorb a disproportionate share of AI crawler traffic. Commerce content accounts for 26.3% of all verified bot crawl traffic globally — and AI crawlers specifically skew even harder toward it, sending 31.7% of their requests to shopping pages. If you run an e-commerce site, you are likely seeing an AI crawler load that is meaningfully above the cross-sector average.
There are also notable differences in how different bots approach page types. One major AI crawling platform targets HTML content in 57.7% of its fetches and JavaScript files in 11.5%. Another major training crawler allocates 35.2% of fetches to images and 23.8% to JavaScript. Neither crawler executes JavaScript during the crawl, which means anything that requires client-side rendering is invisible to them regardless of how much time you have spent on discoverability.
What should site owners actually do with this?
The first practical question is whether it makes sense to serve the same response to all AI bots. Given the crawl-to-referral spread above, there is a real case for treating real-time search agents — PerplexityBot and similar — with more care than pure training crawlers. Faster responses, richer schema markup, pre-rendered HTML: these investments are more directly connected to traffic you can measure. For training-only crawlers with extreme ratios, throttling or serving a leaner response is worth considering, especially if your origin costs are material.
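One way to express that split is a small policy table keyed on bot class. The sketch below is an assumption-heavy illustration, not a recommended configuration: the class assignments, the `ExampleTrainingBot` name, and the rate limit of 60 requests per minute are all hypothetical placeholders you would replace with your own classification and thresholds.

```python
from enum import Enum


class BotClass(Enum):
    SEARCH = "search"
    TRAINING = "training"
    UNKNOWN = "unknown"


# Hypothetical classification table; keys are lowercase user-agent tokens.
BOT_CLASSES = {
    "perplexitybot": BotClass.SEARCH,
    "exampletrainingbot": BotClass.TRAINING,  # placeholder, not a real crawler
}

# Hypothetical per-class handling; None means no rate limit is applied.
POLICY = {
    BotClass.SEARCH: {"prerender": True, "max_requests_per_minute": None},
    BotClass.TRAINING: {"prerender": False, "max_requests_per_minute": 60},
    BotClass.UNKNOWN: {"prerender": False, "max_requests_per_minute": None},
}


def classify(user_agent):
    """Map a user-agent string to a bot class by substring match."""
    lowered = user_agent.lower()
    for token, bot_class in BOT_CLASSES.items():
        if token in lowered:
            return bot_class
    return BotClass.UNKNOWN


def policy_for(user_agent):
    """Return the handling policy for the class this user agent falls into."""
    return POLICY[classify(user_agent)]


print(policy_for("Mozilla/5.0 (compatible; PerplexityBot/1.0)"))
```

Keeping classification and policy in separate tables means a bot can be reclassified as its behaviour changes without touching the handling rules.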
If your site depends heavily on client-side rendering, the second question is whether AI bots can see your content at all. For most modern frontend frameworks, without explicit pre-rendering, the answer is no. A training crawler hitting your single-page application gets back an empty shell. You could spend months optimising AI discoverability and have it make zero practical difference if the bot receives a blank page. Checking what a static-fetch bot actually receives from your key URLs is the sensible first step.
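A rough version of that check is to fetch the raw HTML without running any JavaScript and measure how much readable text it contains. The following is a minimal sketch using only the Python standard library; the helper names `visible_text` and `fetch_static` and the `static-fetch-check/0.1` user agent are hypothetical, and real crawlers may differ in which headers they send and how they parse markup.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring content a non-rendering bot cannot read as prose."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a static fetch exposes, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_static(url, user_agent="static-fetch-check/0.1"):
    """Fetch a URL once, without executing JavaScript, and return the decoded body."""
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req, timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# A single-page-app shell exposes almost nothing beyond its title.
shell = '<html><head><title>App</title><script>var x = 1;</script></head><body><div id="root"></div></body></html>'
print(repr(visible_text(shell)))
```

For a live check, `visible_text(fetch_static(url))` on a few key URLs gives a quick signal: if the result is little more than the page title, a static-fetch bot is likely seeing the empty shell.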
The third thing worth watching closely is agentic AI traffic. User-initiated fetches from the major AI assistants are growing fast (they now sit just behind Googlebot in volume across the web), and they behave differently from batch crawlers: real-time, intent-driven, with a more direct path to a follow-on visit if your content answers the question well. Treating them like training crawlers means leaving referral traffic on the table.