Training Bots vs. Retrieval Bots: Has the AI Crawler Mix Shifted More Than You Think?
Bots now account for 57.5% of HTML web traffic — the first bot majority. But the real shift is inside: training crawlers are losing share to real-time retrieval bots that want fresh content, not archives.
Bots now account for 57.5% of HTML web traffic, the first time automated requests have outnumbered real users on the open web. But if you manage a website and you're trying to work out what to actually do about that, the top-line figure is almost beside the point. The more useful question is: how has the mix of bots changed? A year ago, nine in ten AI crawler requests were training runs, large batch sweeps building or refreshing model datasets. By the end of 2025 that share had dropped to around 74%, and by May 2026 pure training traffic stood at roughly 52%, with mixed-purpose crawlers counted separately. The gap has been filled by real-time retrieval bots: scrapers feeding live answers into AI search engines. These two types of crawler behave nothing alike, and lumping them together in your bot management strategy is an increasingly costly assumption.
Where does this data come from?
Three sources underpin this post. The HUMAN Security 2026 State of AI Traffic & Cyberthreat Benchmark Report covers trillions of HTTP interactions across its detection network and breaks AI bot traffic down by intent: training, scraping, and agentic use. The WebSearch API monthly crawler report for May 2026 gives per-crawler HTTP request share figures from verified bot traffic analysis. And a 30-day site log study from Digital Applied, covering 12 production websites across March and April 2026, is the source for the crawl-frequency figures below, data that's hard to find anywhere else. Where figures from different sources appear to conflict, we note it.
So which AI bots are actually showing up in your logs right now?
Among the crawlers that AI infrastructure operators are running — setting aside traditional search bots for a moment — the May 2026 picture looks like this:
GPTBot holds 11.48% of verified AI bot HTTP requests. ClaudeBot is close behind at 9.73%. The month's headline was Bytespider, the crawler operated by ByteDance, which surged 61% month-on-month to reach 10.5% — pushing past several established crawlers to rank fourth across all AI-adjacent verified bots. Googlebot still dominates overall at 27.26% of AI-adjacent bot requests, but the interesting story is in the AI-specific tier, where the competitive picture changed substantially in a single month.
The top five AI infrastructure operators still account for 69.5% of total AI crawler traffic — but that's down from 73.9% in April, meaning the long tail is picking up. Is the concentration of crawling power starting to erode? Possibly. Watch the next few monthly reports.
Has the training vs. retrieval split shifted as fast as you'd expect?
Here is the number that probably deserves more attention than it's getting: in early 2025, roughly 90% of all AI-driven crawler requests were classified as training runs. By the end of 2025, that had dropped to around 74%, with scraper and retrieval bots growing from 10% to 24% of the mix. By May 2026, CDN network analysis puts pure training traffic at roughly 52%, with mixed-purpose crawlers (doing both training and retrieval) adding 36%, and pure search-only retrieval at 9%.
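The three May 2026 categories do not add up to 100%, and the training share changes a lot depending on whether mixed-purpose crawlers are counted in. A minimal Python sketch of that arithmetic, using only the percentages quoted above (the variable names are ours, and calling the remainder "unclassified" is our assumption, not something the source states):

```python
# Shares of AI crawler requests quoted in the text for May 2026 (percent).
may_2026 = {"pure_training": 52, "mixed_purpose": 36, "search_only_retrieval": 9}

# Whatever the three categories do not cover. Labelling it "unclassified"
# is our assumption; the source does not say what the remainder is.
unclassified = 100 - sum(may_2026.values())

# Training share under two counting conventions.
training_narrow = may_2026["pure_training"]                   # pure training only
training_broad = training_narrow + may_2026["mixed_purpose"]  # plus mixed-purpose

# Crawlers doing any retrieval at all (mixed-purpose plus search-only).
any_retrieval = may_2026["mixed_purpose"] + may_2026["search_only_retrieval"]

print(f"unclassified: {unclassified}%")
print(f"training share: {training_narrow}% (narrow) to {training_broad}% (broad)")
print(f"any retrieval activity: {any_retrieval}%")
```

Read this way, the headline drop in training traffic is really a rise in crawlers that do both jobs: on these figures, close to half of AI crawler requests come from bots with some retrieval role.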
AI scraper traffic specifically grew 597% between January and December 2025, roughly a sevenfold increase. That figure is worth pausing on: a category that barely existed as a distinct measurement two years ago now represents the fastest-growing segment of AI bot activity on the web. And these scrapers are not building archives. They are answering a user's live question by checking whether your pricing page, your product description, or your technical documentation has been updated since the last time they visited, which may have been yesterday.
What does this shift mean in practice? If you see a spike in AI bot hits on your server logs, the odds that those bots are running a training sweep are lower than they were eighteen months ago. Increasingly, they want to know what your site looks like right now.
How hard are these bots actually hitting sites?
Crawl frequency is where the difference between training bots and retrieval bots gets concrete — and where the implications for your server infrastructure and bot management start to bite.
A 30-day log study across 12 production websites found GPTBot averaging 4,200 hits per site per day. ClaudeBot came in at roughly 1,800 hits per day. Bytespider reached up to 6,500 hits per day on e-commerce sites with dense product listings, making it the most intensive crawler measured by raw hit volume in the study. PerplexityBot averaged around 980 hits per day — fewer than the others — but revisited active pages every one to three days, meaning it checks back more consistently than the bigger-volume bots.
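To put those daily figures in server terms, here is a rough Python sketch that converts them into 30-day totals and an average gap between requests. It uses only the numbers quoted above; the function name is ours, and it assumes hits are spread evenly across the day, which real crawlers rarely do, so treat the gaps as a floor on burstiness rather than a forecast.

```python
# Hits per site per day as quoted from the 30-day log study.
# Bytespider's figure is the reported peak on e-commerce sites, not an average.
hits_per_day = {
    "GPTBot": 4200,
    "ClaudeBot": 1800,
    "Bytespider": 6500,
    "PerplexityBot": 980,
}

SECONDS_PER_DAY = 24 * 60 * 60
STUDY_DAYS = 30


def load_summary(daily_hits: int) -> tuple[int, float]:
    """Return (hits over the study window, average seconds between hits)."""
    return daily_hits * STUDY_DAYS, SECONDS_PER_DAY / daily_hits


for bot, daily in hits_per_day.items():
    total, gap = load_summary(daily)
    print(f"{bot:14s} {total:>8,d} hits / 30 days, one hit every {gap:5.1f} s")
```

Swapping in your own per-bot daily counts from access logs gives a quick sense of whether any single crawler is a meaningful fraction of your request budget.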
Compare those figures with a typical training crawler: one sweep, then potentially weeks or months before the next visit. That difference matters most if your site requires JavaScript execution to render any meaningful content, so that an AI crawler fetching your homepage gets back an empty shell of HTML and a pile of script tags. A retrieval bot that comes away empty today will be back tomorrow, which gives you a quick second chance once the page is fixed. A training crawler that comes away empty today may not bother again for a long time.
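One way to check whether a page is an "empty shell" to a crawler that does not run JavaScript is to count the words present in the raw HTML. The sketch below does that with Python's standard `html.parser`. The class and function names and the 50-word threshold are our own illustrative choices, and it works on an HTML string you have already fetched; it says nothing about how any particular crawler actually parses pages.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text present in the raw HTML, skipping script and style bodies."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    """Count words a non-JavaScript client would find in the markup (title included)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


def looks_like_js_shell(html: str, min_words: int = 50) -> bool:
    """Heuristic: flag pages with fewer than min_words words in the raw HTML."""
    return visible_word_count(html) < min_words


# Hypothetical client-rendered page: a title, an empty root div, and scripts.
shell = (
    '<html><head><title>Shop</title><script src="/app.js"></script></head>'
    '<body><div id="root"></div><script>window.__STATE__ = {"a": 1};</script>'
    "</body></html>"
)
print(visible_word_count(shell), looks_like_js_shell(shell))
```

Running this against the server response for your key pages (fetched without a browser) is a cheap first test before investing in server-side rendering.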
What should site owners actually do with this?
The training/retrieval distinction matters most if your goal is to appear in AI-generated answers — the product comparisons, cited sources, and recommendation summaries that increasingly show up where traditional search results used to be. For that outcome, retrieval bots are the more direct pipeline: they read your content today and an AI assistant quotes it in a user query tonight. If those bots hit a JavaScript-only page and come back empty-handed, you've missed that window.
Should you block training crawlers to protect your content? The answer is murkier than it might seem. Training runs are still the mechanism by which AI models build up their internal understanding of what your site covers and what it's authoritative on. GPTBot is already listed in Disallow rules on 5.52% of verified sites, making it the most-blocked AI crawler in robots.txt analysis, but what effect that has on citation frequency downstream is not yet established. Blocking training access may reduce your baseline presence in model knowledge even if it protects your content from dataset reuse.
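If you do decide to block a specific crawler, it is worth confirming that your robots.txt says what you think it says. The sketch below uses Python's standard `urllib.robotparser` to test a hypothetical robots.txt that disallows GPTBot and allows everything else. The file contents, the `allowed` helper, and the example.com URL are illustrative assumptions, and the check only shows how a compliant parser reads the rules, not whether a given crawler honors them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one crawler by name, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("GPTBot", "PerplexityBot"):
    print(bot, allowed(ROBOTS_TXT, bot, "https://example.com/pricing"))
```

Here the named group applies to GPTBot and the wildcard group applies to PerplexityBot, so only the former is refused.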
The more useful approach is to differentiate by bot class rather than blocking by default. Let retrieval bots reach fast-loading, semantically well-structured pages — server-rendered HTML with clear schema markup, not a skeleton waiting for client-side hydration. Use training-bot visits as an opportunity to serve richer, more context-dense content, since those crawlers are building long-term representations of what your site is about. And add a server-log filter that separates bot categories by user agent string. The names are distinct enough that this is a configuration change — a handful of web server or WAF config rules — rather than a development project.
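As a starting point for that log filter, here is a minimal Python sketch that tallies access-log lines by bot class. It assumes logs in the common "combined" format, where the user agent is the last quoted field. The bot names come from this post, but the class assigned to each one in `BOT_CLASSES` is our illustrative guess and should be checked against each operator's published crawler documentation. User-agent strings can also be spoofed, so this is a first-pass sort, not verification.

```python
import re
from collections import Counter

# Assumed mapping from user-agent token to bot class (illustrative only).
BOT_CLASSES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "PerplexityBot": "retrieval",
}

# Combined log format: the user agent is the last quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def classify(log_line: str) -> str:
    """Return the bot class for one log line, or 'other' / 'unparsed'."""
    match = USER_AGENT_RE.search(log_line)
    if not match:
        return "unparsed"
    user_agent = match.group(1).lower()
    for token, bot_class in BOT_CLASSES.items():
        if token.lower() in user_agent:
            return bot_class
    return "other"


def summarize(log_lines) -> Counter:
    """Count log lines per bot class."""
    return Counter(classify(line) for line in log_lines)


# Hypothetical log lines with simplified user-agent strings.
sample = [
    '203.0.113.5 - - [12/May/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot)"',
    '203.0.113.6 - - [12/May/2026:10:00:01 +0000] "GET /docs HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot)"',
    '203.0.113.7 - - [12/May/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot)"',
    '203.0.113.8 - - [12/May/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
print(summarize(sample))
```

Keeping the mapping in one dictionary means a new crawler is a one-line addition, and the same table can later drive equivalent web server or WAF rules.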
The bot mix has shifted. The question is whether your infrastructure has kept up.