What's Actually Crawling Your Site? The AI Bot Mix in Mid-2026
More than half of all web traffic is automated, and the AI crawler slice grew 187% in 2025 alone. Here's who's showing up, what they want, and why the training-vs-agent split changes everything.
If you haven't looked at your access logs recently, here's the number that should make you do it: more than half of all HTTP requests hitting websites are now automated. Bots crossed the 50% threshold for the first time in 2024. By the end of 2025, AI-specific crawlers had grown 187% year-on-year, making them the fastest-growing segment of bot traffic on the web. The part that gets buried in the headline, though? For the majority of that traffic, the crawl-to-referral ratio runs into the tens of thousands to one. Something is reading your site constantly — it's just probably not going to send you a visitor any time soon.
Where does this data come from?
The figures here draw on three independent datasets: WebSearchAPI's monthly AI crawler reports tracking HTTP request share across a large monitored sample of websites, the HUMAN Security 2026 AI Traffic & Cyberthreat Benchmark Report covering full-year 2025 data, and edge-network telemetry that aggregates hundreds of billions of daily requests by crawler type and declared purpose. These are observed traffic counts, not survey estimates.
So who's at the top of the crawler leaderboard right now?
Four crawlers dominate the AI bot leaderboard by request volume as of May 2026. Meta-ExternalAgent led at 13.1% of AI bot HTTP requests, with GPTBot close behind at 11.5%. But the most striking story of 2026 so far is Bytespider, ByteDance's AI training crawler, whose share has nearly tripled from 3.6% in March to 10.5% in May. That run includes a 61% jump in a single month and marks its third consecutive month of growth. At that pace, it's on track to overtake GPTBot well before the year's end.
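The Bytespider numbers are easy to sanity-check. Here's a minimal sketch that uses only the March and May shares and the 61% single-month jump quoted above. It assumes the 61% refers to April to May, so the April share it prints is implied by the arithmetic, not a reported figure.

```python
# Shares of AI bot HTTP requests quoted in the text (percent).
march_share = 3.6
may_share = 10.5
last_month_jump = 0.61  # the "61% jump in a single month"


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100


two_month_growth = pct_change(march_share, may_share)

# Assumption: the 61% jump is April -> May. The April share below is
# back-calculated from the May figure; it is not taken from any report.
implied_april_share = may_share / (1 + last_month_jump)

print(f"March -> May growth: {two_month_growth:.0f}%")
print(f"Implied April share: {implied_april_share:.1f}%")
```

Read this way, the March-to-May figures describe growth over two months, and the 61% figure describes only the most recent one.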
ClaudeBot rounds out the top four at 9.7%. It briefly held second place in April at 11.7% before GPTBot edged back ahead in May. These two have been trading positions for much of 2026. The long tail (PerplexityBot, OAI-SearchBot, Applebot-Extended, Amazonbot, and a dozen or so others) collectively makes up the remaining share of AI bot requests. The leaderboard is moving faster than ever: Bytespider wasn't in the top ten six months ago.
Are all these crawlers actually sending anyone back to your site?
The purpose breakdown is where things get interesting, and where the standard "total bot traffic" metric starts to mislead. Only 2.3% of AI crawler requests in H1 2026 were classified as "user action", meaning a real person triggered that fetch in real time. Training accounts for 48.5%, mixed-purpose (training plus retrieval) adds another 40.2%, and search/indexing covers 8.5%. Put training and mixed-purpose together and nearly 89% of AI crawler traffic is feeding model weights or building retrieval indexes rather than returning a human to your site.
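If you want to see how those purpose shares combine, this short sketch sums them. The dictionary keys are labels chosen here for illustration; the percentages are the ones quoted above.

```python
# Purpose shares of AI crawler requests in H1 2026, as quoted in the text (percent).
purpose_share = {
    "training": 48.5,
    "mixed": 40.2,
    "search_indexing": 8.5,
    "user_action": 2.3,
}

# Traffic that feeds models or indexes rather than a live human request.
training_plus_mixed = purpose_share["training"] + purpose_share["mixed"]

# The quoted categories do not sum to exactly 100; the remainder is
# presumably rounding or unclassified traffic (an assumption, not a reported fact).
total = sum(purpose_share.values())
remainder = 100 - total

print(f"Training + mixed: {training_plus_mixed:.1f}%")
print(f"Sum of categories: {total:.1f}% (remainder {remainder:.1f}%)")
```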
This is consistent with what crawl-to-referral analysis has found across major training-focused bots: for some of the highest-volume crawlers, you can see tens of thousands of page fetches before a single referral visit shows up in your analytics. Your server absorbs the request cost; the human benefit lands somewhere else. That's not necessarily a reason to block everything — real-time AI search bots do send referrals — but it does mean the category of "AI bot traffic" contains some very different things that probably shouldn't be handled the same way.
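A crawl-to-referral ratio is just fetches divided by referral visits, tracked per crawler. The sketch below shows the calculation under one big assumption: that you already have a way to attribute referral visits to the same operator as the crawler (for example by referrer host), which is site-specific and not shown. The bot names and counts are made up for illustration.

```python
def crawl_to_referral(fetches: dict, referrals: dict) -> dict:
    """Fetches per referral visit for each crawler; None if no referrals were seen."""
    ratios = {}
    for bot, fetch_count in fetches.items():
        referral_count = referrals.get(bot, 0)
        # Avoid dividing by zero: a crawler with no referrals has no finite ratio.
        ratios[bot] = fetch_count / referral_count if referral_count else None
    return ratios


# Hypothetical counts for illustration only; not figures from the cited reports.
fetches = {"training-bot": 84_000, "search-bot": 1_200}
referrals = {"training-bot": 2, "search-bot": 60}

ratios = crawl_to_referral(fetches, referrals)
for bot, ratio in ratios.items():
    label = "no referrals observed" if ratio is None else f"{ratio:,.0f} fetches per referral"
    print(f"{bot}: {label}")
```

Returning None for crawlers with zero referrals keeps "never sent anyone" distinct from "sent very few", which is the distinction that matters when deciding what to allow.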
The picture is also shifting. Data from a full-year 2025 study shows training crawlers declined from around 90% of all AI-driven traffic in January 2025 to 74% by December, while agentic bots — autonomous agents navigating the web on behalf of a real user — emerged as a distinct category at 1.7% of total AI-driven traffic. Agentic traffic grew faster than any other bot segment during 2025. It's small in absolute terms, but it's the segment most likely to actually convert.
Why do different bots behave so differently from each other?
Not all AI crawlers work the same way, and the differences have real consequences for access control and content delivery.
What they fetch. GPTBot directs 57.7% of its requests at HTML content but doesn't execute JavaScript, so client-rendered pages can appear near-blank to it. ClaudeBot makes a notably higher share of image requests (35.2% of its fetches) than most other crawlers, which suggests a training pipeline building stronger multimodal capabilities alongside text. If your site relies heavily on client-side rendering, GPTBot may be seeing a significantly thinner version of what you've built.
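One rough way to gauge what a crawler that skips JavaScript gets from a page is to pull the text out of the raw HTML without running any scripts. The sketch below does that with Python's standard-library HTML parser. It is an approximation for illustration, not a model of how any particular crawler processes pages, and the two sample pages are invented.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text present in raw HTML, ignoring script, style and template bodies."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample pages: one rendered in the browser, one rendered on the server.
client_rendered = (
    '<html><body><div id="root"></div>'
    '<script>render("Full product catalog")</script></body></html>'
)
server_rendered = (
    "<html><body><h1>Full product catalog</h1>"
    "<p>42 items in stock</p></body></html>"
)

print(repr(visible_text(client_rendered)))
print(repr(visible_text(server_rendered)))
```

Running this against the raw response for your own key pages gives a quick read on how much content exists before any JavaScript runs.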
How they handle robots.txt. Compliance varies significantly between crawlers. Some check your robots.txt multiple times per day before any crawl begins. Others rarely check it, or don't check it at all. If you've added AI crawler disallow rules to your robots.txt and certain bots are still generating substantial traffic, this is almost certainly why. robots.txt alone isn't sufficient access control for the full crawler ecosystem.
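To see what a compliant crawler would conclude from your rules, you can evaluate a robots.txt with Python's standard urllib.robotparser. The rules below are a hypothetical policy (block two training crawlers, allow a search crawler), not a recommendation, and example.com is a placeholder. This only tells you what a well-behaved bot should do; it says nothing about bots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy for illustration: per-crawler rules plus a default.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/articles/1"  # placeholder URL
bots = ("GPTBot", "Bytespider", "OAI-SearchBot", "SomeBrowser")

# True means a compliant client with that user-agent may fetch the page.
verdicts = {bot: parser.can_fetch(bot, page) for bot in bots}

for bot, allowed in verdicts.items():
    print(f"{bot}: {'allowed' if allowed else 'disallowed'}")
```

If a bot marked disallowed here still shows up heavily in your logs, the rules are being ignored (or the user-agent is being spoofed), and enforcement has to happen at the server or edge instead.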
Crawl patterns and burst behavior. GPTBot has been documented arriving on a site and executing over 150 requests in a single 3-minute burst, then going quiet for weeks. The pattern looks like triggered activation — something causes the crawler to target a site, it sweeps aggressively, then disappears. For site operators, this can look like a traffic anomaly in your logs if you're not expecting it. The geographic angle adds nuance too: North America accounts for the highest concentration of crawling activity by volume, with roughly 90% of North American AI bot traffic coming from training-focused crawlers rather than real-time search agents.
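Bursts like that are straightforward to flag once you have request timestamps for a single crawler. Here's a minimal sliding-window sketch. The 3-minute window and 150-request threshold mirror the burst described above, but they are tunable assumptions, and the timestamps are synthetic.

```python
from collections import deque
from datetime import datetime, timedelta


def find_bursts(timestamps, window=timedelta(minutes=3), threshold=150):
    """Return the start time of each burst: `threshold` or more requests within `window`."""
    recent = deque()
    bursts = []
    in_burst = False
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop requests that have fallen out of the sliding window.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            if not in_burst:
                bursts.append(recent[0])
                in_burst = True
        else:
            in_burst = False
    return bursts


# Synthetic data: 160 requests one second apart, then a trickle of one per day.
start = datetime(2026, 5, 1, 12, 0, 0)
burst_requests = [start + timedelta(seconds=i) for i in range(160)]
quiet_requests = [start + timedelta(days=d) for d in range(1, 6)]

bursts = find_bursts(burst_requests + quiet_requests)
print(bursts)
```

Feeding it the timestamps for one user-agent at a time keeps a burst from one crawler from being masked or inflated by another.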
What should you actually do about this?
Stop treating AI bots as a single category. The gap between a training crawler that will never send you a visitor and a real-time AI search bot that surfaces your content in live answers is enormous. Blanket blocking cuts off both equally, even though the business case for allowing each is completely different. Targeted rules by user-agent, with different handling for training, search, and agent traffic, are a far more defensible approach than all-or-nothing.
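In practice, "targeted rules by user-agent" can start as a lookup table. The sketch below is one way to structure it, with loud caveats: the purpose assigned to each bot is an illustrative assumption (only Bytespider is described as a training crawler in this article, and ChatGPT-User is an agent-style user-agent not discussed above), and the policy values are hypothetical. Check each operator's own documentation before relying on any of it.

```python
from enum import Enum


class Purpose(Enum):
    TRAINING = "training"
    SEARCH = "search"
    AGENT = "agent"
    UNKNOWN = "unknown"


# Assumption: illustrative purpose labels, keyed by lowercase user-agent token.
BOT_PURPOSE = {
    "bytespider": Purpose.TRAINING,
    "gptbot": Purpose.TRAINING,
    "claudebot": Purpose.TRAINING,
    "oai-searchbot": Purpose.SEARCH,
    "perplexitybot": Purpose.SEARCH,
    "chatgpt-user": Purpose.AGENT,
}

# Hypothetical policy: what to do with each category of traffic.
POLICY = {
    Purpose.TRAINING: "block",
    Purpose.SEARCH: "allow",
    Purpose.AGENT: "allow",
    Purpose.UNKNOWN: "rate_limit",
}


def classify(user_agent: str) -> Purpose:
    """Match known bot tokens as case-insensitive substrings of the user-agent."""
    ua = user_agent.lower()
    for token, purpose in BOT_PURPOSE.items():
        if token in ua:
            return purpose
    return Purpose.UNKNOWN


def decide(user_agent: str) -> str:
    return POLICY[classify(user_agent)]


print(decide("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(decide("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
print(decide("SomeNewBot/0.1"))
```

User-agent strings can be spoofed, so a table like this is a starting point for policy, not proof of who is actually making the request.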
Know what kind of site you're running. E-commerce accounts for 26.3% of AI bot traffic by industry, because product data — prices, descriptions, availability — is exactly what AI products want to read. Media and publishing sites see 7x more AI bot traffic than the average website. If your site is content-dense or data-rich, you're already getting disproportionate crawler attention. Being intentional about which crawlers you serve, and what content you serve them, is basic site hygiene at this scale.
Track the bot mix, not just total automated traffic. Bytespider's three-month surge is the clearest reminder that the AI crawler landscape shifts faster than most monitoring cycles update. A bot you've never heard of can go from negligible to top-five in a quarter. Watching your bot traffic by user-agent — not just aggregate automated vs human — gives you early warning when something new starts putting load on your infrastructure. It also tells you whether the crawlers you're seeing are the kind that might eventually send you referrals, or the kind that are just collecting training data and moving on.
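Tracking the mix can be as simple as counting known bot tokens per month in your access logs. The sketch below assumes combined log format (user agent as the last quoted field) and uses the crawler names mentioned in this article as its token list. The log lines are fabricated for illustration, with addresses from a documentation-reserved range.

```python
import re
from collections import Counter, defaultdict

# Assumption: combined log format, with the user agent as the final quoted field.
LOG_LINE = re.compile(
    r'\[(?P<day>\d{2})/(?P<mon>\w{3})/(?P<year>\d{4})[^\]]*\].*"(?P<ua>[^"]*)"\s*$'
)

BOT_TOKENS = [
    "Meta-ExternalAgent", "GPTBot", "Bytespider", "ClaudeBot",
    "PerplexityBot", "OAI-SearchBot", "Applebot-Extended", "Amazonbot",
]


def bot_mix(lines):
    """Share of known-bot requests per bot, grouped by month (e.g. '2026-May')."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        ua = match["ua"].lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in ua), None)
        if bot:
            counts[f"{match['year']}-{match['mon']}"][bot] += 1
    return {
        month: {bot: n / sum(c.values()) for bot, n in c.items()}
        for month, c in counts.items()
    }


def fake_line(date: str, bot: str) -> str:
    """Build a fabricated log line for the demo below."""
    return (
        f'203.0.113.7 - - [{date}:10:00:00 +0000] "GET /a HTTP/1.1" '
        f'200 512 "-" "Mozilla/5.0 (compatible; {bot}/1.0)"'
    )


sample = (
    [fake_line("12/Mar/2026", "GPTBot")] * 3
    + [fake_line("12/Mar/2026", "Bytespider")]
    + [fake_line("12/May/2026", "GPTBot")] * 2
    + [fake_line("12/May/2026", "Bytespider")] * 2
)

mix = bot_mix(sample)
for month, shares in mix.items():
    print(month, {bot: f"{share:.0%}" for bot, share in shares.items()})
```

Comparing one month's shares to the next is what surfaces a Bytespider-style climb early; a natural extension is to also count user-agents that match no known token, since that's where a new crawler would first appear.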