Bot Traffic · June 4, 2026

The Crawl-to-Refer Gap: What AI Bots Take vs. What They Send Back

One AI crawler reads nearly 24,000 of your pages for every visitor it sends back. We break down the 2026 bot-traffic data: who is crawling, why, and what site owners should actually do about it.

The most aggressive AI crawler on the web today reads 23,951 pages of your content for every single visitor it sends back. That is not a typo. Between January and March 2026, network-scale measurements put ClaudeBot's crawl-to-refer ratio at nearly 24,000 to one. GPTBot, the next-heaviest, sits at 1,276 to one. At the other end, DuckDuckBot sits at 1.5 to one, or roughly two visits for every three pages it reads. The spread between the best and worst behaviour is four orders of magnitude, and it changes what "letting the bots in" actually costs you.

Method

The figures here come from three public datasets published in early 2026: a network-scale analysis of crawler traffic across tens of millions of sites, an independent GEO data report compiling per-bot crawl-to-refer ratios, and a Q1 2026 web-traffic study. We cross-checked the share numbers against our own proxy logs. Our network skews heavily to traditional search bots — over 98% of identified crawler hits in the last 90 days were search engines, not AI — so for volume and ratio data at scale we lean on the public sources rather than our own small sample. All three datasets agree on direction, if not the second decimal.

AI crawlers are now a fifth of all bot traffic

Two years ago, AI crawlers were a rounding error. In Q1 2026 they account for 22% of all bot traffic, which makes them the second-largest category behind traditional search and the fastest-growing one. The leaderboard is volatile month to month. Across Q1, Googlebot led at 31.7% of crawler requests, followed by Meta-ExternalAgent at 16.3%, ClaudeBot at 11.8%, and GPTBot at 11.1%. By May, GPTBot had edged back ahead of ClaudeBot, and Bytespider had nearly doubled its share to crack the top four.

AI & search crawler share of bot requests (Q1 2026)

Googlebot still leads, but four AI user-agents now sit in the top group.

Source: Network-scale crawler analysis, Q1 2026

The takeaway is not the exact ranking, which will be different next month, but the shape: a handful of AI user-agents now generate request volumes that rival the search engines you have optimised for since 2010. If your logging and rules still treat "bot" as a single bucket, you are blind to the part of it that is growing fastest.
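As a rough illustration of splitting that single "bot" bucket, the sketch below labels raw User-Agent strings by crawler family. The token list uses only crawler names mentioned in this article; the function names and the grouping into "search" and "ai" are assumptions made for the example, not a standard taxonomy. User-Agent strings can be spoofed, so treat this as a labelling aid rather than verification.

```python
from collections import Counter

# Lower-cased substrings based on crawler names mentioned in this article.
# The grouping into "search" and "ai" is an illustrative assumption.
BOT_CATEGORIES = {
    "googlebot": "search",
    "duckduckbot": "search",
    "gptbot": "ai",
    "claudebot": "ai",
    "meta-externalagent": "ai",
    "bytespider": "ai",
    "perplexitybot": "ai",
}


def classify_user_agent(user_agent: str) -> str:
    """Return 'search', 'ai', or 'other' for one User-Agent string."""
    ua = user_agent.lower()
    for token, category in BOT_CATEGORIES.items():
        if token in ua:
            return category
    return "other"


def count_by_category(user_agents) -> Counter:
    """Tally requests per category from an iterable of User-Agent strings."""
    return Counter(classify_user_agent(ua) for ua in user_agents)
```

Feeding it the User-Agent column of an access log gives per-category request counts, which is the minimum needed to see AI crawler volume separately from search.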

Most of that crawling is for training, not search

Why are these bots reading so much? Overwhelmingly to build training corpora, not to answer live queries. Over the most recent six-month window, 82% of AI crawling was for model training, 15% for search indexing, and just 3% for real-time user actions. A year earlier the training share was 72%. The trend is moving away from "a user asked a question and the bot fetched your page" toward "a bot is bulk-reading the web to train the next model."

What AI crawling is for (last 6 months)

Training dominates; live search and user actions are a small minority.

Source: Network-scale crawler analysis, Jan-Jul 2025

This matters because the three purposes have very different value to you. A search-indexing crawl can surface your page in an AI answer with a citation and a click. A user-action fetch means a real person is, right now, looking at something that referenced you. A training crawl gives you neither — your content is absorbed into a model with no link, no attribution, and no traffic. When 82% of the crawling is training, most of the bandwidth you serve to AI bots returns nothing measurable.

The crawl-to-refer gap is the number that matters

Which brings us back to the opening figure. The crawl-to-refer ratio — pages crawled divided by visitors referred — is the cleanest single measure of whether a bot is a fair trade. The gap between operators is staggering.

Pages crawled per referral sent back (Jan-Mar 2026)

A four-order-of-magnitude gap between training-heavy crawlers and search.

Source: GEO data report compiling network-scale radar data

ClaudeBot's ~24,000:1 and GPTBot's ~1,276:1 reflect their training-heavy behaviour: they read enormous volumes and, because most of that reading never surfaces as a cited answer, they send back almost nothing. PerplexityBot (111:1) and Copilot (33:1) are search-and-answer products, so they crawl less per referral and return more. DuckDuckBot's 1.5:1 is what a traditional search relationship looks like — roughly a visit for a page.
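The "four orders of magnitude" claim can be checked with a few lines of arithmetic. The sketch below uses only the ratios quoted in this article; the helper name is hypothetical.

```python
import math

# Crawl-to-refer ratios quoted in this article (pages crawled per referral).
RATIOS = {
    "ClaudeBot": 23951,
    "GPTBot": 1276,
    "PerplexityBot": 111,
    "Copilot": 33,
    "DuckDuckBot": 1.5,
}


def orders_of_magnitude(worst: float, best: float) -> float:
    """Base-10 logarithm of how many times larger `worst` is than `best`."""
    return math.log10(worst / best)


spread = orders_of_magnitude(max(RATIOS.values()), min(RATIOS.values()))
print(f"Spread between heaviest and lightest crawler: 10^{spread:.1f}")
```

With these inputs the spread comes out slightly above four, which matches the rounded figure used in the text.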

These ratios are not static. One operator's ratio improved from nearly 287,000:1 in January 2025 to around 38,000:1 by mid-year. That is a real reduction, but at 38,000:1 it is still more than four orders of magnitude worse than DuckDuckBot's 1.5:1. Others moved the wrong way: one answer engine more than tripled its crawl load relative to referrals over the same period. Watch the direction, not just the snapshot.

What this means for site owners

First, stop treating AI bots as one decision. Blocking everything in robots.txt protects bandwidth but removes you from the AI answers that increasingly drive discovery, a real cost as more buyers start their research inside an assistant. Allowing everything means serving thousands, and in the worst case tens of thousands, of pages to training crawlers for every visitor you get back. The right move is per-bot: allow and optimise the search-and-answer crawlers that cite and refer (the low-ratio ones), and make a deliberate, eyes-open choice about the training-only crawlers at the high end.
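A per-bot policy is easy to get subtly wrong, so it can help to test a draft robots.txt before deploying it. The sketch below uses Python's standard `urllib.robotparser` for that check. The policy text is purely illustrative, not a recommendation about which crawlers to block, and `example.com` and the `allowed` helper are placeholders. robots.txt is advisory, so this confirms what a compliant crawler would conclude, not what every crawler will do.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy only: which bots you allow or block is your decision.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("PerplexityBot", "GPTBot", "SomeOtherBot"):
    print(bot, allowed(ROBOTS_TXT, bot, "https://example.com/pricing"))
```

Running the same check for each user-agent you care about, against the URLs that matter most, catches a misplaced rule before a crawler does.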

Second, measure the return, not just the volume. Server logs tell you who crawled. They do not tell you who sent a human back. Pair your crawl logs with referral data so you can compute your own crawl-to-refer ratio per bot. If a user-agent is reading thousands of pages a week and your referral data shows zero traffic from its parent product, you are subsidising a training run — and you should decide, deliberately, whether that is worth it.
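A minimal sketch of that calculation follows. It assumes you have already reduced your logs to two per-bot tallies: pages crawled (from server logs) and visits referred (from analytics, with referrer domains mapped to the bot's parent product). The function name, the dictionary shape, and the sample numbers are all hypothetical.

```python
from typing import Dict, Optional


def crawl_to_refer_ratios(
    crawls: Dict[str, int], referrals: Dict[str, int]
) -> Dict[str, Optional[float]]:
    """Pages crawled divided by visits referred, per bot.

    A value of None means the bot crawled but referred nobody, so the
    ratio is undefined (effectively unbounded).
    """
    ratios: Dict[str, Optional[float]] = {}
    for bot, pages in crawls.items():
        visits = referrals.get(bot, 0)
        ratios[bot] = pages / visits if visits else None
    return ratios


# Hypothetical weekly tallies, for illustration only.
weekly_crawls = {"bot_a": 12000, "bot_b": 300}
weekly_referrals = {"bot_b": 100}
print(crawl_to_refer_ratios(weekly_crawls, weekly_referrals))
```

Returning None instead of infinity keeps the "crawled, never referred" case explicit, which is the case this paragraph asks you to make a deliberate decision about.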

Third, serve crawlers content they can actually use. The bots that do refer can only cite what they can parse. Pages that render blank without JavaScript contribute nothing to an AI answer no matter how often they are fetched. Making your key pages fully readable — clean, server-rendered, semantically explicit HTML — is what converts a crawl into a citation, and a citation into the rare, valuable referral that the crawl-to-refer numbers show is so hard to earn.
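One rough way to spot pages that render blank without JavaScript is to count the words visible in the raw HTML response, before any script runs. The sketch below does that with the standard-library `html.parser`. The class and function names are hypothetical, and the idea that a near-zero count signals a client-rendered shell is a heuristic assumption, not a rule any crawler documents.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(raw_html: str) -> int:
    """Number of words readable in the HTML without executing JavaScript."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    return len(" ".join(parser.chunks).split())
```

Run against the raw HTML of your key pages (fetched with any HTTP client that does not execute scripts), a count near zero suggests the page is an empty shell to a non-rendering crawler.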

The bots are not going to crawl less. The only variable you control is whether all that reading turns into anything for you.