Bot Traffic · June 17, 2026

AI Crawlers Hit Your Origin 4,200 Times a Day — Most of It Bypasses the CDN

GPTBot sends a median 4,200 requests per site per day. ClaudeBot sends 1,800. With 70–100% of those being unique URLs, your CDN cache barely touches training-crawler traffic — and every miss hits your origin.

GPTBot sent a median 4,200 HTTP requests per site per day across a 30-day sample of 12 production sites analyzed in April 2026. ClaudeBot logged 1,800. PerplexityBot came in at 980. Google-Extended, the most restrained of the four, sent 540. None of those numbers appeared in Google Analytics. Every one of those requests hit an origin server.

Method

The per-bot crawl rates come from a 30-day server log study published in April 2026. It covered twelve production sites (four B2B SaaS, three e-commerce, three agency, and two publishers) ranging from 380 to 48,000 indexed pages, and tracked eleven distinct user-agent strings. Traffic-composition and purpose-classification breakdowns are from published edge network Radar data for May and June 2026. The cache-layer findings come from CDN engineering research conducted in collaboration with ETH Zurich and published in April 2026.

Crawl intensity by bot

Median AI Crawler Hits per Site per Day (30-Day Study, April 2026)
GPTBot generates 2.3× as many daily per-site requests as ClaudeBot and 7.8× as many as Google-Extended, despite similar overall market share. All four bots respected robots.txt throughout the study.

The per-site rate differences are wider than market-share percentages imply. GPTBot's 4,200 daily requests per site is 2.3× ClaudeBot's rate, 4.3× PerplexityBot's, and 7.8× Google-Extended's. The study found GPTBot revisits its high-traffic URL set on a 2.4-day average cycle. Across all four bots, robots.txt compliance was 100% in the study period — the crawl volumes are deliberate, not the result of runaway loops or misconfigurations.
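The ratios above follow directly from the reported medians. As a quick check, here is a minimal Python sketch of the arithmetic; the dictionary and function names are illustrative, and only the four medians come from the study.

```python
# Median requests per site per day, as reported in the 30-day study.
median_daily_hits = {
    "GPTBot": 4200,
    "ClaudeBot": 1800,
    "PerplexityBot": 980,
    "Google-Extended": 540,
}


def ratio_to_gptbot(hits: dict[str, int]) -> dict[str, float]:
    """Return GPTBot's rate divided by each other bot's rate, to one decimal."""
    base = hits["GPTBot"]
    return {bot: round(base / n, 1) for bot, n in hits.items() if bot != "GPTBot"}


ratios = ratio_to_gptbot(median_daily_hits)
print(ratios)  # {'ClaudeBot': 2.3, 'PerplexityBot': 4.3, 'Google-Extended': 7.8}
```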

Rates also vary by content type. Publisher and e-commerce sites in the study attracted higher GPTBot traffic than agency sites, consistent with training crawlers weighting their schedules toward content-dense domains. Sites with large content catalogues — thousands of articles, product pages, or documentation entries — are likely to sit above the 4,200-per-day median.

What those requests are actually for

AI Crawler Request Purpose (May 2026)
Only 9.3% of AI crawler requests are pure search retrieval that can generate referral traffic. The other 90.7% is training, mixed-purpose, or other crawling that consumes bandwidth without reliably sending visitors.

Only 9.3% of AI crawler requests were classified as pure search by edge network data in May 2026 — meaning real-time retrieval requests that answer live user queries and might generate a referral back to the source site. Training-purpose requests accounted for 51.8%, with a further 35.7% classified as mixed-purpose: training data collection plus retrieval indexing.

Applied to GPTBot's 4,200 daily hits per site, and assuming GPTBot follows the aggregate split, roughly 390 requests are attributable to search or retrieval products. The remaining 3,810 are training, mixed-purpose, or other crawls with no corresponding referral signal. This purpose split is why crawl-to-referral ratios diverge so sharply between bots: high crawl volume from a training-heavy crawler does not translate into AI search visibility. A bot appearing in your access logs is not confirmation that your content is being surfaced in AI-generated answers.
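To make the split concrete, the sketch below applies the May 2026 purpose shares to a daily request count. It assumes the aggregate shares hold for an individual bot, which the source data does not establish. The function name and the "other" bucket (whatever is left after the three published categories) are illustrative.

```python
# Purpose shares from the May 2026 edge network data, in tenths of a percent.
PURPOSE_SHARE_PERMILLE = {"search": 93, "training": 518, "mixed": 357}


def split_by_purpose(daily_hits: int) -> dict[str, int]:
    """Estimate requests per purpose, assuming the aggregate split applies."""
    estimate = {
        purpose: round(daily_hits * share / 1000)
        for purpose, share in PURPOSE_SHARE_PERMILLE.items()
    }
    # Anything the three published categories do not cover.
    estimate["other"] = daily_hits - sum(estimate.values())
    return estimate


gptbot = split_by_purpose(4200)
print(gptbot)
print("non-search:", 4200 - gptbot["search"])
```

At 4,200 requests per day the search bucket lands at about 390, and the three published categories together cover 96.8% of requests.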

The cache layer training crawlers bypass

HTML Web Traffic: Bots vs Humans (June 2026)
Edge network data published June 3, 2026 put automated traffic at 57.5% of HTML web requests, past the 50% mark about 18 months earlier than analysts had predicted. Agentic AI task delegation was the primary driver.

Edge network data published June 3, 2026 showed bots generating 57.5% of HTML web traffic, crossing 50% for the first time in internet history. Forecasts had previously placed that threshold in 2027. The proximate cause is agentic task delegation. A user comparison-shopping opens five browser tabs; an AI agent completing the same task can issue thousands of rapid page-load requests in seconds. At the scale of millions of users delegating tasks to AI assistants, the machine-to-human ratio flipped 18 months ahead of the forecast.

The cache impact is structural. CDN engineering research found that AI crawlers produce a 70–100% unique URL access ratio per crawl session, meaning 70 to 100% of requests are for URLs not previously seen in that session. Standard CDN caches are optimized for human browsing, which concentrates traffic on a predictable set of popular pages at high repetition rates. Training crawlers do not follow this pattern. They traverse internal link graphs depth-first, surface dormant URLs that have not been requested in weeks, and issue parameterized variants that bypass query-normalization rules. With so little repetition to exploit, most training-crawler requests miss the CDN cache layer entirely and land on origin servers as cold fetches.
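A unique URL access ratio is simple to compute from a session's request list. The sketch below is an illustration rather than the researchers' method: it counts distinct URLs over total requests, then repeats the count after a basic query normalization (sorted parameters, tracking parameters dropped). The `TRACKING_PARAMS` set and the sample session are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that do not change the response body.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize(url: str) -> str:
    """Sort query parameters and drop tracking ones so variants share a cache key."""
    parts = urlsplit(url)
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", urlencode(query), ""))


def unique_url_ratio(urls: list[str], normalized: bool = False) -> float:
    """Distinct URLs divided by total requests in one crawl session."""
    if not urls:
        return 0.0
    keys = [normalize(u) if normalized else u for u in urls]
    return len(set(keys)) / len(keys)


session = [
    "/docs/a",
    "/docs/b",
    "/docs/b?utm_source=x",
    "/docs/c?page=2&sort=asc",
    "/docs/c?sort=asc&page=2",
]
print(unique_url_ratio(session))                   # 1.0
print(unique_url_ratio(session, normalized=True))  # 0.6
```

In this toy session every raw URL is distinct, so a cache keyed on the raw URL would miss on all five requests. Normalization recovers two of them, which is the kind of gain that parameterized variants outside the normalization rules erase.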

At the site level, a medium-sized content site that compared raw server logs against its analytics in 2026 found AI crawlers accounting for 40% of total bandwidth — a figure invisible in filtered analytics dashboards.

What this means for site owners

Measure bot load from server logs, not analytics. Analytics platforms strip bot traffic before recording events, so dashboards will show no signal from any of the crawl volumes described above. Export raw access logs and segment traffic by user-agent string. Sites in high-interest verticals — technical documentation, financial data, healthcare content — tend to see above-average training-crawler rates because training data demand concentrates on structured, factual content.
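A minimal sketch of that log segmentation follows, assuming the Apache/Nginx "combined" log format. The bot token list, the shortened user-agent strings, and the sample lines are illustrative. Google-Extended is left out because Google documents it as a robots.txt control token without its own request user-agent string, and the study's attribution method for it is not stated.

```python
import re
from collections import Counter

# Assumed format: Apache/Nginx "combined" access log.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Illustrative substrings to look for in the User-Agent header.
BOT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def summarize(lines):
    """Count requests and response bytes per bot token; everything else is 'other'."""
    hits, bytes_sent = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not parse
        ua = m["ua"].lower()
        bot = next((t for t in BOT_TOKENS if t.lower() in ua), "other")
        hits[bot] += 1
        bytes_sent[bot] += 0 if m["bytes"] == "-" else int(m["bytes"])
    return hits, bytes_sent


def bot_bandwidth_share(bytes_sent) -> float:
    """Fraction of logged response bytes that went to the listed bots."""
    total = sum(bytes_sent.values())
    return 0.0 if total == 0 else (total - bytes_sent["other"]) / total


# Made-up sample lines using a documentation IP range.
sample = [
    '203.0.113.5 - - [12/Apr/2026:10:00:01 +0000] "GET /docs/a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [12/Apr/2026:10:00:02 +0000] "GET /docs/b HTTP/1.1" 200 3072 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [12/Apr/2026:10:00:03 +0000] "GET /docs/c HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [12/Apr/2026:10:00:04 +0000] "GET / HTTP/1.1" 200 10240 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
    "not a log line",
]
hits, bytes_sent = summarize(sample)
print(dict(hits))
print(bot_bandwidth_share(bytes_sent))  # 0.5
```

User-agent strings can be spoofed, so a production version would likely also verify source IPs against each operator's published ranges.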

Rate limiting can cut origin load without destroying AI search referral traffic. A content site that imposed per-user-agent, per-minute rate limits in 2026 reduced bot-driven bandwidth by 70% without a measurable drop in AI search referral traffic. The likely mechanism is simple: search-purpose crawlers (9.3% of AI requests) crawl more conservatively and rarely reach a moderate limit, while training-purpose crawlers (51.8% of AI requests) are responsible for the volume but produce no referrals, so throttling them should have little effect on AI search presence.
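One way to express a per-user-agent, per-minute limit is a fixed-window counter keyed by bot name. The sketch below is an assumption about how such a limit could work, not the configuration the site in question used. The class name and the limit values are hypothetical, and a real deployment would usually live in the CDN or reverse proxy and answer throttled requests with HTTP 429.

```python
class PerAgentRateLimiter:
    """Fixed-window limiter: at most `limit` requests per bot per clock minute."""

    def __init__(self, limits: dict[str, int]):
        self.limits = limits  # bot name -> allowed requests per minute
        self._state: dict[str, tuple[int, int]] = {}  # bot -> (window, count)

    def allow(self, bot: str, now: float) -> bool:
        """Return True if the request may proceed; `now` is a Unix timestamp."""
        limit = self.limits.get(bot)
        if limit is None:
            return True  # bots without a configured limit are not throttled
        window = int(now // 60)
        start, count = self._state.get(bot, (window, 0))
        if start != window:
            start, count = window, 0  # a new minute resets the counter
        if count >= limit:
            return False
        self._state[bot] = (start, count + 1)
        return True


# Hypothetical limits; tune against your own logs.
limiter = PerAgentRateLimiter({"GPTBot": 3})
decisions = [limiter.allow("GPTBot", t) for t in (0, 1, 2, 3, 60)]
print(decisions)  # [True, True, True, False, True]
```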

Burst patterns require different handling than sustained crawl rates. GPTBot was absent from several study sites for multi-week stretches before returning with concentrated activity. One site logged 152 requests in a single three-minute window after a dormant period, or roughly 51 requests per minute; for comparison, the 4,200-per-day median averages about 3 per minute. A flat per-minute rate cap handles steady-state volume but is bypassed by this burst pattern. Per-bot burst detection at the edge, using rules that trigger on spikes above a rolling average rather than enforcing steady-state caps, is more effective. Serving pre-rendered or cached responses to identified training crawlers eliminates dynamic rendering cost per request and satisfies the crawler without additional origin compute.
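Spike detection against a rolling average can be sketched as follows. This is a minimal illustration under stated assumptions, not an edge vendor's rule syntax: the class name, the 60-minute history, the 5× multiplier, and the baseline floor of 3 requests per minute are all hypothetical choices. Burst minutes are kept out of the history so a burst does not raise its own threshold.

```python
from collections import deque


class BurstDetector:
    """Flag a minute whose request count exceeds a multiple of the rolling mean."""

    def __init__(self, history_minutes: int = 60, multiplier: float = 5.0,
                 min_baseline: float = 3.0):
        self.history: deque[int] = deque(maxlen=history_minutes)
        self.multiplier = multiplier
        # Floor for the baseline so a dormant period (all zeros) still
        # yields a usable threshold.
        self.min_baseline = min_baseline

    def observe(self, count_this_minute: int) -> bool:
        """Record one minute's request count for a bot; return True on a burst."""
        mean = sum(self.history) / len(self.history) if self.history else 0.0
        baseline = max(mean, self.min_baseline)
        is_burst = count_this_minute > self.multiplier * baseline
        if not is_burst:
            self.history.append(count_this_minute)
        return is_burst


detector = BurstDetector()
# Ten dormant minutes, a three-minute burst totalling 152 requests, then calm.
per_minute = [0] * 10 + [50, 51, 51] + [3]
flags = [detector.observe(c) for c in per_minute]
print(flags[10:])  # [True, True, True, False]
```

With these settings the threshold after a dormant period is 15 requests per minute, so all three burst minutes are flagged while a return to about 3 per minute is not.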

Sources

  1. Agentic Crawler Behavior: 30-Day Site Log Study 2026
  2. Bots Have Now Passed Human Traffic Online, Cloudflare Boss Laments
  3. Cloudflare and ETH Zurich Outline Approaches for AI-Driven Cache Optimization
  4. AI Crawler and Bot Traffic Statistics 2026: Key Data Reference