Bot Traffic · July 5, 2026

Why would the same AI bot crawl your site once a month — then spike to 39,000 hits a minute?

Most site owners treat all AI crawlers the same way. But training bots and real-time fetchers behave nothing alike — and confusing them is costing you visibility in AI search.

According to Thales's 2026 Bad Bot Report, 53% of all web traffic in 2025 was automated, and that figure has been climbing every year. But the aggregate hides something important: the AI bots behind this growth aren't doing the same thing, hitting at the same frequency, or responding to the same server behaviour. Some of them trickle in once a month. Others arrive in waves that can hit a single site 39,000 times per minute.

So which kind is hitting your site? And are you actually set up to serve the ones that could send you real referral traffic — or are you unintentionally turning them away?

Where does the data come from?

The numbers here draw on three independent research programs. Fastly analysed 6.5 trillion monthly requests across 130,000+ web applications for Q2 2025, breaking down AI crawler traffic by operator and purpose. Thales's 2026 Bad Bot Report covers a full calendar year of domain-level bot data. SE Ranking's AI traffic research study tracked actual referral clicks from AI platforms to publisher sites globally. These programs measured different slices of the internet — which is why their total bot-share figures differ — but their directional findings on AI crawler behaviour land in the same place.

Which operators are doing the most crawling?

Fastly's Q2 2025 research gives the clearest breakdown by company:

Figure: AI Crawler Traffic Share by Operator (Q2 2025). Meta's training bots alone accounted for more than half of all AI crawler traffic across Fastly's network in mid-2025, ahead of Google and the maker of ChatGPT combined. Source: Fastly Q2 2025 Threat Research.

Meta's crawlers accounted for more than half of all AI crawler traffic observed across Fastly's network during that period. That's a striking figure for a company that doesn't run a conventional search engine. Meta crawls at scale to build training datasets, not to send traffic back to publishers. Much of it runs through infrastructure-level crawlers most site owners have never explicitly dealt with — not the consumer-facing product names you might recognise, but background systems operating at massive scale.

Google's crawlers came in at 23% of AI bot traffic. That's a mixed picture. Googlebot still sends referral traffic back in meaningful volumes, because its crawl exists to power a search result that a user eventually clicks through. Google-Extended is different: it is a robots.txt control token that governs whether your content is used for model training, and that use sends nothing back to publishers. It has no user-agent string of its own, so the fetches it governs look like any other Google crawl in server logs, which makes it hard to tell which Google bot visit is actually worth optimising for.

GPTBot — operated by the maker of ChatGPT — sits at 20% of AI crawler traffic by Fastly's measure. Here's the thing that trips people up: GPTBot crawls the web for training data, while the real-time web-fetch mode that powers live ChatGPT queries is an entirely separate system. Same brand, completely different behaviour, completely different implications for what you should actually do about it.
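To make that split visible in your own logs, one option is to label each request by purpose instead of by brand. The sketch below is illustrative only. The token list reflects crawler names that operators have published (ChatGPT-User and OAI-SearchBot alongside GPTBot, for example), most of which go beyond what this article names. The purpose labels and the function name classify_user_agent are assumptions made for the example. User-Agent headers can be spoofed, so treat the result as a first pass rather than verification.

```python
# Purpose labels keyed by a lowercase substring of the User-Agent header.
# ASSUMPTION: this token list is illustrative, not an exhaustive or
# authoritative registry. Check each operator's current documentation.
PURPOSE_BY_TOKEN = {
    "gptbot": "training",
    "chatgpt-user": "real-time fetch",
    "oai-searchbot": "search index",
    "claudebot": "training",
    "meta-externalagent": "training",
    "meta-externalfetcher": "real-time fetch",
    "ccbot": "training",
    "perplexity-user": "real-time fetch",
    "perplexitybot": "search index",
    "googlebot": "search index",
}


def classify_user_agent(user_agent: str) -> str:
    """Return a purpose label for a raw User-Agent string, or "unknown"."""
    ua = user_agent.lower()
    for token, purpose in PURPOSE_BY_TOKEN.items():
        if token in ua:
            return purpose
    return "unknown"


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
    print(classify_user_agent(sample))
```

Grouping log lines by these labels, rather than by company, is what lets you apply a different policy to a training sweep than to a fetch a user is waiting on.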

What's the real difference between training bots and real-time fetchers?

This is where the conversation gets practically useful. These two categories hit your site in fundamentally different ways.

Training crawlers are doing a slow, systematic sweep of the internet to build model datasets. They visit a URL, pull the content, and might not return for weeks or months. Meta's majority share is almost entirely this pattern. The per-page visit frequency is low — your access logs might show one visit from a training bot per page per month. These crawlers tend to go broad, including content that hasn't changed recently, because coverage and semantic density matter more than freshness for building a training set.

Real-time fetchers are a completely different animal. When someone asks an AI assistant with web browsing enabled a question, the assistant fires live requests to relevant URLs right then, while the user waits for a response. Because these systems often query multiple sources simultaneously to triangulate an answer, the same popular URL can see enormous burst volumes in a short window. Fastly's research recorded peaks of 39,000 requests per minute to individual sites. That's not an attack or a misconfiguration. It's what happens when a URL becomes a reliable answer source for a widely used AI assistant.
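The difference in cadence is easy to measure once your log lines are parsed. The following is a minimal sketch, assuming you already have (timestamp, bot name) pairs; the function name peak_per_minute and the input shape are hypothetical. It buckets requests by clock minute, which is simpler than a sliding window and can understate a burst that straddles two minutes.

```python
from collections import Counter
from datetime import datetime


def peak_per_minute(events):
    """Return {bot_name: highest request count seen in any one clock minute}.

    `events` is an iterable of (datetime, bot_name) pairs. How you parse
    them out of your access log is left to you (assumption).
    """
    per_minute = Counter()
    for timestamp, bot in events:
        minute = timestamp.replace(second=0, microsecond=0)
        per_minute[(bot, minute)] += 1

    peaks = {}
    for (bot, _minute), count in per_minute.items():
        peaks[bot] = max(peaks.get(bot, 0), count)
    return peaks


if __name__ == "__main__":
    sample = [
        (datetime(2026, 5, 1, 12, 0, 5), "GPTBot"),
        (datetime(2026, 5, 1, 12, 0, 10), "ChatGPT-User"),
        (datetime(2026, 5, 1, 12, 0, 20), "ChatGPT-User"),
    ]
    print(peak_per_minute(sample))
```

A bot whose peak stays at one or two is behaving like a training sweep. A bot whose peak is orders of magnitude above its daily average is behaving like a real-time fetcher.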

CDN-level verified bot traffic data from mid-2026 confirms how the two purposes split across all AI crawler requests:

Figure: AI Crawler Requests by Purpose (May 2026). Training still dominates by volume, but mixed-purpose and retrieval requests are growing fastest, driven by real-time AI assistant queries. Source: CDN Verified Bot Traffic Analysis.

Training crawlers still account for the biggest share of AI crawl volume. But the retrieval and user-triggered fetch categories are growing fast — and when they arrive, they arrive at a completely different cadence.

Are publishers actually blocking the right bots?

Analysis of robots.txt Disallow rules across a large site sample in Q1 2026 found GPTBot was the most commonly blocked AI crawler, appearing in 5.52% of all Disallow entries. Common Crawl's crawler came second at 5.08%, and ClaudeBot third at 4.88%.

Does that strategy actually make sense? It depends entirely on what you're trying to achieve. If you want to prevent your content from being used in model training, blocking training crawlers is reasonable. But those blocks don't stop the AI assistant products from the same companies from fetching and citing your content in response to live user queries — because live queries use a different fetch mechanism altogether.
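Because robots.txt rules are matched per user-agent token, you can check what a given rule set actually covers before you deploy it. The sketch below uses Python's standard urllib.robotparser on a hypothetical robots.txt that blocks a training crawler and explicitly allows a user-triggered fetcher. The ChatGPT-User token, the example URL, and the helper name allowed are assumptions for illustration. robots.txt is advisory, and whether any particular fetcher honours it is up to its operator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block the training crawler, allow the live fetcher.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed(agent: str, url: str = "https://example.com/article") -> bool:
    """Report whether ROBOTS_TXT permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ChatGPT-User", "SomeOtherBot"):
        print(agent, allowed(agent))
```

Note that an agent with no matching group and no wildcard group is allowed by default, so a block on one token says nothing about the others.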

SE Ranking's referral traffic research found that ChatGPT accounts for roughly 4 in every 5 AI-driven clicks to websites. Perplexity contributes around 15%. Those clicks flow from real-time retrieval fetches, not training crawls. If you've blocked all bots from a given company without distinguishing between its training crawler and its live-query fetcher, you may have cut off the mechanism that was actually sending you traffic.

So what should you actually change?

Think in terms of crawler purpose, not brand names. A blanket block on "AI crawlers" treats a once-a-month training sweep identically to a real-time retrieval fetch that a user is waiting on. Those two things need different responses. Training crawlers benefit from semantically rich, pre-rendered HTML that makes it easy for a model to understand your content. Real-time fetchers need low-latency responses — if your site is slow or renders blank without JavaScript, you're being skipped in favour of faster competitors.

Understand what each bot type can actually read. AI platforms can only cite what they can successfully fetch and parse. If a training crawler arrives at a JavaScript shell that requires a browser to render real content, that visit produces nothing useful — your content doesn't make it into the dataset. If a real-time fetcher gets a slow or empty response, it moves on. Static, semantically rich HTML is the format both categories handle best.
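A rough way to see what a non-rendering bot sees is to count the words present in the raw HTML, ignoring anything inside script, style, or template tags. This is a minimal sketch using Python's standard html.parser. The function name static_word_count is hypothetical, and word count is only a crude proxy for "readable content", not a measure any AI platform is known to use.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts whitespace-separated words outside script/style/template."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def static_word_count(html: str) -> int:
    """Return the number of words available without executing JavaScript."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.words


if __name__ == "__main__":
    shell = '<html><body><div id="root"></div><script>render()</script></body></html>'
    print(static_word_count(shell))
```

If the count for a page's raw response is near zero while the rendered page is full of text, that page is effectively a JavaScript shell to any bot that doesn't run a browser.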

The headline figure is real, but it's not the number to optimise for. 53% bot traffic sounds alarming or impressive depending on who's reading it. What matters is identifying which AI bots you want to serve, what they're trying to do when they arrive, and whether your server response gives them something useful to work with. Training bots arriving every few weeks need comprehensive, readable content they can extract meaning from. Real-time fetchers arriving in bursts need fast, structured responses they can return to a waiting user in seconds. Treating both the same way optimises for neither, and quietly hands AI search visibility to competitors who do make the distinction.
