What is AI agent traffic?
AI agent traffic is a new traffic class: training crawlers, retrieval crawlers, agentic workflows, and spoofed scrapers. Here is how it differs from classic bots, and what publishers can do about it.
What is AI agent traffic?
AI agent traffic is automated web traffic generated by software that acts on behalf of a large language model, a retrieval-augmented generation system, or an autonomous task-running agent. It is not the same thing as classic bot traffic. A classic scraper extracts data on a schedule for its operator. An AI agent fetches a page because a model — or a user prompting a model — decided, in that moment, that the content was needed to answer a question, complete a purchase, or fill a context window.
Four kinds of software produce it. Training crawlers like GPTBot, ClaudeBot, Google-Extended, and Applebot-Extended sweep the web to build training corpora. Retrieval crawlers like PerplexityBot and OAI-SearchBot fetch pages in real time to answer a user's question. Agent-on-behalf-of-user traffic comes from ChatGPT agents, Anthropic's Claude for browsing, and emerging agentic frameworks that submit forms and click links for a human. And a long tail of unlabeled scrapers — cohere-ai, CCBot, Meta-ExternalAgent, and a growing list of startups crawling behind residential proxies — sits in between.
It collapses the old distinction between bot and visitor. The request arrives from an automated client. The intent behind it originated, seconds earlier, with a person asking a chatbot a question.
Why AI agent traffic matters right now
Volume is the short answer. HUMAN Security's 2025 Intelligence report measured AI agent traffic growing 7,851% over the course of 2025. Cloudflare Radar reports that 39% of the top one million websites were accessed by AI bots by early 2026, while only 2.98% of those sites actively block them. Tollbit's Q4 2025 State of the Bots report put the bot-to-human ratio on publisher sites at 1 AI bot visit for every 31 human visits, up from 1 in 50 two quarters earlier.
Bandwidth is the second answer. Cloudflare measured Anthropic's crawl-to-referral ratio at roughly 500,000 to 1 through 2025 — half a million pages fetched for every visitor sent back — and AI training crawl traffic rose 65% in six months. Every page hit is origin cost. Every archive sweep is a cache miss.
The third answer is commercial. AI agents are the new intermediary between your content and the reader. A publisher's article read inside ChatGPT, summarized by Perplexity, or cited in an AI Overview produces no ad impression, no subscription prompt, no direct reader relationship. The traffic is real. The monetization path is not.
Classic bot management was built for a different problem. Blocking scrapers and blocking AI agents are not the same decision, and treating them the same way either cuts you off from a search channel you want to be in, or lets a training crawler through that you would rather charge.
Types of AI agent traffic
Four distinct types show up in server logs, each with different commercial implications.
**Training crawlers.** Operated by model companies to build training datasets. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended, Bytespider (ByteDance), Applebot-Extended, and CCBot (Common Crawl, whose data feeds many smaller model companies) are the main examples. These crawlers sweep broadly and revisit frequently. Their requests are the clearest candidates for licensing: the operator has a budget for data acquisition, and the legal pressure around unlicensed training is rising.
**Retrieval and grounding crawlers.** Fetch pages at query time to ground a model's answer. PerplexityBot, OAI-SearchBot (OpenAI's search crawler, separate from GPTBot), and ChatGPT-User are the named ones. Tollbit measured ChatGPT-User specifically accessing 42% of sites that had explicitly blocked it. These crawlers are closer to search indexers than training crawlers, but they do not send referral traffic in the way Googlebot does.
**Agentic traffic.** Generated by AI agents acting for a specific human user. A ChatGPT agent checking flight prices. An Anthropic Claude agent researching a paper. Browser-use and similar frameworks clicking through workflows on a user's behalf. The request comes from a headless browser running on cloud infrastructure, often routed through residential proxies, with behavior that looks like a human until it doesn't.
**Unlabeled and spoofed crawlers.** The largest and messiest category. cohere-ai, Meta-ExternalAgent, and a long list of smaller operators. Commercial scraping services (BrightData, Oxylabs, ScraperAPI) selling access to rotating residential IP pools. Training and retrieval crawlers that decline to identify themselves. DataDome's 2024 report found 95% of advanced bot attacks go undetected by passive inspection, and 83% of simple curl-based bots pass unnoticed. Unlabeled traffic is where the licensing revenue leaks out.
How AI agent traffic works
Mechanically, AI agent traffic is HTTP. Each request has a user agent, a TLS handshake, a set of HTTP/2 settings, and a body. What separates AI agent traffic from browser traffic is the software stack making the request and the intent behind it.
Training crawlers are the simplest. A scheduler runs, a fetcher opens an HTTP connection, a parser extracts text and links, the results go into a dataset. GPTBot and ClaudeBot publish IP ranges and respect robots.txt in most cases. Their footprint in logs is predictable: a consistent user agent, a consistent TLS fingerprint, a steady request cadence.
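That predictability makes labeled training crawlers verifiable. A minimal sketch, assuming the operator publishes its crawl IP ranges (OpenAI and Anthropic both publish IP lists for their crawlers); the CIDRs below are documentation placeholders, not real ranges:

```python
import ipaddress

# Placeholder CIDRs for illustration -- in practice, fetch the operator's
# current published ranges; these two blocks are documentation networks.
GPTBOT_RANGES = [ipaddress.ip_network(c) for c in ("192.0.2.0/24", "198.51.100.0/24")]

def is_verified_gptbot(user_agent: str, remote_ip: str) -> bool:
    """A request claiming GPTBot must also arrive from a published range."""
    if "GPTBot" not in user_agent:
        return False
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in net for net in GPTBOT_RANGES)

# A GPTBot claim from outside the published ranges is treated as spoofed.
assert is_verified_gptbot("Mozilla/5.0 (compatible; GPTBot/1.1)", "192.0.2.10")
assert not is_verified_gptbot("Mozilla/5.0 (compatible; GPTBot/1.1)", "203.0.113.7")
```

The same pattern applies to any crawler that publishes its ranges: the user agent is the claim, the source IP is the proof.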
Retrieval crawlers are stateful. When a user asks a chatbot a question, the model decides which pages to fetch. PerplexityBot or OAI-SearchBot opens connections to those specific URLs, pulls the content, and hands it back to the model within a few seconds. The request pattern is bursty — many pages from different domains fetched in parallel — and driven by query volume, not by a crawl schedule.
Agentic traffic is the hardest to characterize. An AI agent running a workflow may use a patched Chromium build, a headless browser, or a direct HTTP client depending on whether the task requires JavaScript execution. Many route through residential proxies to avoid rate-limiting. Some use curl-impersonate, uTLS, or similar libraries to reproduce a real browser's TLS handshake byte-for-byte. The user agent string is whatever the operator chose to send.
Spoofing is the dominant tactic at the long-tail end. A scraper rotates through thousands of residential IPs, swaps user agents per request, and uses a TLS library that reproduces Chrome's JA3/JA4 fingerprint. On the surface, the traffic is indistinguishable from a human visitor. Only when you compare signals across layers — TLS handshake, HTTP/2 SETTINGS frame, behavioral pattern, request rate — does the mismatch appear.
How to identify AI agent traffic
User agents are the starting point, not the answer. GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, PerplexityBot, cohere-ai, CCBot, Meta-ExternalAgent — the major operators publish their strings. Matching those in your logs identifies the compliant traffic, which is the traffic that was already least likely to cause a problem.
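Matching those published strings is a one-pass scan over access-log user agents. A sketch, using the crawler names above as the token list; a real deployment would keep this list current:

```python
import re
from collections import Counter

# Self-identifying AI crawler tokens named in this article; extend as needed.
AI_CRAWLER_TOKENS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "PerplexityBot", "OAI-SearchBot", "ChatGPT-User", "cohere-ai",
    "CCBot", "Meta-ExternalAgent", "Bytespider",
]
PATTERN = re.compile("|".join(re.escape(t) for t in AI_CRAWLER_TOKENS))

def tally_ai_crawlers(user_agents):
    """Count hits per self-identified AI crawler in a stream of UA strings."""
    counts = Counter()
    for ua in user_agents:
        match = PATTERN.search(ua)
        if match:
            counts[match.group(0)] += 1
    return counts
```

Anything this scan catches is, by definition, the traffic that chose to announce itself.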
For the rest, you need request-level signals the client does not fully control.
**TLS fingerprinting.** The client hello in a TLS handshake exposes the cipher suites, extensions, and extension order of the underlying library. Python's requests produces one signature, curl another, real Chrome another. JA4 (and the companion fingerprints in the JA4+ suite, such as JA4S and JA4H) hashes those signals into a fingerprint that is resistant to extension randomization. Cloudflare tracks roughly 15 million unique JA4 fingerprints across its edge each day. A Python TLS stack claiming to be Chrome is caught before the HTTP body is sent.
**HTTP/2 settings.** Chrome announces a connection window of roughly 15 MB via the WINDOW_UPDATE it sends alongside its initial SETTINGS frame. Firefox announces roughly 12.5 MB. Most HTTP libraries send no increase at all. The pseudo-header order (`:method`, `:authority`, `:scheme`, `:path`) is fixed per browser and does not match what libraries send by default.
**Behavioral patterns.** Request cadence, path patterns, revisit intervals, and session coherence. A real user reading an article dwells. A training crawler moves at a consistent rate. A spoofed scraper bursts through a hundred pages in a minute.
**Cross-layer consistency.** The decisive check. A request that claims to be Chrome via user agent, carries a TLS fingerprint from curl-impersonate, and has HTTP/2 settings from a Go library is an AI agent that lied twice. Any one signal is spoofable. The combination is not, because the spoofing libraries do not cover every layer consistently.
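The cross-layer check can be sketched as a comparison against per-browser baselines. The JA4 prefixes and window sizes below are illustrative placeholders, not authoritative values; a real deployment compares against a maintained fingerprint database:

```python
# Hypothetical per-browser baselines -- placeholder values, not real
# fingerprints. Production systems use large JA4/HTTP/2 corpora instead.
EXPECTED = {
    "Chrome":  {"ja4_prefix": "t13d1516", "h2_window": 15 * 1024 * 1024},
    "Firefox": {"ja4_prefix": "t13d1715", "h2_window": 12_500_000},
}

def consistency_flags(claimed_browser, ja4, h2_window):
    """Return the layers where a request contradicts its own claim."""
    baseline = EXPECTED.get(claimed_browser)
    if baseline is None:
        return ["unknown-claim"]
    flags = []
    if not ja4.startswith(baseline["ja4_prefix"]):
        flags.append("tls")       # TLS stack does not match the claimed browser
    if h2_window != baseline["h2_window"]:
        flags.append("http2")     # HTTP/2 settings do not match either
    return flags
```

A request that returns two flags lied twice; an empty list means the layers agree with the claim, which is necessary but not sufficient to be human.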
How to respond to AI agent traffic
You have three responses available once the traffic is identified: block, verify, or monetize. Make the decision per agent, not once for all automated traffic.
**Block.** For training crawlers you have not licensed. For scrapers that ignore robots.txt. For spoofed traffic that fails cross-layer consistency. Block at the edge so the origin never sees the request and your bandwidth bill never grows because of it.
**Verify and allow.** For search indexers you want to appear in. For partner agents. For AI-on-behalf-of-user traffic you want through but want to audit. Pass the request with a signed trust stamp, log the agent's identity, and monitor cumulative volume per operator. Googlebot, Bingbot, and verified AI-search user agents belong on an allowlist by default — Cloudflare's 2.98% block rate shows most site operators are not cutting themselves off from search.
**Monetize.** For training crawlers that will pay a licensing fee. For retrieval crawlers whose operators have revenue to share. The commercial conversation is live. OpenAI, Perplexity, and Google have signed licensing deals with publishers. Charging an unlicensed crawler per request, per article, or per bulk license is the third lever — a lever classic bot management did not have.
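The three responses reduce to a per-agent policy table. A sketch; the agent names come from this article, but the assignments below are one hypothetical publisher's choices, not recommendations:

```python
from enum import Enum

class Action(Enum):
    BLOCK = "block"
    VERIFY = "verify"
    MONETIZE = "monetize"

# Hypothetical policy table: one decision per agent, not one for all bots.
POLICY = {
    "GPTBot": Action.MONETIZE,         # training crawler with a licensing budget
    "PerplexityBot": Action.MONETIZE,  # retrieval crawler, revenue conversation live
    "OAI-SearchBot": Action.VERIFY,    # search surface worth appearing in
    "Googlebot": Action.VERIFY,        # classic search indexer, allow and audit
    "Bytespider": Action.BLOCK,        # unlicensed training crawler
}

def decide(agent: str) -> Action:
    """Unknown or spoofed agents default to block at the edge."""
    return POLICY.get(agent, Action.BLOCK)
```

The default matters as much as the table: anything unidentified, including traffic that failed cross-layer checks, falls through to block.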
robots.txt alone will not execute any of these. Tollbit measured 30% of AI bot scrapes in Q4 2025 ignoring explicit robots.txt directives. The file is a courtesy notice. Enforcement lives at the edge, in a layer that inspects the request before it reaches the origin.
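Stating intent in robots.txt still matters as the notice layer, even when enforcement happens elsewhere. A minimal file expressing per-crawler policy might look like this; the per-crawler choices are illustrative, not a recommendation:

```txt
# Disallow training crawlers you have not licensed
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow retrieval/search crawlers you want to appear in
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```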
Key takeaways
AI agent traffic is not classic bot traffic. It is a new category that includes training crawlers, retrieval crawlers, agentic workflows, and a long tail of unlabeled scrapers. The volume is already large and growing fast — HUMAN Security measured 7,851% growth in 2025, Cloudflare sees 39% of top sites accessed by AI bots, and Tollbit sees a 1-in-31 AI-to-human visit ratio on publisher content.
The response is not a single setting. Each class of agent calls for a different decision: block training crawlers you have not licensed, verify and allow search indexers and partner agents, monetize the operators who will pay, and drop spoofed traffic that fails cross-layer checks. robots.txt is where the conversation starts. Enforcement happens at the edge.
Centinel identifies 1,600+ AI agent fingerprints in real time, applies TLS and HTTP/2 signal checks that survive user agent spoofing, and runs block, verify, or monetize decisions per agent in under 2ms. That is the layer between you and a traffic class that grew 7,851% without asking permission.
See what's crawling your site right now
Run a free audit and get a detailed report of which AI crawlers are accessing your content. Results in 48 hours.
Get your free audit