Practical guides·8 min read

How to verify AI agents

A publisher operator’s guide to telling legitimate AI agents from spoofed ones. IP ranges, reverse-DNS, TLS fingerprints, request signing, and the policies that sit on top.

What is AI agent verification?

AI agent verification is the process of confirming that a request claiming to come from a named AI crawler — GPTBot, ClaudeBot, PerplexityBot, Googlebot, Applebot — actually came from the organization that operates it. It is a different problem from bot detection. Bot detection answers whether the client is automated. Verification answers whether the automation is who it says it is.

The problem shows up as soon as you look at your logs. A user agent string is a header the client chose to send. Any scraper can set User-Agent to GPTBot. The line in your access log is not evidence of identity; it is evidence of a claim. Verification is the gap between the claim and the identity.

Three kinds of traffic need the distinction. Search indexers that send referral traffic and belong on an allowlist by default. AI crawlers you have licensed or chosen to permit. And agents acting for a real human user, where the operator matters more than the fact of automation. The wrong decision costs something — lost search ranking, missed licensing revenue, a blocked customer mid-purchase.

Why AI agent verification matters right now

Volume forces the decision. Cloudflare Radar reports that 39% of the top one million websites were accessed by AI bots by early 2026, and only 2.98% of those sites actively block them. HUMAN Security measured AI agent traffic growing 7,851% across 2025. When AI traffic is a rounding error, a blanket block or blanket allow is cheap. When it is a third of your requests, each side of that decision is expensive.

Compliance with stated rules has broken down. Tollbit's Q4 2025 State of the Bots report measured ChatGPT-User accessing 42% of sites that had explicitly blocked it in robots.txt, and 30% of all AI bot scrapes ignoring robots.txt permissions outright. robots.txt is not a verification mechanism and never was. It is a request.

Spoofing is cheap. DataDome's 2024 Global Bot Security Report found 95% of advanced bot attacks pass passive inspection, and 83% of simple curl-based bots pass unnoticed. Pretending to be a named AI crawler is a weekend project: rotating residential IPs, a user agent copied from the vendor docs, a TLS library that reproduces Chrome handshakes.

The commercial stakes flipped last year. OpenAI, Perplexity, and Google have signed licensing deals with publishers. That revenue depends on a platform being able to tell the licensed agent from the scraper imitating it. Verification is now the meter.

Types of verification signals

Verification signals fall into four tiers, ranked by how expensive they are to spoof.

**User-agent claim.** Free. Any client can send any user agent. Matching a published AI bot string in your logs identifies the traffic that was already trying to be compliant, and identifies nothing else.

**IP range and reverse-DNS.** Cheap to check, hard to spoof at scale. Googlebot publishes official IP ranges and supports reverse-DNS plus forward-DNS verification — you look up the PTR record on the source IP, then the A or AAAA record on the resulting hostname, and confirm the hostname ends in googlebot.com. Bingbot follows the same pattern under search.msn.com. Applebot publishes its ranges. OpenAI publishes GPTBot, OAI-SearchBot, and ChatGPT-User IP ranges on platform.openai.com. Anthropic publishes ClaudeBot and Claude-User ranges. PerplexityBot publication has been inconsistent. The cost is maintenance: the directories update, and the list you checked last quarter is already stale.
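The reverse-plus-forward DNS check described above can be sketched in a few lines. This is a minimal illustration of the pattern Google and Bing document; the lookup functions are injectable here so the logic can be exercised without live DNS, and the suffixes shown are the ones each operator publishes.

```python
import socket

def verify_by_dns(ip, allowed_suffixes, ptr_lookup=None, fwd_lookup=None):
    """Reverse-DNS plus forward-DNS verification.

    1. PTR lookup on the source IP.
    2. Confirm the hostname ends in an operator-published suffix.
    3. Forward lookup on that hostname; confirm it resolves back to the IP.
    """
    ptr_lookup = ptr_lookup or (lambda a: socket.gethostbyaddr(a)[0])
    fwd_lookup = fwd_lookup or (lambda h: socket.gethostbyname_ex(h)[2])
    try:
        hostname = ptr_lookup(ip)                     # step 1: PTR record
    except OSError:
        return False
    if not hostname.rstrip(".").endswith(tuple(allowed_suffixes)):
        return False                                  # step 2: suffix check
    try:
        addresses = fwd_lookup(hostname)              # step 3: forward A/AAAA
    except OSError:
        return False
    return ip in addresses

# Usage against live DNS, e.g. for a request claiming to be Googlebot:
# verify_by_dns("66.249.66.1", [".googlebot.com", ".google.com"])
```

The forward-confirm step is what stops an attacker who controls reverse DNS for their own IP block from simply naming a PTR record crawl.googlebot.com.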

**Cryptographic request signing.** Expensive to deploy, essentially impossible to spoof if the private key stays private. The IETF HTTP Message Signatures standard (RFC 9421, formerly draft-ietf-httpbis-message-signatures) gives a standard way for a client to sign a request with its identity. No major AI vendor mandates it yet. Cloudflare's Web Bot Auth proposal and Anthropic's experiments with signed agent passes are the early moves. Useful to watch, not yet useful to rely on.

**Behavioral and fingerprint signals.** Expensive for the attacker, cheap for the defender. JA4 TLS fingerprints hash the cipher suites, extensions, and extension order a client sends in its ClientHello. Cloudflare tracks roughly 15 million unique JA4s across its edge per day. HTTP/2 SETTINGS values and pseudo-header ordering differ by browser family in ways spoofing libraries rarely copy. Request cadence, revisit intervals, and path patterns separate a scheduled crawler from a burst of rotating-proxy traffic. Fingerprints catch the long tail that IP ranges and signatures do not.
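The fingerprint idea can be sketched without the full spec. The function below is a simplified fingerprint in the spirit of JA4, not the exact format: a readable prefix encoding protocol facts plus truncated hashes of the cipher list and the extension order. Real JA4 additionally strips GREASE values, sorts ciphers, and folds signature algorithms into the extension hash.

```python
import hashlib

def ja4_like(tls_version: str, sni: bool, ciphers: list, extensions: list, alpn: str) -> str:
    """Simplified JA4-style fingerprint: human-readable prefix plus
    truncated hashes of what the client offered in its ClientHello."""
    prefix = (
        f"t{tls_version}"                 # TLS over TCP, version
        f"{'d' if sni else 'i'}"          # SNI present (domain) or not (IP)
        f"{len(ciphers):02d}{len(extensions):02d}"
        f"{alpn[:1]}{alpn[-1:]}"          # first and last char of ALPN value
    )
    cipher_hash = hashlib.sha256(",".join(sorted(ciphers)).encode()).hexdigest()[:12]
    ext_hash = hashlib.sha256(",".join(extensions).encode()).hexdigest()[:12]
    return f"{prefix}_{cipher_hash}_{ext_hash}"
```

Because the extension hash preserves order, two clients offering the same extensions in a different sequence produce different fingerprints, which is exactly the property that catches spoofing libraries that copy a browser's cipher list but not its extension ordering.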

How AI agent verification works

A production verification pipeline takes three layers of signal per request and produces an identity verdict before the origin sees the body.

The first layer is static identity. The edge compares the user agent and source IP against a maintained directory of AI crawler identities. If the source IP sits inside Anthropic's published ClaudeBot range and the user agent matches, the claim is consistent with the public record. If the source IP is a residential proxy and the user agent says ClaudeBot, the claim is already dead.
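The static layer reduces to a set-membership check. The sketch below uses placeholder documentation ranges (TEST-NET blocks) rather than any operator's real ranges; in production the directory is loaded from the operators' published lists and refreshed on a schedule.

```python
import ipaddress

# Placeholder ranges for illustration only. A real directory is built from
# the operators' published lists (platform.openai.com, docs.anthropic.com, ...).
DIRECTORY = {
    "GPTBot":    [ipaddress.ip_network("192.0.2.0/24")],     # TEST-NET-1
    "ClaudeBot": [ipaddress.ip_network("198.51.100.0/24")],  # TEST-NET-2
}

def static_identity(claimed_agent: str, source_ip: str) -> str:
    """Layer 1: is the user-agent claim consistent with the public record?"""
    ranges = DIRECTORY.get(claimed_agent)
    if ranges is None:
        return "unknown-agent"   # no published record to check against
    ip = ipaddress.ip_address(source_ip)
    if any(ip in net for net in ranges):
        return "consistent"      # claim matches the published range
    return "spoof"               # named-agent claim from outside the range
```

Note the three-way result: a claim that cannot be checked is not the same thing as a claim that failed, and the later layers treat the two differently.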

The second layer is cross-layer consistency. The TLS handshake exposes a JA4 fingerprint. Chrome follows its initial SETTINGS frame with a connection-level WINDOW_UPDATE of roughly 15MB; Firefox uses roughly 12.5MB. GPTBot sends whatever library OpenAI uses, stable across requests. A request claiming to be Chrome with a Python TLS fingerprint and an HTTP/2 setting from a Go library has lied twice. Any single layer is spoofable. The combination is not, because the spoofing toolchains do not cover every layer at once.
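A consistency check is just a count of contradictions against an expected-fingerprint table. The table below is illustrative; the window values are the approximate Chrome and Firefox figures mentioned above, and real tables are built by observing each client family at the edge.

```python
# Expected per-client fingerprints (illustrative values).
EXPECTED = {
    "Chrome":  {"ja4_prefix": "t13d", "h2_window": 15_663_105},
    "Firefox": {"ja4_prefix": "t13d", "h2_window": 12_517_377},
}

def count_lies(claimed: str, ja4: str, h2_window: int) -> int:
    """Count observed layers that contradict the claimed client.
    Two or more means the toolchain lied at least twice."""
    expected = EXPECTED.get(claimed)
    if expected is None:
        return 0  # nothing on record to contradict
    lies = 0
    if not ja4.startswith(expected["ja4_prefix"]):
        lies += 1  # TLS layer disagrees with the claim
    if h2_window != expected["h2_window"]:
        lies += 1  # HTTP/2 layer disagrees with the claim
    return lies
```

A single mismatch can be a version skew; two independent layers disagreeing is the "lied twice" case from the text and is safe to act on.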

The third layer is behavior over time. A real ClaudeBot makes a steady number of requests per second, revisits on a predictable cycle, and stays inside its own IP range. A spoofed ClaudeBot bursts a hundred pages in a minute, drifts across autonomous systems, and stops when its proxy pool exhausts. A rolling window of 50 to 100 requests per source is usually enough to classify a new fingerprint with high confidence.
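The rolling-window idea can be sketched as a per-source tracker. The thresholds here (requests per second, allowed autonomous systems) are assumptions to be tuned per site, not fixed values from any product.

```python
from collections import deque

class BehaviorWindow:
    """Rolling window of request timestamps and origin ASNs for one source.
    Separates a steady crawler from a bursty, proxy-drifting imitation."""

    def __init__(self, size: int = 100, max_rps: float = 5.0, max_asns: int = 1):
        self.times = deque(maxlen=size)
        self.asns = set()
        self.max_rps = max_rps      # illustrative budget, tune per site
        self.max_asns = max_asns

    def observe(self, timestamp: float, asn: int) -> str:
        self.times.append(timestamp)
        self.asns.add(asn)
        if len(self.asns) > self.max_asns:
            return "suspect"        # drifting across autonomous systems
        if len(self.times) >= 2:
            span = self.times[-1] - self.times[0]
            if span > 0 and len(self.times) / span > self.max_rps:
                return "suspect"    # burst rate beyond a crawler's cadence
        return "consistent"
```

Fifty to a hundred observations through a tracker like this is the window the text describes: enough to see both the cadence and any network drift before issuing a verdict.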

Build or buy is a question of directory maintenance more than engineering. Writing a reverse-DNS lookup is a weekend. Keeping IP ranges and user agents for 50 AI crawlers accurate over 24 months is a full-time job. Cloudflare offers AI Crawl Control as a managed layer. DataDome and Kasada maintain commercial directories. Centinel ships fingerprints for the long tail and the majors. The data has to stay fresh.

How to identify a legitimate AI agent

Start with the operators that publish. GPTBot, OAI-SearchBot, and ChatGPT-User have IP ranges and a published user agent. ClaudeBot and Claude-User are documented at docs.anthropic.com. Googlebot and Google-Extended support reverse-DNS and publish IP ranges. Applebot is documented at support.apple.com. Bingbot is documented at bing.com/webmasters. If a request claims to be one of these and the IP sits outside the published range, it is a spoof. Block at the edge and move on.

For operators that publish a user agent but no maintained IP range — PerplexityBot has fluctuated, several smaller AI startups publish nothing — fall back to fingerprint-plus-behavior. If the JA4 is stable across a rolling window, the cadence matches a training crawler, and the path pattern matches a crawl pass rather than a targeted pull, the request is probably legitimate. Log it as unverified-but-consistent.

For agent-on-behalf-of-user traffic — a ChatGPT agent buying a ticket, a Claude for browsing session, a custom agent built on Anthropic's Model Context Protocol — identification moves up a layer. MCP servers expose capabilities to agents over authenticated channels, with bearer tokens at the MCP layer rather than at the HTTP layer. The verification question becomes whether the agent presented a valid token and which operator issued it.
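That question reduces to two checks: is the token valid, and who issued it. The sketch below uses a static table of hypothetical tokens purely for illustration; in practice tokens would be validated against the issuing operator's keys (for example, JWT signature verification), not a lookup table.

```python
# Hypothetical token registry, for illustration only.
ISSUED_TOKENS = {
    "tok-abc123": "openai",
    "tok-def456": "anthropic",
}

def verify_agent_token(authorization_header: str):
    """Return the issuing operator for a valid bearer token, else None."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return None                      # not a bearer credential at all
    return ISSUED_TOKENS.get(token)      # None for unknown/revoked tokens
```

The useful output is the operator name, not a boolean: policy downstream is per operator, so the verdict has to carry identity, not just validity.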

A publisher has to pick a lens. A newsroom CTO cares most about training and retrieval crawlers — those consume archive content and sit at the center of the licensing conversation. A DTC e-commerce platform cares most about agent-on-behalf-of-user traffic, because those agents complete purchases and belong on a verified-allow path. A SaaS docs platform cares most about search and retrieval crawlers, because its content needs to be cited. The signal mix is the same. The policy is not.

How to respond to unverified agents

Verification without a response is a log file. Three responses cover almost every case.

Block the ones that fail basic consistency. A request with a GPTBot user agent from a residential IP, a curl-impersonate TLS fingerprint, and a burst rate of 200 pages per minute is not GPTBot. Drop it at the edge. Origin costs and licensing leakage both come down at the same time.

Challenge the ones in the grey zone. A fingerprint you have not seen before, consistent with a headless browser, hitting pages a human reader would hit. An interstitial check or a proof-of-work challenge separates a curious developer testing an agent from a commercial scraper behind residential proxies. The challenge is cheap for a real user, expensive for a scraped pool.

Verify-and-allow the rest. This is the unpopular answer and usually the correct one. Cloudflare Radar's 2.98% block rate on AI bots across the top one million sites is not a failure of bot management — it is a sign that most operators have concluded blanket blocks cost more than they save. The right default for a verified search indexer, a verified retrieval crawler, or a verified agent with a signed pass is to let it through, log the operator, and watch cumulative volume. Block goes on the table only once a specific operator fails verification or exceeds a budget you set.
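The three responses can be sketched as a single dispatch from verdict to action. The verdict names and inputs are illustrative, not a fixed API; the shape is what matters: block the clear spoofs, challenge the grey zone, allow the verified, with a per-operator budget as the only escape hatch.

```python
def respond(verdict: str, operator_verified: bool, over_budget: bool) -> str:
    """Map a verification verdict to one of the three responses."""
    if verdict == "spoof":
        return "block"       # fails basic consistency: drop at the edge
    if verdict == "suspect" or not operator_verified:
        return "challenge"   # grey zone: unseen fingerprint or inconsistent claim
    if over_budget:
        return "block"       # verified, but past the budget you set
    return "allow"           # verified: let through, log the operator
```

Note that "allow" is the default for the fully verified case, matching the verify-and-allow posture above; block only enters for a verified operator once it exceeds a budget.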

Verification is not the same thing as safety. A legitimate GPTBot operator can run its crawler behind a residential proxy pool. A real Claude-User session can be driven by an abuse script at the other end. Verification confirms the operator. The operator's behavior is a separate check.

This is the layer Centinel implements for publishers who do not want to maintain their own directory. Every request is fingerprinted against 1,600+ AI agent profiles, checked for cross-layer consistency, and dispatched to a per-agent policy — block, allow, challenge, or charge — in under 2ms at the edge. The block list stays current because the fingerprints stay current.

Key takeaways

Verification is not detection. Detection asks whether the client is automated. Verification asks whether the automation is who it says it is, and every figure from Cloudflare Radar, Tollbit, and HUMAN Security points to that gap widening fast.

The signals stack. User agent is free and spoofable. Published IP ranges and reverse-DNS are cheap and hard to spoof if you keep the directory fresh. TLS and HTTP/2 fingerprints catch the long tail. Cryptographic request signing is the future, not the present. Any one signal in isolation is a coin flip. The combination is a verdict.

The response is policy, not a single setting. Block the clear spoofs, challenge the ambiguous, verify-and-allow everything that identifies itself honestly. Cloudflare Radar's 2.98% block rate is the market's current answer.

Centinel runs that decision layer at the edge, with a maintained directory of AI agent fingerprints and a policy engine that fires per agent in under 2ms. For a publisher who does not want to build and maintain the directory, that is the operator's verification pipeline in one component.

See what's crawling your site right now

Run a free audit and get a detailed report of which AI crawlers are accessing your content, delivered within 48 hours.

Get your free audit