robots.txt for AI bots: Complete guide
How to configure robots.txt for AI crawlers — every directive, every major bot, and why robots.txt alone isn't enough.
robots.txt is the web's oldest standard for communicating with crawlers. Originally designed in 1994 for search engines, it now plays a central role in the AI crawler debate. This guide covers everything you need to know about using robots.txt to manage AI bot access.
How robots.txt works
robots.txt is a plain text file at the root of your website (yoursite.com/robots.txt). Crawlers are expected to check this file before crawling and follow its directives. The key directives are User-agent (which crawler the rule applies to), Disallow (paths the crawler should not visit), and Allow (exceptions to disallow rules).
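A minimal file combining these three directives might look like this (the paths are illustrative, not a recommendation):

```
# Block one crawler from the entire site
User-agent: GPTBot
Disallow: /

# All other crawlers: block a private directory,
# but carve out one exception inside it
User-agent: *
Disallow: /private/
Allow: /private/press-kit/
```

Rules are grouped under a User-agent line, and the most specific matching path wins, so the Allow rule here overrides the broader Disallow for that one subdirectory.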
AI crawler user agents
Here are the user agents for major AI crawlers you should know:
GPTBot — OpenAI's crawler for ChatGPT training data.
ClaudeBot — Anthropic's crawler for Claude training data.
Google-Extended — Google's AI training crawler (separate from Googlebot).
Bytespider — ByteDance's crawler used for TikTok and other AI products.
CCBot — Common Crawl's open dataset crawler used by many AI companies.
PerplexityBot — Perplexity AI's search crawler.
Amazonbot — Amazon's crawler for AI features.
FacebookBot — Meta's crawler for AI training.
Applebot-Extended — Apple's AI training crawler.
Sample robots.txt configurations
To block all AI crawlers while allowing search engines, add a User-agent group with Disallow: / for each AI crawler, while leaving Googlebot and Bingbot unrestricted. To allow AI crawlers but restrict them to certain sections, combine Allow and Disallow rules with more granular paths.
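As a sketch of the first approach, using the user agents listed above — consecutive User-agent lines form a single group that shares the rules below them:

```
# Block AI crawlers site-wide
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: FacebookBot
User-agent: Applebot-Extended
Disallow: /

# Search engine crawlers remain unrestricted
User-agent: Googlebot
User-agent: Bingbot
Allow: /
```

For the second approach, you would replace Disallow: / in the AI crawler group with path-specific rules, for example Disallow: /articles/ to keep premium content out while leaving the rest of the site open.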
The limits of robots.txt
robots.txt is a request, not a wall. It has no enforcement mechanism. Crawlers that choose to ignore it face no technical barrier. According to Tollbit's data, approximately 32% of AI crawling activity bypasses robots.txt instructions entirely.
Additionally, robots.txt cannot distinguish between AI crawlers that identify themselves honestly and those that disguise their identity with fake user agent strings. It also cannot set different policies for different types of use: a crawler is either allowed on a path or blocked from it, with no way to permit indexing while refusing training.
Beyond robots.txt
For enforceable access control, you need a layer that can identify crawlers regardless of their stated identity and make real-time decisions about whether to allow each request. This is where solutions like Centinel come in — matching requests against 1,600+ crawler fingerprints and enforcing your policy at the edge, whether the crawler identifies itself honestly or not.
Best practices
Update your robots.txt regularly as new AI crawlers emerge. Test your configuration with Google Search Console's robots.txt report. Remember that robots.txt is public — anyone can read it, including the crawlers you are trying to block. Use it as a baseline, not a sole defense.
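You can also sanity-check your rules programmatically. This sketch uses Python's standard-library urllib.robotparser to parse a robots.txt body and verify which user agents can reach a given URL (the rules and URLs here are illustrative):

```python
import urllib.robotparser

# Parse a robots.txt body directly, without fetching it over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines())

# GPTBot is blocked site-wide; everyone else falls through to the * group.
print(rp.can_fetch("GPTBot", "https://example.com/articles/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/post"))  # True
```

Running a check like this in CI after every robots.txt change catches typos (a missing slash, a misspelled user agent) before they silently open or close access in production.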
See what's crawling your site right now
Run a free audit and get a detailed report of which AI crawlers are accessing your content — in 48 hours.
Get your free audit