robots.txt for AI bots: Complete guide
How to configure robots.txt for AI crawlers — every directive, every major bot, and why robots.txt alone isn't enough.
robots.txt is the web's oldest standard for communicating with crawlers. Originally designed in 1994 for search engines, it now plays a central role in the AI crawler debate. This guide covers everything you need to know about using robots.txt to manage AI bot access.
How robots.txt works
robots.txt is a plain text file at the root of your website (yoursite.com/robots.txt). Crawlers are expected to check this file before crawling and follow its directives. The key directives are User-agent (which crawler the rule applies to), Disallow (paths the crawler should not visit), and Allow (exceptions to disallow rules).
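A minimal file combining these three directives might look like this (the paths are illustrative, not a recommendation):

```
# Block one crawler from the entire site
User-agent: GPTBot
Disallow: /

# All other crawlers: block a private directory,
# but carve out one exception inside it
User-agent: *
Disallow: /private/
Allow: /private/press-kit/
```

Rules are grouped under a User-agent line, and the most specific matching path wins, so the Allow rule here overrides the broader Disallow for that one subdirectory.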
AI crawler user agents
Here are the user agents for major AI crawlers you should know:
GPTBot — OpenAI's crawler for ChatGPT training data.
ClaudeBot — Anthropic's crawler for Claude training data.
Google-Extended — Google's AI training crawler (separate from Googlebot).
Bytespider — ByteDance's crawler used for TikTok and other AI products.
CCBot — Common Crawl's open dataset crawler used by many AI companies.
PerplexityBot — Perplexity AI's search crawler.
Amazonbot — Amazon's crawler for AI features.
FacebookBot — Meta's crawler for AI training.
Applebot-Extended — Apple's AI training crawler.
Sample robots.txt configurations
To block all AI crawlers while allowing search engines, add a User-agent group with Disallow: / for each AI crawler, while leaving Googlebot and Bingbot unrestricted. To allow AI crawlers but restrict them to certain sections, combine Allow and Disallow rules with more granular paths.
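As a sketch of the first approach, using the user agents listed above — consecutive User-agent lines form a single group that shares the rules below them:

```
# Block AI crawlers site-wide
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Amazonbot
User-agent: FacebookBot
User-agent: Applebot-Extended
Disallow: /

# Search engine crawlers remain unrestricted
User-agent: Googlebot
User-agent: Bingbot
Allow: /
```

For the second approach, you would replace Disallow: / in the AI crawler group with path-specific rules, for example Disallow: /articles/ to keep premium content out while leaving the rest of the site open.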
The limits of robots.txt
robots.txt is a request, not a wall. It has no enforcement mechanism. Crawlers that choose to ignore it face no technical barrier. According to Tollbit's data, approximately 32% of AI crawling activity bypasses robots.txt instructions entirely.
Additionally, robots.txt cannot distinguish between AI crawlers that identify themselves honestly and those that disguise their identity with fake user agent strings. It also cannot set different policies for different types of use: a crawler is either allowed on a path or blocked from it, with no way to permit indexing while refusing training.
Beyond robots.txt
For enforceable access control, you need a layer that can identify crawlers regardless of their stated identity and make real-time decisions about whether to allow each request. This is where solutions like Centinel come in — matching requests against 1,600+ crawler fingerprints and enforcing your policy at the edge, whether the crawler identifies itself honestly or not.
Best practices
Update your robots.txt regularly as new AI crawlers emerge. Test your configuration with Google Search Console's robots.txt report. Remember that robots.txt is public — anyone can read it, including the crawlers you are trying to block. Use it as a baseline, not a sole defense.
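You can also sanity-check your rules programmatically. This sketch uses Python's standard-library urllib.robotparser to parse a robots.txt body and verify which user agents can reach a given URL (the rules and URLs here are illustrative):

```python
import urllib.robotparser

# Parse a robots.txt body directly, without fetching it over the network.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines())

# GPTBot is blocked site-wide; everyone else falls through to the * group.
print(rp.can_fetch("GPTBot", "https://example.com/articles/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/articles/post"))  # True
```

Running a check like this in CI after every robots.txt change catches typos (a missing slash, a misspelled user agent) before they silently open or close access in production.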
See what's crawling your site right now
Run a free audit and get a detailed report of which AI crawlers are accessing your content — in 48 hours.
Get your free audit