How to block AI crawlers
A practical walkthrough of every method available — from robots.txt to edge-level blocking — with the tradeoffs of each.
You can block AI crawlers with two lines in robots.txt. That stops the compliant ones. For the rest, you need request-level detection — TLS fingerprinting, header analysis, or JavaScript challenges. Here are the methods, ordered from simplest to most effective.
Method 1: robots.txt
The simplest approach. Add directives to your robots.txt file telling specific crawlers not to visit your site. For example, to block GPTBot, add "User-agent: GPTBot" followed by "Disallow: /" to your robots.txt.
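To check how a compliant crawler would read the rule, here is a minimal sketch using Python's standard-library robots.txt parser. The domain and the second bot name are placeholders, not real services.

```python
from urllib.robotparser import RobotFileParser

# The two directives described above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and "SomeOtherBot" are placeholders for illustration.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1")
```

The rule only binds a crawler that both honors robots.txt and announces itself as GPTBot. Any other user agent falls through to "allowed", which is the weakness the cons below describe.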
Pros: Takes 30 seconds. No code changes. Universally understood standard. Cons: Purely voluntary. TollBit data shows 32% of AI scrapes ignore robots.txt entirely. No enforcement mechanism. Does not stop crawlers that use fake user agents.
Method 2: HTTP header checks
Inspect the User-Agent header on incoming requests and reject known AI crawler signatures. This can be done at the web server level (Nginx, Apache) or in application code.
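In application code, the check might look like the following sketch. The token list is illustrative and incomplete, and `status_for_request` is a hypothetical helper rather than part of any framework.

```python
from typing import Mapping, Optional

# Illustrative substrings only; a real deployment would maintain a longer,
# regularly updated list.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Bytespider")


def is_known_ai_crawler(user_agent: Optional[str]) -> bool:
    """Case-insensitive substring match against the known crawler tokens."""
    if not user_agent:
        return False
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


def status_for_request(headers: Mapping[str, str]) -> int:
    """Return 403 for known AI crawler user agents, 200 otherwise."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    return 403 if is_known_ai_crawler(normalised.get("user-agent")) else 200
```

A request with a missing or generic User-Agent passes this check untouched, which is why the method is easy to bypass.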
Pros: Simple to implement. Works at the server level. Cons: Trivially bypassed by changing the user agent string. Many AI crawlers already use generic or misleading user agents.
Method 3: IP blocking
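Block IP ranges known to belong to AI companies. Some vendors, OpenAI among them, publish the IP ranges their crawlers use; others do not, so check each vendor's documentation before relying on this method.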
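A minimal sketch of the membership check, using Python's standard `ipaddress` module. The CIDR blocks below are reserved documentation ranges standing in for a vendor's published list; they are not real crawler addresses.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges (RFC 5737).
# In practice these would be loaded from each vendor's published list.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed address: this sketch does not block it here and leaves
        # the decision to another layer.
        return False
    return any(address in network for network in BLOCKED_NETWORKS)
```

Because published ranges change, the list would need to be refreshed on a schedule rather than hard-coded as it is here.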
Pros: Harder to bypass than user agent checks. Can be implemented at the firewall level. Cons: IP ranges change frequently. Many scrapers use residential proxy networks, making IP blocking ineffective. Can accidentally block legitimate users sharing the same IP range.
Method 4: Rate limiting
Limit the number of requests from a single IP or session within a time window. AI crawlers typically make many more requests than human visitors.
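One common way to implement this is a sliding-window counter per client. The sketch below keeps state in memory for a single process; the class name, limits, and client key are assumptions for illustration, and a production setup would more likely use a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record the request and return True, or return False if over limit."""
        now = self._clock()
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical limit: 60 requests per minute, keyed by client IP.
limiter = SlidingWindowLimiter(max_requests=60, window_seconds=60.0)
```

Keying on IP alone is what makes this easy to evade with distributed requests; the limiter has no way to tell that a thousand addresses belong to one scraper.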
Pros: Reduces scraping volume without blocking entirely. Relatively easy to implement. Cons: Sophisticated scrapers distribute requests across thousands of IPs. Aggressive rate limits can affect legitimate users. Does not identify or classify the requester.
Method 5: JavaScript challenges
Require visitors to execute JavaScript to access content. Many basic crawlers cannot render JavaScript.
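The article does not specify a challenge type. One possible form is a proof-of-work puzzle: the server issues a random nonce, the page's JavaScript must find a counter whose hash meets a difficulty target, and the server verifies the answer cheaply. The sketch below shows the server side in Python, with `solve_challenge` standing in for the browser-side script; the difficulty value and function names are hypothetical.

```python
import hashlib
import secrets

# Required number of leading zero hex digits in the hash (hypothetical value).
DIFFICULTY = 3


def issue_challenge() -> str:
    """Generate a random nonce to embed in the challenge page."""
    return secrets.token_hex(16)


def _digest(nonce: str, counter: int) -> str:
    return hashlib.sha256(f"{nonce}:{counter}".encode("utf-8")).hexdigest()


def verify_solution(nonce: str, counter: int, difficulty: int = DIFFICULTY) -> bool:
    """Server-side check: one hash, regardless of how long solving took."""
    return _digest(nonce, counter).startswith("0" * difficulty)


def solve_challenge(nonce: str, difficulty: int = DIFFICULTY) -> int:
    """Stand-in for the browser-side JavaScript: brute-force the counter."""
    counter = 0
    while not verify_solution(nonce, counter, difficulty):
        counter += 1
    return counter
```

A real deployment would also need to remember which nonces it issued and expire them, which this sketch omits. A headless browser solves the puzzle as easily as a real one, so this only filters clients that cannot run scripts at all.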
Pros: Stops simple HTTP-based scrapers. Cons: Modern scraping tools (Playwright, Puppeteer) render JavaScript fully. Adds latency for real users. Can break SEO if not implemented carefully.
Method 6: Edge-level detection and blocking
Deploy a detection layer at the CDN or edge level that analyzes every request in real time. This combines TLS fingerprinting, behavioral analysis, IP reputation, device fingerprinting, and crawler database matching to identify and block AI crawlers before they reach your origin.
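To make the idea of combining signals concrete, here is a deliberately simplified sketch. It is not Centinel's or any provider's actual method: the signal fields, weights, and threshold are all invented for illustration, and each signal is assumed to have been computed upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSignals:
    """Hypothetical per-request signals, assumed to be computed upstream."""
    tls_fingerprint_mismatch: bool     # TLS handshake does not fit the claimed client
    behavior_score: float              # 0.0 (human-like) to 1.0 (bot-like)
    ip_reputation: float               # 0.0 (clean) to 1.0 (known bad)
    device_fingerprint_anomaly: bool   # device fingerprint looks inconsistent
    matches_crawler_db: bool           # request matches a known crawler signature


# Invented weights that sum to 1.0; real systems tune these on traffic data.
WEIGHTS = {
    "tls": 0.30,
    "behavior": 0.20,
    "ip": 0.15,
    "device": 0.15,
    "crawler_db": 0.20,
}
BLOCK_THRESHOLD = 0.5  # hypothetical cut-off


def risk_score(s: RequestSignals) -> float:
    """Weighted sum of the signals, in the range 0.0 to 1.0."""
    return (
        WEIGHTS["tls"] * float(s.tls_fingerprint_mismatch)
        + WEIGHTS["behavior"] * s.behavior_score
        + WEIGHTS["ip"] * s.ip_reputation
        + WEIGHTS["device"] * float(s.device_fingerprint_anomaly)
        + WEIGHTS["crawler_db"] * float(s.matches_crawler_db)
    )


def decide(s: RequestSignals) -> str:
    """Return "block" or "allow" for a request."""
    # A known crawler signature is treated as decisive on its own.
    if s.matches_crawler_db:
        return "block"
    return "block" if risk_score(s) >= BLOCK_THRESHOLD else "allow"
```

The point of the design is that no single signal has to be reliable: a scraper with a spoofed user agent and a residential IP can still be caught by a TLS handshake and request pattern that do not match a real browser.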
Pros: Comprehensive detection. Catches crawlers regardless of user agent or IP. Sub-2ms latency. No impact on legitimate users. Cons: Requires a specialized provider. More complex than self-hosted solutions.
Centinel operates at this level, identifying 1,600+ crawler signatures and enforcing decisions at the edge in under 2ms.
Which method should you use?
Start with robots.txt as a baseline — it costs nothing and handles well-behaved crawlers. For real protection, you need edge-level detection. The gap between robots.txt (voluntary) and edge detection (enforced) is where most content theft happens.