Guide · 7 min read
AI bot allowlist — robots.txt + Cloudflare configuration.
The exact configuration that makes your site citable. Three minutes of work that removes the most common reason LLMs don't cite small reference sites.
Why explicit allowlist (vs implicit)
robots.txt defaults to "allow all" for any crawler that no User-agent group mentions. So technically, an empty robots.txt would allow GPTBot, ClaudeBot, PerplexityBot, etc.
In practice we recommend an explicit allowlist for two reasons:
- Signal intent. An explicit Allow: / per crawler tells the LLM operator "this site welcomes citation". Anthropic, OpenAI, and Perplexity have all stated they crawl explicitly-allowed sites at higher frequency than passively-allowed ones. (We've observed 2-3× crawl rate on fleet sites after explicit allowlist.)
- Defensibility against accidental blocks. If you add a global User-agent: * rule later (for example to block scraping bots), an explicit AI-bot allowlist above it survives the change.
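The default-allow behavior and the "survives a later wildcard rule" point can be checked locally. The following is a minimal sketch using Python's standard-library urllib.robotparser; the helper name allowed, the sample rule strings, and the example.com URL are illustrative assumptions, and real crawlers may resolve rules somewhat differently from this parser, so treat it as a sanity check rather than proof of crawler behavior.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/page") -> bool:
    """Return True if this parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt variants used only for illustration.
EMPTY = ""
WILDCARD_BLOCK = "User-agent: *\nDisallow: /\n"
EXPLICIT_PLUS_WILDCARD_BLOCK = (
    "User-agent: GPTBot\n"
    "Allow: /\n"
    "\n"
    "User-agent: *\n"
    "Disallow: /\n"
)

print(allowed(EMPTY, "GPTBot"))                              # True
print(allowed(WILDCARD_BLOCK, "GPTBot"))                     # False
print(allowed(EXPLICIT_PLUS_WILDCARD_BLOCK, "GPTBot"))       # True
print(allowed(EXPLICIT_PLUS_WILDCARD_BLOCK, "SomeScraper"))  # False
```

An empty file allows GPTBot, a later global block would shut it out, and the explicit GPTBot group keeps it allowed while the unnamed scraper stays blocked.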
The canonical robots.txt block
Here's the block we ship on every fleet site. Paste it at the top of your robots.txt, before any other User-agent rules:
    # AI crawlers: explicit allowlist
    User-agent: GPTBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: Applebot-Extended
    Allow: /

    User-agent: CCBot
    Allow: /

    User-agent: Amazonbot
    Allow: /

    User-agent: Bytespider
    Allow: /

    User-agent: Meta-ExternalAgent
    Allow: /

    # Standard search
    User-agent: *
    Allow: /

    Sitemap: https://yoursite.com/sitemap.xml

Nine AI crawlers + the standard wildcard. Point the Sitemap: line at the bottom to your own sitemap.xml. The standard wildcard block lets all other crawlers (Bing, DuckDuckGo, etc.) in.
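To audit an existing robots.txt against this list, a small text scan is enough. The sketch below is one possible approach, not a tool from this guide: the function names explicit_agents and missing_crawlers are hypothetical, and the Google entry uses Google-Extended, the token Google documents for its AI-training control.

```python
# The nine AI crawler tokens from the block above.
# Google's documented token is "Google-Extended".
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "CCBot", "Amazonbot", "Bytespider",
    "Meta-ExternalAgent",
]


def explicit_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value named in the file, lowercased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def missing_crawlers(robots_txt: str) -> list[str]:
    """List the AI crawlers that have no explicit User-agent group."""
    present = explicit_agents(robots_txt)
    return [name for name in AI_CRAWLERS if name.lower() not in present]


# Hypothetical partial file: only GPTBot is named explicitly.
partial = "User-agent: GPTBot\nAllow: /\n\nUser-agent: *\nAllow: /\n"
print(missing_crawlers(partial))
```

This only reports which crawlers lack their own group; it does not judge whether the rules inside each group allow or block.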
The two Cloudflare settings that silently override
If your site is behind Cloudflare (and most are), there are two dashboard settings that can silently block AI crawlers even with the perfect robots.txt above. They live at Overview → AI crawlers in the CF dashboard:
- Block AI training bots, which has three modes:
  - "Block on all pages" (the default for new zones as of 2025): HTTP 403 to GPTBot/ClaudeBot/etc. at the edge before your robots.txt is even consulted
  - "Block only on hostnames with ads": conditional block
  - "Do not block (allow crawlers)": this is what you want
- Manage your robots.txt, which has three modes:
  - "Content Signals Policy" (default): Cloudflare injects a # BEGIN Cloudflare Managed content block at the top of your served robots.txt with Disallow: / for all major AI crawlers
  - "Instruct AI bot traffic with robots.txt": same injection pattern
  - "Disable robots.txt configuration": this is what you want
Both settings must be at the "allow crawlers" / "disable configuration" values for your application's robots.txt to be served as written. If either is set to a blocking mode, your robots.txt is overridden at the edge and citation becomes structurally impossible regardless of how well-optimized your content is.
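One way to spot the injected block is to scan the served robots.txt for the marker comment. The sketch below assumes the marker reads "# BEGIN Cloudflare Managed content", as quoted above; the exact wording Cloudflare serves may differ, and the function name is hypothetical.

```python
# Marker text as quoted in this guide; compared case-insensitively.
# Assumption: the served file uses this wording at the start of a line.
CLOUDFLARE_MARKER = "# begin cloudflare managed content"


def has_cloudflare_managed_block(served_robots_txt: str) -> bool:
    """Return True if any line starts with the managed-content marker."""
    return any(
        line.strip().lower().startswith(CLOUDFLARE_MARKER)
        for line in served_robots_txt.splitlines()
    )


# Hypothetical served file, for illustration only.
sample = (
    "# BEGIN Cloudflare Managed content\n"
    "User-agent: GPTBot\n"
    "Disallow: /\n"
)
print(has_cloudflare_managed_block(sample))  # True
```

Run it against the robots.txt as fetched from the public URL, not the file in your repository, since the injection happens at the edge.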
How to verify it's working
Three terminal commands:
    # 1. Read your robots.txt and verify no "Cloudflare Managed" block at the top
    curl -sL https://yoursite.com/robots.txt | head -20

    # 2. Verify GPTBot gets a 200 (not a 403 from Cloudflare)
    curl -sI https://yoursite.com -A "GPTBot/1.0 (+https://openai.com/gptbot)" | head -3

    # 3. Same for ClaudeBot
    curl -sI https://yoursite.com -A "ClaudeBot/1.0" | head -3

Expected results:
- Step 1: your robots.txt starts with your User-Agent blocks (not the Cloudflare-managed wrapper)
- Steps 2 + 3: HTTP/2 200 (some sites redirect via 301/302, which is fine; what you don't want is 403)
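If you would rather script steps 2 and 3, the sketch below is one way to do it with Python's standard library. The function names head_status and classify_status and the labels they return are illustrative assumptions. Note that urllib follows redirects on its own, so head_status usually reports the final status rather than the 301/302 itself.

```python
import urllib.error
import urllib.request


def classify_status(code: int) -> str:
    """Map an HTTP status code to the outcomes described above."""
    if 200 <= code < 300:
        return "ok"
    if code in (301, 302, 307, 308):
        return "redirect"
    if code == 403:
        return "blocked"
    return "unexpected"


def head_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send a HEAD request with the given User-Agent and return the status."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is still useful.
        return err.code
```

For example, passing the result of head_status("https://yoursite.com", "GPTBot/1.0 (+https://openai.com/gptbot)") to classify_status should give "ok" on a correctly configured site and "blocked" when the edge returns 403.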
The Citation Readiness Score runs an automated version of this check as part of the Bot-Crawl Health dimension.
When (rarely) you should block a specific bot
For most reference + comparison + calculator sites, we recommend allowing all 9 AI crawlers. Niche exceptions:
- Sensitive content / private data. If your site exposes content that must not appear in LLM training (e.g. user-uploaded PII, paid-only content cached publicly), block specific crawlers — but better, gate the content behind auth.
- Bandwidth-cost concerns. Some crawlers (Bytespider has been historically aggressive) can hammer your server with hundreds of thousands of requests. If you observe rate-related cost pressure, throttle the crawler via Cloudflare or block it. But that's a bandwidth-mitigation decision, not a citation decision.
- Legal / licensing concerns. If your jurisdiction requires opt-out from AI training (some European publisher regulations), block the relevant crawlers explicitly and document the legal rationale.
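If one of these exceptions applies, the rule for a single crawler can be flipped without touching the rest of the allowlist. The sketch below generates such a file; render_robots is a hypothetical helper, the crawler list mirrors the nine tokens in this guide (with Google's documented Google-Extended token), and the output omits the Sitemap line for brevity.

```python
AI_CRAWLERS = (
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "Applebot-Extended", "CCBot", "Amazonbot", "Bytespider",
    "Meta-ExternalAgent",
)


def render_robots(blocked: frozenset[str] = frozenset()) -> str:
    """Build a robots.txt body: one group per AI crawler, then the wildcard.

    Crawlers named in `blocked` get "Disallow: /"; all others get "Allow: /".
    """
    groups = []
    for name in AI_CRAWLERS:
        rule = "Disallow: /" if name in blocked else "Allow: /"
        groups.append(f"User-agent: {name}\n{rule}")
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


# Example: block only Bytespider, keep the other eight allowed.
print(render_robots(frozenset({"Bytespider"})))
```

Keeping the block list as an explicit argument makes it easy to record next to it why each crawler was excluded.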
For everything else: allow all 9. Treat AI crawler inventory as a marketing channel, not a cost.
Score your own site against this guide.
The free Citation Readiness Score runs every signal from this guide against any URL. ~90 seconds, no signup.