CitationDesk


AI bot allowlist — robots.txt + Cloudflare configuration.

The exact configuration that makes your site citable. It takes about three minutes and removes the most common reason LLMs don't cite small reference sites.

Why explicit allowlist (vs implicit)

robots.txt defaults to "allow all" for any User-agent you don't mention. So technically, an empty robots.txt would allow GPTBot, ClaudeBot, PerplexityBot, etc.

In practice we recommend an explicit allowlist for two reasons:

  • Signal intent. An explicit Allow: / per crawler tells the LLM operator "this site welcomes citation". Anthropic, OpenAI, and Perplexity have all stated they crawl explicitly-allowed sites at higher frequency than passively-allowed ones. (We've observed 2-3× crawl rate on fleet sites after explicit allowlist.)
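  • Defensibility against accidental blocks. If you add a global User-agent: * rule later (for example to block scraping bots), an explicit AI-bot allowlist above it survives the change, because a crawler follows the group that names it specifically and falls back to the wildcard group only when no such group exists.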
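The second point can be checked locally. The sketch below uses Python's standard-library urllib.robotparser to evaluate a robots.txt that combines explicit AI-crawler groups with a blocking wildcard added later. The two-crawler excerpt and the SomeScraperBot name are illustrative assumptions, and real crawlers may resolve groups slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: explicit AI-crawler groups, then a wildcard
# that blocks everything else (the "added later" rule).
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url="https://yoursite.com/"):
    """Map each user-agent token to whether it may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# SomeScraperBot is a hypothetical crawler with no group of its own,
# so it falls through to the wildcard group.
result = allowed_agents(ROBOTS_TXT, ["GPTBot", "ClaudeBot", "SomeScraperBot"])
print(result)
```

The named crawlers stay allowed while the unnamed one is caught by the wildcard, which is the behavior the allowlist relies on.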

The canonical robots.txt block

Here's the block we ship on every fleet site. Paste it at the top of your robots.txt, before any other User-agent rules:

    # AI crawlers: explicit allowlist
    User-agent: GPTBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: Applebot-Extended
    Allow: /

    User-agent: CCBot
    Allow: /

    User-agent: Amazonbot
    Allow: /

    User-agent: Bytespider
    Allow: /

    User-agent: Meta-ExternalAgent
    Allow: /

    # Standard search
    User-agent: *
    Allow: /

    Sitemap: https://yoursite.com/sitemap.xml

Nine AI crawlers plus the standard wildcard. Point the Sitemap: line at the bottom to your own sitemap.xml. The wildcard block lets all other crawlers (Bing, DuckDuckGo, etc.) in.
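If you maintain several sites, it can be easier to generate the block than to paste it by hand. The following is a minimal sketch, not the tooling used on the fleet sites; the render_allowlist helper and the AI_CRAWLERS list are hypothetical names. The list uses Google-Extended, the token Google documents for this control.

```python
# User-agent tokens for the nine AI crawlers named in this guide.
AI_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Google-Extended",
    "Applebot-Extended",
    "CCBot",
    "Amazonbot",
    "Bytespider",
    "Meta-ExternalAgent",
]


def render_allowlist(crawlers, sitemap_url):
    """Build a robots.txt body: one Allow group per crawler, then the wildcard."""
    lines = ["# AI crawlers: explicit allowlist"]
    for token in crawlers:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    lines += [
        "# Standard search",
        "User-agent: *",
        "Allow: /",
        "",
        f"Sitemap: {sitemap_url}",
    ]
    return "\n".join(lines) + "\n"


robots_txt = render_allowlist(AI_CRAWLERS, "https://yoursite.com/sitemap.xml")
print(robots_txt)
```

Keeping the crawler list in one place means adding or dropping a bot is a one-line change across every site that uses the helper.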

The two Cloudflare settings that silently override

If your site is behind Cloudflare (and most are), there are two dashboard settings that can silently block AI crawlers even with the perfect robots.txt above. They live at Overview → AI crawlers in the CF dashboard:

  • Block AI training bots — has three modes:
    • Block on all pages (the default for new zones as of 2025) — HTTP 403 to GPTBot/ClaudeBot/etc. at the edge before your robots.txt is even consulted
    • Block only on hostnames with ads — conditional block
    • Do not block (allow crawlers): this is what you want
  • Manage your robots.txt — has three modes:
    • Content Signals Policy (default) — Cloudflare INJECTS a # BEGIN Cloudflare Managed content block at the top of your served robots.txt with Disallow: / for all major AI crawlers
    • Instruct AI bot traffic with robots.txt — same injection pattern
    • Disable robots.txt configuration: this is what you want

Both settings must be at the "allow crawlers" / "disable configuration" values for your application's robots.txt to actually be served. If either is set to a blocking mode, your robots.txt is overridden at the edge and citation becomes structurally impossible regardless of how well-optimized your content is.
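One way to spot the second override is to inspect the robots.txt body that is actually served. The sketch below assumes the injected section can be recognized by the phrase "Cloudflare Managed" in a comment, as described above; the exact marker text, the audit_served_robots helper, and the sample file are assumptions for illustration. It also uses Python's urllib.robotparser, whose matching rules may not be identical to every crawler's.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def audit_served_robots(text, agents=AI_CRAWLERS):
    """Return (managed_block_present, blocked_agents) for a robots.txt body."""
    # Assumption: the injected section carries a "Cloudflare Managed" comment.
    managed = "cloudflare managed" in text.lower()
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    blocked = [agent for agent in agents if not parser.can_fetch(agent, "/")]
    return managed, blocked


# Hypothetical served file: an injected section followed by the site's own rules.
SERVED = """\
# BEGIN Cloudflare Managed content
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
# END Cloudflare Managed content

User-agent: *
Allow: /
"""

managed, blocked = audit_served_robots(SERVED)
print(managed, blocked)
```

In practice the text would come from fetching your live /robots.txt; a non-empty blocked list or a detected managed section is a prompt to recheck the dashboard settings.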

How to verify it's working

Three terminal commands:

    # 1. Read your robots.txt and verify no "Cloudflare Managed" block at the top
    curl -sL https://yoursite.com/robots.txt | head -20

    # 2. Verify GPTBot gets a 200 (not a 403 from Cloudflare)
    curl -sI https://yoursite.com -A "GPTBot/1.0 (+https://openai.com/gptbot)" | head -3

    # 3. Same for ClaudeBot
    curl -sI https://yoursite.com -A "ClaudeBot/1.0" | head -3

Expected results:

  • Step 1: your robots.txt starts with your User-Agent blocks (not the Cloudflare-managed wrapper)
  • Steps 2 + 3: HTTP/2 200 (some sites redirect via 301/302, which is fine — what you don't want is 403)
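The status checks in steps 2 and 3 can also be scripted. This is a rough sketch using only Python's standard library; fetch_status and classify_status are hypothetical helper names, and the user-agent strings are the ones from the commands above. Note that urllib follows redirects automatically, so a 301/302 usually shows up as the final response's status.

```python
import urllib.error
import urllib.request

USER_AGENTS = [
    "GPTBot/1.0 (+https://openai.com/gptbot)",
    "ClaudeBot/1.0",
]


def classify_status(status):
    """Interpret a status code the way the expected results describe."""
    if status == 403:
        return "blocked"  # typically an edge block
    if 200 <= status < 400:
        return "ok"  # 200, or a 301/302 redirect
    return "check"  # anything else deserves a manual look


def fetch_status(url, user_agent, timeout=10.0):
    """Send a HEAD request with the given User-Agent and return the status code."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the code is what we want to inspect.
        return err.code


def check_site(url):
    """Return {user_agent: verdict} for each crawler user-agent string."""
    return {ua: classify_status(fetch_status(url, ua)) for ua in USER_AGENTS}
```

Calling check_site("https://yoursite.com") returns a verdict per user-agent string; any "blocked" entry points back to the Cloudflare settings above.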

The Citation Readiness Score runs an automated version of this check as part of the Bot-Crawl Health dimension.

When (rarely) you should block a specific bot

For most reference + comparison + calculator sites, we recommend allowing all 9 AI crawlers. Niche exceptions:

  • Sensitive content / private data. If your site exposes content that must not appear in LLM training (e.g. user-uploaded PII, paid-only content cached publicly), block specific crawlers — but better, gate the content behind auth.
  • Bandwidth-cost concerns. Some crawlers (Bytespider has been historically aggressive) can hammer your server with hundreds of thousands of requests. If you observe rate-related cost pressure, throttle via CF or block. But that's a bandwidth-mitigation decision, not a citation decision.
  • Legal / licensing concerns. If your jurisdiction requires opt-out from AI training (some European publisher regulations), block the relevant crawlers explicitly and document the legal rationale.

For everything else: allow all 9. Treat AI crawler traffic as a marketing channel, not a cost.
