
Content-Signal: declaring AI usage policy

The robots.txt extension that lets you allow search but block training. The exact syntax, valid combinations, and what happens when vendors honor it.

Content-Signal is the robots.txt extension that finally separates "appears in search" from "feeds an AI model" from "is quoted by an answer engine". Without it, you had to choose all-or-nothing per crawler. With it, you can say what you actually mean in one line.

This guide is the syntax reference, the eight valid combinations, and what each combination signals to AI vendors.

The wire format

A single directive inside a User-agent group:

User-agent: *
Allow: /
Content-Signal: ai-train=no, search=yes, ai-input=no

Three keys, two values each. Comma-separated. Place inside a User-agent block, never above the first one.

Key       What it controls
ai-train  Whether this content can be used to train AI models
search    Whether this content can be indexed for search results, including AI search
ai-input  Whether this content can be used as live input for AI features (RAG, answer engines)
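No reference parser is published for the directive; as a sketch of how a consumer might read the value (key and value names taken from the table above, strict validation assumed):

```python
def parse_content_signal(value: str) -> dict[str, bool]:
    """Parse a Content-Signal value like 'ai-train=no, search=yes, ai-input=no'
    into a dict mapping each key to True (yes) or False (no)."""
    allowed_keys = {"ai-train", "search", "ai-input"}
    signals = {}
    for pair in value.split(","):
        key, _, val = pair.strip().partition("=")
        key, val = key.strip().lower(), val.strip().lower()
        if key not in allowed_keys or val not in ("yes", "no"):
            raise ValueError(f"invalid signal: {pair.strip()!r}")
        signals[key] = (val == "yes")
    return signals

print(parse_content_signal("ai-train=no, search=yes, ai-input=no"))
```

Keys are case-insensitive here; whether real consumers accept partial key sets or unusual whitespace is up to each vendor.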

The eight valid combinations

Three keys, two values each, eight combinations. Most sites only ship two or three of them.

1. Open everything

Content-Signal: ai-train=yes, search=yes, ai-input=yes

The default behavior of most public sites today. Maximum visibility, maximum AI exposure.

2. Public site, no training, no live AI input

Content-Signal: ai-train=no, search=yes, ai-input=no

The most common stance for original content sites in 2026. Stay searchable, refuse to be a training dataset. The ai-input=no also blocks live citations in answer engines: use this when you want SERP traffic but not AI summaries.

3. Training and live input, no search

Content-Signal: ai-train=yes, search=no, ai-input=yes

Niche. The content may train models and feed live AI answers, but stays out of search indexes.

4. Searchable, AI-quotable, not training

Content-Signal: ai-train=no, search=yes, ai-input=yes

A nuanced middle ground. Lets answer engines like Perplexity quote your content with a citation, but blocks training. Recommended for many publishers.

5. Training only

Content-Signal: ai-train=yes, search=no, ai-input=no

Niche. You want your content in models but not in search. Real-world examples: dataset providers and academic corpora.

6. Block everything

Content-Signal: ai-train=no, search=no, ai-input=no

Removes you from AI-mediated discovery and search. Rare and intentional.

7. Search only, plus training

Content-Signal: ai-train=yes, search=yes, ai-input=no

Allow training, allow search, block live AI input. Uncommon.

8. AI input only

Content-Signal: ai-train=no, search=no, ai-input=yes

Niche. You want to be quoted by AI but not crawled for training or indexed for search. Hard to enforce in practice.
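The combination space is small enough to enumerate mechanically; a quick Python sketch confirming the count:

```python
from itertools import product

# Three keys, two values each: 2^3 = 8 distinct directives.
keys = ("ai-train", "search", "ai-input")
combos = [
    ", ".join(f"{k}={v}" for k, v in zip(keys, values))
    for values in product(("yes", "no"), repeat=3)
]
for line in combos:
    print(f"Content-Signal: {line}")
```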

The default if you do nothing

No Content-Signal line means no signal. Vendors fall back to their default behavior, usually "ai-train=yes, search=yes, ai-input=yes" unless the User-agent rules tell them otherwise.

If you want to opt out of training, you must say so explicitly. Silence is consent.
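Assuming the permissive fallback described above (actual defaults vary by vendor), the resolution a vendor might apply looks like:

```python
# Assumed vendor default when no signal is declared; not specified anywhere.
PERMISSIVE_DEFAULTS = {"ai-train": True, "search": True, "ai-input": True}

def resolve_signals(declared):
    """Merge explicitly declared signals over permissive defaults.
    No directive at all leaves every signal at its default."""
    resolved = dict(PERMISSIVE_DEFAULTS)
    if declared:
        resolved.update(declared)
    return resolved

# Silence is consent: no declaration resolves to all-yes.
print(resolve_signals(None))
# A partial declaration only overrides the keys it names.
print(resolve_signals({"ai-train": False}))
```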

Where to place the directive

Always inside a User-agent group. The most common placement:

User-agent: *
Allow: /
Content-Signal: ai-train=no, search=yes, ai-input=no

User-agent: GPTBot
Disallow: /

The wildcard group sets the default. The specific group adds belt-and-suspenders enforcement for one bot.

You can also add a per-bot Content-Signal line if you want a different policy per vendor:

User-agent: GPTBot
Content-Signal: ai-train=no, search=yes, ai-input=no

User-agent: PerplexityBot
Content-Signal: ai-train=no, search=yes, ai-input=yes
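How a consumer might associate each directive with its User-agent group can be sketched in Python, assuming a simplified grammar (real robots.txt parsers also handle blank-line group boundaries, sitemaps, and malformed lines more carefully):

```python
def signals_by_agent(robots_txt: str) -> dict[str, str]:
    """Map each User-agent to the raw Content-Signal value in its group."""
    signals = {}
    agents = []
    prev_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not prev_was_agent:
                agents = []          # a fresh run of User-agent lines starts a new group
            agents.append(value)
            prev_was_agent = True
        else:
            prev_was_agent = False
            if field == "content-signal":
                for agent in agents:
                    signals[agent] = value
    return signals

example = """\
User-agent: GPTBot
Content-Signal: ai-train=no, search=yes, ai-input=no

User-agent: PerplexityBot
Content-Signal: ai-train=no, search=yes, ai-input=yes
"""
print(signals_by_agent(example))
```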

What this does not do

  • Does not enforce anything. Compliance is voluntary, like the rest of robots.txt. Most reputable AI vendors honor it. Some do not.
  • Does not affect old training data. A model already trained on your content is unaffected. The signal only governs future fetches.
  • Does not block determined scraping. If you need actual access control, layer in IP allow lists, Web Bot Auth, and rate limits on top.

Compatibility with per-bot rules

Content-Signal is parsed by vendors that opt in. Vendors that ignore it fall back to their User-agent group rules. Both patterns can coexist without conflict.

A safe defensive pattern: declare your intent in Content-Signal AND add explicit User-agent groups for the bots you want to allow or block.

Verifying

After deploying:

curl -s https://yourdomain.com/robots.txt | grep -i 'content-signal'

You should see your line. Confirm the values match what you intended. There is no public testing harness for vendor compliance with Content-Signal yet, so the verification ends there.
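If you want more than a grep, the same check can be scripted; a sketch that pulls the declared value out of a robots.txt body (feed it the curl output or read the file directly):

```python
def extract_signal(robots_txt: str):
    """Return the first Content-Signal value in a robots.txt body, or None."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "content-signal":
            return value.strip()
    return None

body = """\
User-agent: *
Allow: /
Content-Signal: ai-train=no, search=yes, ai-input=no
"""
expected = "ai-train=no, search=yes, ai-input=no"
found = extract_signal(body)
assert found == expected, f"robots.txt declares {found!r}, expected {expected!r}"
```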

Build it

Use the Content-Signal Builder to compose a directive and copy a robots.txt block. Pair it with the robots.txt Generator for the full crawler-policy file.

What we ship at AgentScan

This site uses Content-Signal: ai-train=no, search=yes, ai-input=no under the wildcard group, plus explicit Allow rules for the major AI bots so the on-demand fetching path works. The intent: public visibility, no training contribution.

If you have not declared your intent yet, you have not declined training. Declining is a one-line change.