
Every AI bot user agent worth knowing in 2026

The complete reference: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and more. Owner, purpose, and recommended directive for each.


A working reference for every AI crawler your site is likely to see. Each entry covers the user-agent string, the company behind it, what it actually does, and the most common policy stance. Use it to write a robots.txt that does what you mean.

OpenAI

GPTBot

  • Owner: OpenAI
  • Purpose: Training data crawler for OpenAI models.
  • Common policy: Disallow if you want to opt out of training.
  • Notes: This is the bot that crawls public pages to build training datasets. Honors robots.txt. Distinct from the other OpenAI agents.

ChatGPT-User

  • Owner: OpenAI
  • Purpose: On-demand fetch when a ChatGPT user asks the assistant to read a URL.
  • Common policy: Allow. The traffic is user-initiated and produces high-intent referrals.

OAI-SearchBot

  • Owner: OpenAI
  • Purpose: Indexing for OpenAI search products like ChatGPT search.
  • Common policy: Allow if you want OpenAI search visibility. Disallow if you do not.
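To make the split between the three OpenAI agents concrete, here is a robots.txt fragment for one common stance: opt out of training while keeping user-initiated fetches and search indexing. This is a sketch, not a recommendation; adjust each directive to your own policy.

```txt
# Opt out of OpenAI training, keep user fetches and search indexing
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```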

Anthropic

ClaudeBot

  • Owner: Anthropic
  • Purpose: Primary content crawler. Used for training and retrieval.
  • Common policy: Disallow to opt out of Anthropic training.

anthropic-ai

  • Owner: Anthropic
  • Purpose: Older user-agent identifier still seen in some traffic.
  • Common policy: Match whatever you set for ClaudeBot for consistency.

Claude-Web

  • Owner: Anthropic
  • Purpose: On-demand fetch when a Claude user asks the assistant to read a URL.
  • Common policy: Allow. User-initiated.
  • Notes: Anthropic's documentation also lists Claude-User for user-initiated fetches; apply the same policy to both identifiers.

Google

Googlebot

  • Owner: Google
  • Purpose: Standard Google Search crawler.
  • Common policy: Allow. Blocking it removes you from Google Search.

Google-Extended

  • Owner: Google
  • Purpose: Opt-out token for Gemini and other Google AI training. Does not affect Google Search.
  • Common policy: Disallow if you want to opt out of Gemini training while keeping Google Search.

This is the most misunderstood token on the list. Site owners who want to "block Gemini" sometimes block Googlebot by mistake and lose all of their search traffic. Use Google-Extended instead.
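In robots.txt terms, the safe pattern is two separate groups, one per token. A minimal sketch:

```txt
# Keep Google Search indexing
User-agent: Googlebot
Allow: /

# Opt out of Gemini and other Google AI training
User-agent: Google-Extended
Disallow: /
```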

AdsBot-Google

  • Owner: Google
  • Purpose: Auditing landing-page quality for Google Ads.
  • Common policy: Allow if you run Google Ads.

Mediapartners-Google

  • Owner: Google
  • Purpose: Google AdSense crawler for ad targeting.
  • Common policy: Allow if you run AdSense.

Apple

Applebot

  • Owner: Apple
  • Purpose: Apple Search and Siri.
  • Common policy: Allow. Powers Spotlight and Siri results.

Applebot-Extended

  • Owner: Apple
  • Purpose: Opt-out token for Apple Intelligence training. Does not affect Apple Search.
  • Common policy: Disallow to opt out of Apple Intelligence training.

Same pattern as Google-Extended: a separate token governs AI training, so search keeps working.

Microsoft

Bingbot

  • Owner: Microsoft
  • Purpose: Bing Search.
  • Common policy: Allow. Powers Bing, DuckDuckGo, Yahoo, and Microsoft Copilot results.

MSNBot-Media (legacy)

Largely deprecated in 2026. Bingbot covers most use cases now.

Other AI vendors

PerplexityBot

  • Owner: Perplexity
  • Purpose: Indexing content for Perplexity answers.
  • Common policy: Allow if you want citations in Perplexity. Disallow if you do not.
  • Notes: Perplexity is one of the most active answer engines. Allowing it usually produces visible referral traffic.

Bytespider

  • Owner: ByteDance
  • Purpose: Training data crawler for ByteDance models (Doubao and others).
  • Common policy: Disallow if you do not want to feed ByteDance training. Has a history of aggressive crawling.

CCBot

  • Owner: Common Crawl Foundation
  • Purpose: Building the public Common Crawl dataset.
  • Common policy: Many AI training datasets start with Common Crawl. Disallow CCBot to remove your content from that pipeline.

Meta-ExternalAgent

  • Owner: Meta
  • Purpose: Ingestion for Meta AI products.
  • Common policy: Disallow to opt out of Meta AI training.

FacebookBot

  • Owner: Meta
  • Purpose: Translation training data.
  • Common policy: Niche. Most sites can ignore.

DuckAssistBot

  • Owner: DuckDuckGo
  • Purpose: Indexing for the DuckAssist answer feature.
  • Common policy: Allow if you want DuckDuckGo AI visibility.

Amazonbot

  • Owner: Amazon
  • Purpose: Alexa, Amazon AI services.
  • Common policy: Allow unless you specifically want to block Amazon AI ingestion.

Diffbot

  • Owner: Diffbot
  • Purpose: Knowledge graph crawler. Powers many third-party AI knowledge products.
  • Common policy: Allow unless you want to opt out of third-party AI ingestion.

SEO crawlers (not AI, but easy to confuse)

These are not AI bots. They are SEO tools that crawl for backlink and competitive analysis. They do not feed AI training. Listed here only because they often appear in robots.txt files alongside AI bots.

  • AhrefsBot (Ahrefs)
  • SemrushBot (Semrush)
  • MJ12bot (Majestic)
  • DotBot (Moz)

Most SEO professionals allow all of them. Sites that block them lose access to the corresponding tool's data about themselves.

Three policy templates

Most public content sites

Allow search, block training:

User-agent: *
Allow: /
Content-Signal: ai-train=no, search=yes, ai-input=no

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Sitemap: https://yourdomain/sitemap.xml

Maximum AI visibility

Most sites should not pick this, but here it is:

User-agent: *
Allow: /
Content-Signal: ai-train=yes, search=yes, ai-input=yes

Sitemap: https://yourdomain/sitemap.xml

Block everything

Rare. Removes you from agentic discovery entirely:

User-agent: *
Disallow: /
Content-Signal: ai-train=no, search=no, ai-input=no

How to verify

After deploying, test with the robots.txt tester. Paste your file, choose a bot user agent, and a URL path. Confirm the verdict matches your intent for each major bot.
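You can also script this check with Python's standard-library robots.txt parser before deploying. A minimal sketch, using an abbreviated version of the "allow search, block training" template above (the stdlib parser ignores fields it does not recognize, such as Content-Signal, so those lines are omitted here; the bot names and path are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated "allow search, block training" policy
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Each (bot, path) pair is checked against the intended verdict.
checks = {
    ("Googlebot", "/article"): True,         # search stays allowed
    ("GPTBot", "/article"): False,           # training crawler blocked
    ("Google-Extended", "/article"): False,  # Gemini training blocked
    ("PerplexityBot", "/article"): True,     # falls through to the * group
}
for (agent, path), expected in checks.items():
    verdict = rp.can_fetch(agent, path)
    print(f"{agent:16} {path}: {'allow' if verdict else 'block'}")
    assert verdict == expected
```

The parser matches groups by substring, so a group named GPTBot also applies to longer agent strings that contain that token, which is how crawlers typically identify themselves in practice.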

Why this matters

A robots.txt with a wildcard * group and nothing else is no longer a real policy in 2026. AI crawlers look for a group naming their specific user agent first and fall back to * only when none exists, so be explicit about each one. The work takes fifteen minutes; the decisions last for years.

Build your file with the robots.txt Generator.