Every AI bot user agent worth knowing in 2026
A working reference for every AI crawler your site is likely to see. Each entry covers the user-agent string, the company behind it, what it actually does, and the most common policy stance. Use it to write a robots.txt that does what you mean.
OpenAI
GPTBot
- Owner: OpenAI
- Purpose: Training data crawler for OpenAI models.
- Common policy: Disallow if you want to opt out of training.
- Notes: This is the bot that crawls public pages to build training datasets. Honors robots.txt. Distinct from the other OpenAI agents.
ChatGPT-User
- Owner: OpenAI
- Purpose: On-demand fetch when a ChatGPT user asks the assistant to read a URL.
- Common policy: Allow. The traffic is user-initiated and produces high-intent referrals.
OAI-SearchBot
- Owner: OpenAI
- Purpose: Indexing for OpenAI search products like ChatGPT search.
- Common policy: Allow if you want OpenAI search visibility. Disallow if you do not.
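Because the three OpenAI agents serve different purposes, they can be handled independently. A common split, sketched below (the explicit Allow lines are optional since allow is the default, but they document intent):

```txt
# Opt out of OpenAI training only
User-agent: GPTBot
Disallow: /

# Keep user-initiated fetches
User-agent: ChatGPT-User
Allow: /

# Keep OpenAI search indexing
User-agent: OAI-SearchBot
Allow: /
```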
Anthropic
ClaudeBot
- Owner: Anthropic
- Purpose: Primary content crawler. Used for training and retrieval.
- Common policy: Disallow to opt out of Anthropic training.
anthropic-ai
- Owner: Anthropic
- Purpose: Older user-agent identifier still seen in some traffic.
- Common policy: Match whatever you set for ClaudeBot for consistency.
Claude-Web
- Owner: Anthropic
- Purpose: On-demand fetch when a Claude user asks the assistant to read a URL.
- Common policy: Allow. User-initiated.
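To keep the Anthropic identifiers consistent, the two training-related tokens can share one group: robots.txt lets you stack multiple User-agent lines over a single rule set. A sketch of the matching policy:

```txt
# Opt out of Anthropic training; both tokens get the same rule
User-agent: ClaudeBot
User-agent: anthropic-ai
Disallow: /

# User-initiated fetches stay allowed
User-agent: Claude-Web
Allow: /
```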
Googlebot
- Owner: Google
- Purpose: Standard Google Search crawler.
- Common policy: Allow. Blocking it removes you from Google Search.
Google-Extended
- Owner: Google
- Purpose: Opt-out token for Gemini and other Google AI training. Does not affect Google Search.
- Common policy: Disallow if you want to opt out of Gemini training while keeping Google Search.
This is the most-misunderstood bot. People who want to "block Gemini" sometimes block Googlebot by mistake and lose all search traffic. Use Google-Extended.
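The correct pattern targets the training token, never the search crawler:

```txt
# Keep Google Search working
User-agent: Googlebot
Allow: /

# Opt out of Gemini training
User-agent: Google-Extended
Disallow: /
```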
AdsBot-Google
- Owner: Google
- Purpose: Auditing landing-page quality for Google Ads.
- Common policy: Allow if you run Google Ads.
Mediapartners-Google
- Owner: Google
- Purpose: Google AdSense crawler for ad targeting.
- Common policy: Allow if you run AdSense.
Apple
Applebot
- Owner: Apple
- Purpose: Apple Search and Siri.
- Common policy: Allow. Powers Spotlight and Siri results.
Applebot-Extended
- Owner: Apple
- Purpose: Opt-out token for Apple Intelligence training. Does not affect Apple Search.
- Common policy: Disallow to opt out of Apple Intelligence training.
Same pattern as Google-Extended: a separate token opts out of AI training while search keeps working.
Microsoft
Bingbot
- Owner: Microsoft
- Purpose: Bing Search.
- Common policy: Allow. Powers Bing, DuckDuckGo, Yahoo, and Microsoft Copilot results.
MSNBot-Media (legacy)
Largely deprecated in 2026. Bingbot covers most use cases now.
Other AI vendors
PerplexityBot
- Owner: Perplexity
- Purpose: Indexing content for Perplexity answers.
- Common policy: Allow if you want citations in Perplexity. Disallow if you do not.
- Notes: Perplexity is one of the most active answer engines. Allowing it usually produces visible referral traffic.
Bytespider
- Owner: ByteDance
- Purpose: Training data crawler for ByteDance models (Doubao and others).
- Common policy: Disallow if you do not want to feed ByteDance training. Has a history of aggressive crawling.
CCBot
- Owner: Common Crawl Foundation
- Purpose: Building the public Common Crawl dataset.
- Common policy: Many AI training datasets start with Common Crawl. Disallow CCBot to remove your content from that pipeline.
Meta-ExternalAgent
- Owner: Meta
- Purpose: Ingestion for Meta AI products.
- Common policy: Disallow to opt out of Meta AI training.
FacebookBot
- Owner: Meta
- Purpose: Translation training data.
- Common policy: Niche. Most sites can ignore.
DuckAssistBot
- Owner: DuckDuckGo
- Purpose: Indexing for the DuckAssist answer feature.
- Common policy: Allow if you want DuckDuckGo AI visibility.
Amazonbot
- Owner: Amazon
- Purpose: Alexa and Amazon AI services.
- Common policy: Allow unless you specifically want to block Amazon AI ingestion.
Diffbot
- Owner: Diffbot
- Purpose: Knowledge graph crawler. Powers many third-party AI knowledge products.
- Common policy: Allow unless you want to opt out of third-party AI ingestion.
SEO crawlers (not AI, but easy to confuse)
These are not AI bots. They are SEO tools that crawl for backlink and competitive analysis. They do not feed AI training. Listed here only because they often appear in robots.txt files alongside AI bots.
- AhrefsBot (Ahrefs)
- SemrushBot (Semrush)
- MJ12bot (Majestic)
- DotBot (Moz)
Most SEO professionals allow all of them. Sites that block them lose access to the corresponding tool's data about themselves.
Three policy templates
Most public content sites
Allow search, block training:
User-agent: *
Allow: /
Content-Signal: ai-train=no, search=yes, ai-input=no
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
Sitemap: https://yourdomain/sitemap.xml
Maximum AI visibility
Most sites should not pick this, but here it is:
User-agent: *
Allow: /
Content-Signal: ai-train=yes, search=yes, ai-input=yes
Sitemap: https://yourdomain/sitemap.xml
Block everything
Rare. Removes you from agentic discovery entirely:
User-agent: *
Disallow: /
Content-Signal: ai-train=no, search=no, ai-input=no
How to verify
After deploying, test with the robots.txt tester. Paste your file, pick a bot user agent and a URL path, and confirm the verdict matches your intent for each major bot.
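You can also check verdicts locally with Python's standard `urllib.robotparser`. A minimal sketch; the robots.txt content and path below are illustrative, not a recommendation:

```python
import urllib.robotparser

# Illustrative policy: block AI training bots, allow everything else.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

def verdicts(robots_txt, path, agents):
    """Return {agent: allowed?} for one path under the given robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, path) for agent in agents}

results = verdicts(
    ROBOTS_TXT,
    "/articles/example",
    ["Googlebot", "GPTBot", "Google-Extended", "ClaudeBot"],
)
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that `urllib.robotparser` ignores directives it does not understand, so a `Content-Signal` line will not affect its verdicts.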
Why this matters
A robots.txt with a wildcard * group and nothing else is no longer a real policy in 2026. AI vendors look for their specific user agent first. Be explicit about each one. The work is fifteen minutes. The decisions last for years.
Build your file with the robots.txt Generator.