
Every AI bot user agent worth knowing in 2026

The complete reference: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and more. Owner, purpose, and recommended directive for each.


A working reference for every AI crawler your site is likely to see. Each entry covers the user-agent string, the company behind it, what it actually does, and the most common policy stance. Use it to write a robots.txt that does what you mean.

OpenAI

GPTBot

  • Owner: OpenAI
  • Purpose: Training data crawler for OpenAI models.
  • Common policy: Disallow if you want to opt out of training.
  • Notes: This is the bot that crawls public pages to build training datasets. Honors robots.txt. Distinct from the other OpenAI agents.

ChatGPT-User

  • Owner: OpenAI
  • Purpose: On-demand fetch when a ChatGPT user asks the assistant to read a URL.
  • Common policy: Allow. The traffic is user-initiated and produces high-intent referrals.

OAI-SearchBot

  • Owner: OpenAI
  • Purpose: Indexing for OpenAI search products like ChatGPT search.
  • Common policy: Allow if you want OpenAI search visibility. Disallow if you do not.
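To make the split between the three OpenAI agents concrete, here is a robots.txt fragment for one common stance: opt out of training while keeping user-initiated fetches and search indexing. This is a sketch, not a recommendation; adjust each directive to your own policy.

```txt
# Opt out of OpenAI training, keep user fetches and search indexing
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /
```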

Anthropic

ClaudeBot

  • Owner: Anthropic
  • Purpose: Primary content crawler. Used for training and retrieval.
  • Common policy: Disallow to opt out of Anthropic training.

anthropic-ai

  • Owner: Anthropic
  • Purpose: Older user-agent identifier still seen in some traffic.
  • Common policy: Match whatever you set for ClaudeBot for consistency.

Claude-Web

  • Owner: Anthropic
  • Purpose: On-demand fetch when a Claude user asks the assistant to read a URL.
  • Common policy: Allow. User-initiated.
  • Notes: Anthropic's documentation also lists Claude-User for user-initiated fetches; apply the same policy to both identifiers.

Google

Googlebot

  • Owner: Google
  • Purpose: Standard Google Search crawler.
  • Common policy: Allow. Blocking it removes you from Google Search.

Google-Extended

  • Owner: Google
  • Purpose: Opt-out token for Gemini and other Google AI training. Does not affect Google Search.
  • Common policy: Disallow if you want to opt out of Gemini training while keeping Google Search.

This is the most misunderstood token on the list. Site owners who want to "block Gemini" sometimes block Googlebot by mistake and lose all of their search traffic. Use Google-Extended instead.
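In robots.txt terms, the safe pattern is two separate groups, one per token. A minimal sketch:

```txt
# Keep Google Search indexing
User-agent: Googlebot
Allow: /

# Opt out of Gemini and other Google AI training
User-agent: Google-Extended
Disallow: /
```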

AdsBot-Google

  • Owner: Google
  • Purpose: Auditing landing-page quality for Google Ads.
  • Common policy: Allow if you run Google Ads.

Mediapartners-Google

  • Owner: Google
  • Purpose: Google AdSense crawler for ad targeting.
  • Common policy: Allow if you run AdSense.

Apple

Applebot

  • Owner: Apple
  • Purpose: Apple Search and Siri.
  • Common policy: Allow. Powers Spotlight and Siri results.

Applebot-Extended

  • Owner: Apple
  • Purpose: Opt-out token for Apple Intelligence training. Does not affect Apple Search.
  • Common policy: Disallow to opt out of Apple Intelligence training.

Same pattern as Google-Extended: a separate token governs AI training, so search keeps working.

Microsoft

Bingbot

  • Owner: Microsoft
  • Purpose: Bing Search.
  • Common policy: Allow. Powers Bing, DuckDuckGo, Yahoo, and Microsoft Copilot results.

MSNBot-Media (legacy)

Largely deprecated in 2026. Bingbot covers most use cases now.

Other AI vendors

PerplexityBot

  • Owner: Perplexity
  • Purpose: Indexing content for Perplexity answers.
  • Common policy: Allow if you want citations in Perplexity. Disallow if you do not.
  • Notes: Perplexity is one of the most active answer engines. Allowing it usually produces visible referral traffic.

Bytespider

  • Owner: ByteDance
  • Purpose: Training data crawler for ByteDance models (Doubao and others).
  • Common policy: Disallow if you do not want to feed ByteDance training. Has a history of aggressive crawling.

CCBot

  • Owner: Common Crawl Foundation
  • Purpose: Building the public Common Crawl dataset.
  • Common policy: Many AI training datasets start with Common Crawl. Disallow CCBot to remove your content from that pipeline.

Meta-ExternalAgent

  • Owner: Meta
  • Purpose: Ingestion for Meta AI products.
  • Common policy: Disallow to opt out of Meta AI training.

FacebookBot

  • Owner: Meta
  • Purpose: Translation training data.
  • Common policy: Niche. Most sites can ignore.

DuckAssistBot

  • Owner: DuckDuckGo
  • Purpose: Indexing for the DuckAssist answer feature.
  • Common policy: Allow if you want DuckDuckGo AI visibility.

Amazonbot

  • Owner: Amazon
  • Purpose: Alexa, Amazon AI services.
  • Common policy: Allow unless you specifically want to block Amazon AI ingestion.

Diffbot

  • Owner: Diffbot
  • Purpose: Knowledge graph crawler. Powers many third-party AI knowledge products.
  • Common policy: Allow unless you want to opt out of third-party AI ingestion.

SEO crawlers (not AI, but easy to confuse)

These are not AI bots. They are SEO tools that crawl for backlink and competitive analysis. They do not feed AI training. Listed here only because they often appear in robots.txt files alongside AI bots.

  • AhrefsBot (Ahrefs)
  • SemrushBot (Semrush)
  • MJ12bot (Majestic)
  • DotBot (Moz)

Most SEO professionals allow all of them. Sites that block them lose access to the corresponding tool's data about themselves.

Three policy templates

Most public content sites

Allow search, block training:

User-agent: *
Allow: /
Content-Signal: ai-train=no, search=yes, ai-input=no

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Sitemap: https://yourdomain/sitemap.xml

Maximum AI visibility

Most sites should not pick this, but here it is:

User-agent: *
Allow: /
Content-Signal: ai-train=yes, search=yes, ai-input=yes

Sitemap: https://yourdomain/sitemap.xml

Block everything

Rare. Removes you from agentic discovery entirely:

User-agent: *
Disallow: /
Content-Signal: ai-train=no, search=no, ai-input=no

How to verify

After deploying, test with the robots.txt tester. Paste your file, choose a bot user agent, and a URL path. Confirm the verdict matches your intent for each major bot.
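You can also script this check with Python's standard-library robots.txt parser before deploying. A minimal sketch, using an abbreviated version of the "allow search, block training" template above (the stdlib parser ignores fields it does not recognize, such as Content-Signal, so those lines are omitted here; the bot names and path are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated "allow search, block training" policy
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Each (bot, path) pair is checked against the intended verdict.
checks = {
    ("Googlebot", "/article"): True,         # search stays allowed
    ("GPTBot", "/article"): False,           # training crawler blocked
    ("Google-Extended", "/article"): False,  # Gemini training blocked
    ("PerplexityBot", "/article"): True,     # falls through to the * group
}
for (agent, path), expected in checks.items():
    verdict = rp.can_fetch(agent, path)
    print(f"{agent:16} {path}: {'allow' if verdict else 'block'}")
    assert verdict == expected
```

The parser matches groups by substring, so a group named GPTBot also applies to longer agent strings that contain that token, which is how crawlers typically identify themselves in practice.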

Why this matters

A robots.txt with a wildcard * group and nothing else is no longer a real policy in 2026. AI crawlers look for a group naming their specific user agent first and fall back to * only when none exists, so be explicit about each one. The work takes fifteen minutes; the decisions last for years.

Build your file with the robots.txt Generator.