AgentScan
Strategy · 7 min read

Verify the bot is real: User-Agent is not auth

User-Agent strings are trivially spoofable. The 2026 stack for authenticating AI crawlers: reverse DNS, IP allow lists, and Web Bot Auth signed requests.


Every popular AI crawler has been impersonated. The User-Agent header is a self-asserted string and any client can claim any value. If your site treats User-Agent: GPTBot as proof that the request is actually from OpenAI, you have a security and analytics gap.

This is the practical stack for verifying AI crawlers in 2026: reverse DNS, IP allow lists, and the new Web Bot Auth signed-request standard.

The threat

Three classes of impersonation are common today.

  1. Black-hat scraping dressed up as a friendly crawler to bypass rate limits.
  2. Researchers and competitors spoofing GPTBot to see what OpenAI sees.
  3. Adversarial systems that scrape behind a legitimate-looking UA to harvest content for downstream resale.

Trusting the UA alone gives all three the same access as the real bot. That is probably not what you want.

Layer 1: reverse DNS verification

The classic technique. Major crawlers publish PTR records for their IP ranges. The verification is:

  1. Take the request IP.
  2. Reverse DNS lookup it. Is the result in the vendor's domain?
  3. Forward DNS lookup the PTR result. Does it resolve back to the original IP?

Both directions must match. Anyone who controls reverse DNS for their own IPs can publish a PTR record that claims to be in the vendor's domain; only the forward lookup, which queries the vendor's own zone, catches the lie.

Quick reference for major bots in 2026:

Bot                           Verification domain
Googlebot, Google-Extended    googlebot.com, google.com
Bingbot                       search.msn.com
Applebot, Applebot-Extended   applebot.apple.com
GPTBot                        (OpenAI publishes IP ranges in JSON form)
ClaudeBot                     (Anthropic publishes IP ranges)

The vendor docs are the source of truth. They change occasionally.

Code sketch

import { lookup, reverse } from "node:dns/promises";

async function verifyBotIp(ip: string, expectedDomain: string): Promise<boolean> {
  try {
    // Step 1: reverse lookup. Does the IP carry a PTR record in the vendor's domain?
    const ptrs = await reverse(ip);
    const matched = ptrs.find(
      // Require a dot boundary: "evilgooglebot.com" must not pass a
      // check for "googlebot.com".
      (host) => host === expectedDomain || host.endsWith(`.${expectedDomain}`),
    );
    if (!matched) return false;
    // Step 2: forward-confirm. The PTR hostname must resolve back to the original IP.
    const addresses = await lookup(matched, { all: true });
    return addresses.some(({ address }) => address === ip);
  } catch {
    return false; // DNS failure: treat as unverified
  }
}

In production, cache the result for a few hours and rate limit the verifications to avoid DNS storms.
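A minimal sketch of that caching layer: wrap any async verifier in an in-memory TTL cache. The `withTtlCache` name and shape are illustrative, not a fixed API; if you run multiple instances, use a shared store such as Redis instead of a per-process `Map`.

```typescript
// Wrap an async verifier so repeated requests from the same IP do not
// trigger repeated DNS lookups. Results are cached for ttlMs.
function withTtlCache(
  verify: (ip: string, domain: string) => Promise<boolean>,
  ttlMs = 4 * 60 * 60 * 1000, // 4 hours
) {
  const cache = new Map<string, { ok: boolean; expires: number }>();
  return async (ip: string, domain: string): Promise<boolean> => {
    const key = `${ip}|${domain}`;
    const hit = cache.get(key);
    if (hit && hit.expires > Date.now()) return hit.ok;
    const ok = await verify(ip, domain);
    cache.set(key, { ok, expires: Date.now() + ttlMs });
    return ok;
  };
}
```

Note that this caches negative results too, which is deliberate: a spoofer hammering you with a fake UA should not be able to force a DNS lookup per request.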

Layer 2: IP allow lists

Some vendors publish their IP ranges as JSON. OpenAI does this for GPTBot, ChatGPT-User, and OAI-SearchBot. Anthropic does it for ClaudeBot. Apple and Google have similar publications.

The pattern: fetch the published JSON on a schedule (daily), parse the CIDR ranges, and check incoming IPs against them.

function ipInRange(ip: string, cidrs: string[]): boolean {
  // ... CIDR matching logic
}

Pair it with reverse DNS for defense in depth.
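One possible IPv4-only implementation of that stub, as a sketch. IPv6 ranges, which some vendors also publish, need a separate path or a library; this version handles only dotted-quad addresses and `a.b.c.d/nn` CIDR strings.

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => ((acc << 8) | Number(octet)) >>> 0, 0);
}

// True if the IP falls inside any of the published CIDR ranges.
function ipInRange(ip: string, cidrs: string[]): boolean {
  const addr = ipToInt(ip);
  return cidrs.some((cidr) => {
    const [base, bitsStr] = cidr.split("/");
    const bits = Number(bitsStr);
    // Network mask: the top `bits` bits set. /0 matches everything.
    const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
    return (addr & mask) === (ipToInt(base) & mask);
  });
}
```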

Layer 3: Web Bot Auth (cryptographic)

The newest piece in the stack. Web Bot Auth is an emerging standard where bots sign their requests with a cryptographic key. The site fetches the bot's public key (often via /.well-known/web-bot-auth-keys or a vendor directory), verifies the signature on the request, and accepts it as authenticated.

The advantages over DNS-based verification:

  • No round trip to DNS.
  • Signature binds to specific request fields, so replay attacks are harder.
  • Works the same regardless of IP rotation.

The disadvantages:

  • Adoption in 2026 is still partial. Not every bot signs.
  • Key rotation is a complication.

When a vendor supports it, prefer Web Bot Auth and fall back to reverse DNS for older bots.

The decision tree

For each incoming request that claims to be a major AI bot:

  1. If the bot supports Web Bot Auth and your stack is modern: verify the signature. Done.
  2. Otherwise: do reverse DNS verification. Cache the result.
  3. As a cheap pre-filter before either of the above: check the IP against the published allow list.

If any of those succeed, treat the request as the claimed bot. If none succeed, treat it as an unknown client.
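The tree above can be sketched as a small orchestrator. The three checks are passed in as async predicates so the sketch stays self-contained; wire in real implementations of each layer. The `classifyClaimedBot` name and shape are illustrative.

```typescript
type Method = "web-bot-auth" | "ip-list" | "reverse-dns";

// Run the layers in order: strongest proof first, then the cheap
// allow-list check, then reverse DNS. Report which layer, if any,
// authenticated the request.
async function classifyClaimedBot(
  checks: Record<Method, () => Promise<boolean>>,
): Promise<Method | "unverified"> {
  if (await checks["web-bot-auth"]()) return "web-bot-auth";
  if (await checks["ip-list"]()) return "ip-list";
  if (await checks["reverse-dns"]()) return "reverse-dns";
  return "unverified";
}
```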

What to do with verified vs unverified bots

Different policies for different verification states.

Verification state                  Common policy
Verified AI bot                     Apply your robots.txt and Content-Signal rules
Unverified, but UA claims AI bot    Rate limit, log, and (optionally) serve a 403
Unknown UA                          Standard human rate limit

The middle row is the important one. A request that claims User-Agent: GPTBot but fails verification is impersonating; treat it more strictly than an honest unknown client.

Logging and observability

Whatever your policy, log the verification result. Useful fields:

  • IP
  • Claimed UA
  • Verification method (none, ip-list, reverse-dns, web-bot-auth)
  • Verification result (verified, failed, skipped)
  • Action taken (allow, rate-limit, block)
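As a sketch, those fields as a typed log record emitted one JSON object per line; the names are illustrative, and the timestamp field is an addition not in the list above.

```typescript
interface BotVerificationEvent {
  ip: string;
  claimedUserAgent: string;
  method: "none" | "ip-list" | "reverse-dns" | "web-bot-auth";
  result: "verified" | "failed" | "skipped";
  action: "allow" | "rate-limit" | "block";
  at: string; // ISO 8601 timestamp (illustrative extra field)
}

// One JSON object per line keeps downstream aggregation trivial.
function logEvent(event: BotVerificationEvent): string {
  return JSON.stringify(event);
}
```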

Aggregate weekly. The first time you do this, you will see the spoofing volume on your site clearly. It is usually higher than expected.

What we recommend right now

For most sites:

  1. Allow all real major AI bots unmodified. Treat them as user agents that bring traffic-equivalent value.
  2. Do reverse DNS verification on requests that claim to be GPTBot, ClaudeBot, Googlebot, Bingbot, or Applebot. Cache results for 4 hours.
  3. Rate limit unverified claims to 1 req/sec per IP. Log them.
  4. Add Web Bot Auth when your stack is ready. Prioritize the bots your traffic actually shows.

For sites with sensitive content, layer in a CDN-level WAF rule that requires verified status before access to specific paths.

Tooling

The User-Agent Lookup on AgentScan identifies a UA string against 45+ known agents. It is the first step before deciding what to verify. Combine it with the verification stack above for full coverage.

Why this matters

UA-based bot policy in 2026 is a polite suggestion. Real bot policy requires authentication. The work to add reverse DNS is small. The benefit, in cleaner analytics and reduced impersonation risk, compounds.

If you treat your robots.txt as an access control mechanism, layer in verification or your access control is theatrical.