Verify the bot is real: User-Agent is not auth
Every popular AI crawler has been impersonated. The User-Agent header is a self-asserted string and any client can claim any value. If your site treats User-Agent: GPTBot as proof that the request is actually from OpenAI, you have a security and analytics gap.
This is the practical stack for verifying AI crawlers in 2026: reverse DNS, IP allow lists, and the new Web Bot Auth signed-request standard.
The threat
Three classes of impersonation are common today.
- Black-hat scraping dressed up as a friendly crawler to bypass rate limits.
- Researchers and competitors spoofing GPTBot to see what OpenAI sees.
- Adversarial systems that scrape behind a legitimate-looking UA to harvest content for downstream resale.
Trusting the UA alone means all three get the same access as the real bot. That probably is not what you want.
Layer 1: reverse DNS verification
The classic technique. Major crawlers publish PTR records for their IP ranges. The verification is:
- Take the request IP.
- Do a reverse DNS (PTR) lookup on it. Is the resulting hostname in the vendor's domain?
- Do a forward DNS lookup on that hostname. Does it resolve back to the original IP?
Both directions must match. Without forward verification, an attacker who controls a PTR record could spoof.
Quick reference for major bots in 2026:
| Bot | Verification domain or method |
|---|---|
| Googlebot, Google-Extended | googlebot.com, google.com |
| Bingbot | search.msn.com |
| Applebot, Applebot-Extended | applebot.apple.com |
| GPTBot | OpenAI publishes IP ranges in JSON form |
| ClaudeBot | Anthropic publishes IP ranges |
The vendor docs are the source of truth; the domains and ranges change occasionally.
Code sketch
```ts
import { lookup, reverse } from "node:dns/promises";

async function verifyBotIp(ip: string, expectedDomain: string): Promise<boolean> {
  try {
    // Reverse lookup: the PTR hostname must be the vendor domain or a subdomain of it.
    // Requiring a "." boundary prevents "notgooglebot.com" from matching "googlebot.com".
    const ptrs = await reverse(ip);
    const matched = ptrs.find(
      (host) => host === expectedDomain || host.endsWith(`.${expectedDomain}`)
    );
    if (!matched) return false;

    // Forward-confirm: the PTR hostname must resolve back to the original IP.
    // A host can have multiple A/AAAA records, so check all of them.
    const addresses = await lookup(matched, { all: true });
    return addresses.some((a) => a.address === ip);
  } catch {
    return false;
  }
}
```

In production, cache the result for a few hours and rate limit the verifications to avoid DNS storms.
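A minimal sketch of that cache, assuming a single-process deployment where an in-memory Map is enough (swap in Redis or your CDN's KV store if you run multiple instances). The `verifyBotIpCached` name and the 4-hour TTL are choices here, not part of any vendor spec:

```ts
// Hypothetical TTL cache wrapped around verifyBotIp from the sketch above.
const verificationCache = new Map<string, { verified: boolean; expires: number }>();
const TTL_MS = 4 * 60 * 60 * 1000; // 4 hours

async function verifyBotIpCached(ip: string, expectedDomain: string): Promise<boolean> {
  const key = `${ip}|${expectedDomain}`;
  const hit = verificationCache.get(key);
  if (hit && hit.expires > Date.now()) return hit.verified;

  const verified = await verifyBotIp(ip, expectedDomain);
  verificationCache.set(key, { verified, expires: Date.now() + TTL_MS });
  return verified;
}
```

Caching alone removes most of the DNS load, since a real crawler reuses a relatively small pool of IPs.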
Layer 2: IP allow lists
Some vendors publish their IP ranges as JSON. OpenAI does this for GPTBot, ChatGPT-User, and OAI-SearchBot. Anthropic does it for ClaudeBot. Apple and Google have similar publications.
The pattern: fetch the published JSON on a schedule (daily), parse the CIDR ranges, and check incoming IPs against them.
```ts
function ipInRange(ip: string, cidrs: string[]): boolean {
  // ... CIDR matching logic
}
```

Pair it with reverse DNS for defense in depth.
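A minimal IPv4-only sketch of that check, plus a fetcher for a published range file. The URL you pass in and the JSON shape assumed here (a `prefixes` array of objects with an `ipv4Prefix` field) are illustrative; check each vendor's docs for the current location and schema, and reach for a library such as ipaddr.js if you need IPv6:

```ts
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

// IPv4-only CIDR membership test.
function ipInRange(ip: string, cidrs: string[]): boolean {
  const ipInt = ipv4ToInt(ip);
  return cidrs.some((cidr) => {
    const [base, bits] = cidr.split("/");
    const mask = bits === "0" ? 0 : (~0 << (32 - Number(bits))) >>> 0;
    return (ipInt & mask) === (ipv4ToInt(base) & mask);
  });
}

// Refresh the published ranges on a schedule; uses global fetch (Node 18+).
// The JSON shape here is an assumption, not a documented contract.
async function fetchPublishedRanges(url: string): Promise<string[]> {
  const res = await fetch(url);
  const body = (await res.json()) as { prefixes?: { ipv4Prefix?: string }[] };
  return (body.prefixes ?? [])
    .map((p) => p.ipv4Prefix)
    .filter((p): p is string => Boolean(p));
}
```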
Layer 3: Web Bot Auth (cryptographic)
The newest piece in the stack. Web Bot Auth is an emerging standard where bots sign their requests with a cryptographic key. The site fetches the bot's public key (often via /.well-known/web-bot-auth-keys or a vendor directory), verifies the signature on the request, and accepts it as authenticated.
The advantages over DNS-based verification:
- No round trip to DNS.
- Signature binds to specific request fields, so replay attacks are harder.
- Works the same regardless of IP rotation.
The disadvantages:
- Adoption in 2026 is still partial. Not every bot signs.
- Key rotation is a complication.
When a vendor supports it, prefer Web Bot Auth and fall back to reverse DNS for older bots.
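For a sense of the shape, here is a stripped-down sketch of the signature check, assuming the bot signs only the `@authority` and `signature-agent` components with an Ed25519 key you have already fetched from its key directory. It skips the `created`/`expires` and `keyid` checks and the structured-field parsing a real implementation needs; in production, use a maintained HTTP Message Signatures (RFC 9421) library rather than this:

```ts
import { createPublicKey, verify as cryptoVerify } from "node:crypto";

// Heavily simplified RFC 9421 verification for an assumed Web Bot Auth profile:
// covered components ("@authority" "signature-agent"), Ed25519 keys.
function verifyWebBotAuth(opts: {
  authority: string;       // your site's host, e.g. "example.com"
  signatureAgent: string;  // raw Signature-Agent header value
  signatureInput: string;  // raw Signature-Input header value
  signature: string;       // raw Signature header value, e.g. "sig1=:BASE64:"
  publicKeyRaw: Buffer;    // 32-byte raw Ed25519 key from the bot's key directory
}): boolean {
  const params = opts.signatureInput.replace(/^[^=]+=/, ""); // drop the "sig1=" label
  const sigB64 = opts.signature.replace(/^[^=]+=/, "").replace(/^:|:$/g, "");

  // RFC 9421 signature base: one line per covered component, then @signature-params.
  const base = [
    `"@authority": ${opts.authority}`,
    `"signature-agent": ${opts.signatureAgent}`,
    `"@signature-params": ${params}`,
  ].join("\n");

  const key = createPublicKey({
    key: Buffer.concat([
      Buffer.from("302a300506032b6570032100", "hex"), // SPKI DER prefix for raw Ed25519
      opts.publicKeyRaw,
    ]),
    format: "der",
    type: "spki",
  });
  return cryptoVerify(null, Buffer.from(base), key, Buffer.from(sigB64, "base64"));
}
```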
The decision tree
For each incoming request that claims to be a major AI bot:
- If the bot supports Web Bot Auth and your stack is modern: verify the signature. Done.
- Otherwise: do reverse DNS verification. Cache the result.
- As a coarse first filter before either of those: check the IP against the published allow list.
If any of those succeed, treat the request as the claimed bot. If none succeed, treat it as an unknown client.
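Glued together, the decision tree looks roughly like this. The helper names (`ipInRange`, `verifyBotIpCached`) come from the earlier sketches, the `BotProfile` shape is hypothetical, and the Web Bot Auth check is injected as a callback since its inputs are request-specific:

```ts
type VerificationMethod = "web-bot-auth" | "reverse-dns" | "ip-list" | "none";

interface BotProfile {
  rdnsDomain?: string;         // e.g. "googlebot.com"
  cidrs?: string[];            // published ranges, refreshed on a schedule
  supportsWebBotAuth?: boolean;
}

// Hypothetical glue over the three layers; cheapest checks run first.
async function verifyClaimedBot(
  ip: string,
  profile: BotProfile,
  hasValidSignature: () => boolean
): Promise<VerificationMethod> {
  if (profile.supportsWebBotAuth && hasValidSignature()) return "web-bot-auth";
  if (profile.cidrs && ipInRange(ip, profile.cidrs)) return "ip-list";
  if (profile.rdnsDomain && (await verifyBotIpCached(ip, profile.rdnsDomain))) return "reverse-dns";
  return "none";
}
```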
What to do with verified vs unverified bots
Different policies for different verification states.
| Verification state | Common policy |
|---|---|
| Verified AI bot | Apply your robots.txt and Content-Signal rules |
| Unverified but UA claims AI bot | Rate limit, log, and (optionally) serve a 403 |
| Unknown UA | Standard human rate limit |
The middle row is the important one. A request that claims User-Agent: GPTBot but fails verification is impersonating; treat it more strictly than an honest unknown client.
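As a sketch, the table collapses to a small policy function; the `claimsAiBot` flag stands in for whatever your own UA lookup returns, and the `VerificationMethod` type comes from the decision-tree sketch above:

```ts
type Action = "allow" | "rate-limit" | "block";

// Hypothetical policy mapping; tune the unverified branch to your own risk tolerance.
function policyFor(claimsAiBot: boolean, method: VerificationMethod): Action {
  if (!claimsAiBot) return "allow";      // unknown UA: standard human rate limits apply elsewhere
  if (method !== "none") return "allow"; // verified bot: robots.txt / Content-Signal rules decide from here
  return "rate-limit";                   // claims to be an AI bot but failed verification; "block" is also defensible
}
```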
Logging and observability
Whatever your policy, log the verification result. Useful fields:
- IP
- Claimed UA
- Verification method (none, ip-list, reverse-dns, web-bot-auth)
- Verification result (verified, failed, skipped)
- Action taken (allow, rate-limit, block)
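Those fields as a structured log record, sketched as a TypeScript type; the field names are suggestions, not a standard:

```ts
interface BotVerificationLogEntry {
  ip: string;
  claimedUserAgent: string;
  verificationMethod: "none" | "ip-list" | "reverse-dns" | "web-bot-auth";
  verificationResult: "verified" | "failed" | "skipped";
  action: "allow" | "rate-limit" | "block";
  timestamp: string; // ISO 8601, so the weekly aggregation is a simple group-by
}
```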
Aggregate weekly. The first time you do this, you will see the spoofing volume on your site clearly. It is usually higher than expected.
What we recommend right now
For most sites:
- Allow verified major AI bots through unchanged. Treat them as user agents that bring traffic-equivalent value.
- Do reverse DNS verification on requests that claim to be GPTBot, ClaudeBot, Googlebot, Bingbot, or Applebot. Cache results for 4 hours.
- Rate limit unverified claims to 1 req/sec per IP. Log them.
- Add Web Bot Auth when your stack is ready. Prioritize the bots your traffic actually shows.
For sites with sensitive content, layer in a CDN-level WAF rule that requires verified status before access to specific paths.
Tooling
The User-Agent Lookup on AgentScan identifies a UA string against 45+ known agents. It is the first step before deciding what to verify. Combine it with the verification stack above for full coverage.
Why this matters
UA-based bot policy in 2026 is a polite suggestion. Real bot policy requires authentication. The work to add reverse DNS is small. The benefit, in cleaner analytics and reduced impersonation risk, compounds.
If you treat your robots.txt as an access control mechanism, layer in verification or your access control is theatrical.