Monitor AI bots in your server logs
Search Console tells you what Googlebot crawled. There is no equivalent dashboard for GPTBot, ClaudeBot, or PerplexityBot. The data exists — it is in your access logs — but you have to extract it yourself.
This is the practical recipe: which fields to capture, how to identify AI bots, and what to do with the data.
What you want to know
Three questions every site should be able to answer about AI bots in 2026:
- Which AI bots crawl my site, how often, and which pages?
- Are they getting clean 200 responses or hitting errors?
- Are unverified clients impersonating major AI bots?
If you cannot answer these from your current logging, you have a blind spot.
Required log fields
Many default log formats are missing fields you will want. The minimum:
- IP (or the x-forwarded-for chain)
- Timestamp
- Method
- Path (with query string)
- Status code
- Bytes sent
- User-Agent
- Referer
- Response time
For agent-specific analysis, also capture:
- The Accept header (to detect markdown negotiation)
- The X-Bot-Verified custom header you set after reverse DNS verification (if you implemented that)
Most modern reverse proxies and CDNs (Nginx, Caddy, CloudFront, Fastly) support custom log formats. Adopt one.
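If you parse those logs into a database for the queries later in this piece, the schema only needs a handful of columns. A minimal sketch in Postgres-flavored SQL; the column names (bot_name, bot_verified, and so on) are illustrative and simply match the queries below:

-- Illustrative parsed-access-log table; adjust names and types to your own pipeline.
CREATE TABLE access_logs (
    timestamp        timestamptz NOT NULL,
    ip               inet,
    method           text,
    path             text,        -- keep the query string
    status           smallint,
    bytes_sent       bigint,
    user_agent       text,
    referer          text,
    accept_header    text,        -- needed for the markdown-negotiation query
    response_time_ms integer,
    bot_name         text,        -- derived from user_agent; see the next section
    bot_verified     boolean      -- set from reverse DNS / Web Bot Auth, if you implemented that
);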
Identify AI bots from the User-Agent
A practical regex set for the major 2026 AI bots:
| Bot | Match pattern (case-insensitive) |
|---|---|
| GPTBot | \bGPTBot\b |
| ChatGPT-User | \bChatGPT-User\b |
| OAI-SearchBot | \bOAI-SearchBot\b |
| ClaudeBot | \bClaudeBot\b |
| anthropic-ai | \banthropic-ai\b |
| Claude-Web | \bClaude-Web\b |
| PerplexityBot | \bPerplexityBot\b |
| Google-Extended | \bGoogle-Extended\b |
| Applebot-Extended | \bApplebot-Extended\b |
| Bytespider | \bBytespider\b |
| CCBot | \bCCBot\b |
| Meta-ExternalAgent | \bMeta-ExternalAgent\b |
| DuckAssistBot | \bDuckAssistBot\b |
Use word boundaries so Bot does not match every bot in the world.
Combine with the User-Agent Lookup tool to test individual strings as you encounter them.
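If classification happens in SQL rather than in the log pipeline, the table above maps directly onto a CASE expression. A sketch in Postgres syntax, where ~* is the case-insensitive regex match and the word boundary is written \y rather than \b; it populates the bot_name column the queries below rely on:

-- Label rows whose User-Agent matches a known AI bot; unmatched rows stay NULL.
UPDATE access_logs
SET bot_name = CASE
    WHEN user_agent ~* '\yGPTBot\y'             THEN 'GPTBot'
    WHEN user_agent ~* '\yChatGPT-User\y'       THEN 'ChatGPT-User'
    WHEN user_agent ~* '\yOAI-SearchBot\y'      THEN 'OAI-SearchBot'
    WHEN user_agent ~* '\yClaudeBot\y'          THEN 'ClaudeBot'
    WHEN user_agent ~* '\yanthropic-ai\y'       THEN 'anthropic-ai'
    WHEN user_agent ~* '\yClaude-Web\y'         THEN 'Claude-Web'
    WHEN user_agent ~* '\yPerplexityBot\y'      THEN 'PerplexityBot'
    WHEN user_agent ~* '\yGoogle-Extended\y'    THEN 'Google-Extended'
    WHEN user_agent ~* '\yApplebot-Extended\y'  THEN 'Applebot-Extended'
    WHEN user_agent ~* '\yBytespider\y'         THEN 'Bytespider'
    WHEN user_agent ~* '\yCCBot\y'              THEN 'CCBot'
    WHEN user_agent ~* '\yMeta-ExternalAgent\y' THEN 'Meta-ExternalAgent'
    WHEN user_agent ~* '\yDuckAssistBot\y'      THEN 'DuckAssistBot'
END
WHERE bot_name IS NULL;
Running this once per load keeps the later queries cheap; a view or generated column works just as well.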
Distinguish bots from impersonators
A claim is not a verification. To separate real bots from impersonators, you need the verification described in the bot verification guide.
Without verification, your numbers are upper bounds. With verification, you can produce two metrics per bot:
- Claimed traffic: requests with the matching User-Agent regardless of source.
- Verified traffic: requests with the matching User-Agent that pass reverse DNS or Web Bot Auth.
The gap between the two is a useful security signal.
Five queries worth running weekly
1. Top AI bots by request count
SELECT
bot_name,
COUNT(*) AS requests,
COUNT(DISTINCT path) AS unique_paths,
AVG(response_time_ms) AS avg_response_ms,
SUM(CASE WHEN status >= 400 THEN 1 ELSE 0 END) AS errors
FROM access_logs
WHERE bot_name IS NOT NULL
AND timestamp >= now() - interval '7 days'
GROUP BY bot_name
ORDER BY requests DESC;
2. Most-fetched pages per bot
SELECT
bot_name,
path,
COUNT(*) AS requests
FROM access_logs
WHERE bot_name = 'GPTBot'
AND timestamp >= now() - interval '7 days'
GROUP BY bot_name, path
ORDER BY requests DESC
LIMIT 20;
This is the single most useful query. It tells you which content AI vendors prioritize. Often surprising.
3. Pages returning errors to AI bots
SELECT
bot_name,
status,
path,
COUNT(*) AS errors
FROM access_logs
WHERE bot_name IS NOT NULL
AND status >= 400
AND timestamp >= now() - interval '7 days'
GROUP BY bot_name, status, path
ORDER BY errors DESC
LIMIT 30;
A 404 served to GPTBot is content that will not appear in OpenAI's view of your site. Fix the high-error pages first.
4. Markdown negotiation usage
SELECT
bot_name,
COUNT(*) AS markdown_requests
FROM access_logs
WHERE accept_header LIKE '%text/markdown%'
AND bot_name IS NOT NULL
AND timestamp >= now() - interval '30 days'
GROUP BY bot_name
ORDER BY markdown_requests DESC;
If you ship markdown content negotiation, this tells you whether bots are actually using it.
5. Impersonation rate
SELECT
bot_name,
COUNT(*) AS claimed,
SUM(CASE WHEN bot_verified THEN 1 ELSE 0 END) AS verified,
COUNT(*) - SUM(CASE WHEN bot_verified THEN 1 ELSE 0 END) AS unverified
FROM access_logs
WHERE bot_name IS NOT NULL
AND timestamp >= now() - interval '7 days'
GROUP BY bot_name
ORDER BY unverified DESC;
The unverified column shows the impersonation pressure on each bot identity. For the popular bots it is often higher than the verified column.
A minimal dashboard
If you do nothing else, build a single dashboard with three panels:
- Daily request count by bot for the last 30 days. Watch for sudden spikes (often crawl bursts after a release) or drops (often a robots.txt change broke something).
- Top 20 pages crawled by AI bots for the last 7 days. Tells you what AI vendors find interesting on your site.
- Errors served to AI bots for the last 7 days. The remediation list.
Three panels. Maybe a hundred lines of SQL or Vector config. Enough to know what is happening.
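For the first panel, the daily series is one GROUP BY away from query 1. A sketch against the same assumed schema:

-- Panel 1: requests per day per AI bot, last 30 days.
SELECT
    date_trunc('day', timestamp) AS day,
    bot_name,
    COUNT(*) AS requests
FROM access_logs
WHERE bot_name IS NOT NULL
  AND timestamp >= now() - interval '30 days'
GROUP BY day, bot_name
ORDER BY day, requests DESC;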
Tools that help
If you do not want to build it from scratch:
- GoAccess for Nginx/Apache logs. Fast, terminal-based, has bot detection.
- Vector + Grafana for structured log pipelines.
- AWS Athena if your logs are already in S3.
- Cloudflare logs include a WAF field plus Cloudflare's own bot management classification.
For ad-hoc investigation, the User-Agent Lookup tool is a useful sanity check on any specific UA you encounter.
What good looks like
A site with healthy AI bot observability in 2026:
- Knows the top 5 AI bots by traffic.
- Has zero 4xx errors served to verified bots on canonical content.
- Spots impersonation spikes within a day.
- Reviews crawl patterns monthly to catch regressions.
This is achievable with a couple of days of work and a small dashboard. The compounding return is large because every other agentic SEO decision depends on knowing what the bots actually see.
Why this matters
You cannot improve what you cannot measure. Most teams in 2026 are flying blind on AI bot traffic. Getting basic observability in place is a one-week project that pays off whenever any AI optimization decision needs to be made. Start with a regex over your access logs. Iterate from there.