robots.txt Validator Checklist: 9 Errors to Find Before Deploy

Validate robots.txt safely by checking location, syntax, crawler groups, blocked paths, sitemaps, and live Google verification.

A robots.txt validator is valuable only if it checks the failure modes that change crawler access. A valid-looking text file can still be hosted at the wrong location, target the wrong crawler, or block a high-value URL.

Use this checklist alongside path testing in the robots.txt Tester.

1. The file is not at the host root

For a URL on https://shop.example.com, the controlling file is:

https://shop.example.com/robots.txt

A file at /static/robots.txt, on a different subdomain, or served only from the http version of the host does not control crawling of the target HTTPS host.
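As a minimal sketch of the location rule, the helper below derives the robots.txt URL that governs a given page URL. The function name robots_url_for is hypothetical and exists only for illustration; it keeps the scheme, host, and port and replaces everything else.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the robots.txt URL on the same scheme, host, and port as page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Path, query, and fragment are discarded: the file must sit at the host root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Comparing the result for the http and https versions of a page shows that they resolve to two different files.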

2. The response is not usable text

Open the live URL. Look for a success response with text content, not a branded 404 page, login form, CDN denial, or redirect chain. Google Search Console's robots.txt report is the authoritative place to inspect Google's fetch outcome for your verified property.
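The manual check above could be approximated with a small triage function. This is a sketch under stated assumptions: the function name and its labels are hypothetical review aids, the text/plain expectation is a conservative assumption, and none of this reproduces how Google or any other crawler actually treats a response. It takes the status, Content-Type header, and body you already fetched, so it makes no network calls itself.

```python
def classify_robots_response(status, content_type, body):
    """Label a fetched robots.txt response for human review (hypothetical labels)."""
    if 300 <= status < 400:
        return "redirect"  # inspect the redirect target and chain manually
    if status != 200:
        return f"http-{status}"  # e.g. a 404 page or a CDN denial
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/plain":
        return "not-plain-text"
    head = body[:500].lower()
    if "<html" in head or "<!doctype" in head:
        return "html-body"  # e.g. a login form served with a text content type
    return "ok"
```

Anything other than "ok" is a prompt to look at the live URL and at the Search Console report, not a verdict.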

3. A wildcard rule blocks the entire public site

The following rule is a full-site crawl block:

User-agent: *
Disallow: /

That can be correct for a private staging host, but it is normally a deployment incident on a public marketing or content site.
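A pre-deploy guard for this case might look like the sketch below. The function name is hypothetical, and the check is a deliberately blunt heuristic: it flags any group that names * and contains Disallow: /, without weighing Allow rules in the same group.

```python
def wildcard_blocks_everything(robots_txt):
    """Return True if a group naming '*' contains the rule 'Disallow: /'."""
    agents = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                # A user-agent line after a rule starts a new group.
                agents, seen_rule = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False
```

Failing a public-site deployment when this returns True, while exempting staging hosts, is one way to catch the incident before crawlers do.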

4. A specific crawler group changes the expected result

Do not test only User-agent: *. This example allows general crawling but blocks Googlebot:

User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /

Always test priority paths with Googlebot, then check any AI crawlers you intentionally manage.
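The example above can be checked per crawler with Python's standard-library urllib.robotparser, as in this sketch. The URL and the second crawler name are placeholders. Note that this parser applies the first matching rule within a group and uses plain prefix matching, which is not the same as Google's longest-match evaluation; it is adequate here only because each group holds a single rule.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Ask the stdlib parser whether agent may fetch url."""
    return parser.can_fetch(agent, url)


# Placeholder URL and crawler names, for illustration only.
googlebot_ok = allowed("Googlebot", "https://example.com/pricing")
other_ok = allowed("SomeOtherBot", "https://example.com/pricing")
```

Here googlebot_ok is False while other_ok is True, which is exactly the difference a test limited to User-agent: * would miss.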

5. Duplicate groups are evaluated incorrectly

Google combines rules from multiple groups that match the same specific user agent. A correct validator or tester must evaluate both rules here:

User-agent: Googlebot
Disallow: /reports/

User-agent: Googlebot
Allow: /reports/public/

Test /reports/public/q2 with Googlebot. Both rules match, and the Allow rule wins because its path is longer and therefore more specific.
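The merge-then-longest-match behavior can be sketched as follows. This is a simplified illustration, not a full parser: the function names are hypothetical, user agents are compared by exact name, and paths use plain prefix matching with no wildcard support.

```python
def rules_for(robots_txt, agent):
    """Merge (is_allow, path) rules from every group naming agent; else use '*'."""
    groups = {}
    current, seen_rule = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                current, seen_rule = [], False  # a new group begins
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            if value:  # an empty value matches nothing
                for name in current:
                    groups[name].append((field == "allow", value))
    return groups.get(agent.lower(), groups.get("*", []))


def is_allowed(robots_txt, agent, path):
    """Longest matching prefix wins; on a tie, Allow wins. No match means allowed."""
    best_len, best_allow = -1, True
    for allow, prefix in rules_for(robots_txt, agent):
        if path.startswith(prefix):
            if len(prefix) > best_len or (len(prefix) == best_len and allow):
                best_len, best_allow = len(prefix), allow
    return best_allow


ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /reports/

User-agent: Googlebot
Allow: /reports/public/
"""
```

With this file, is_allowed(ROBOTS_TXT, "Googlebot", "/reports/public/q2") returns True and /reports/internal/q2 returns False. A tester that read only the first Googlebot group would report both paths as blocked.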

6. A wildcard or end-anchor is broader than intended

Patterns that use * (any sequence of characters) or $ (end of the URL) can match far more, or far less, than a single directory:

User-agent: *
Disallow: /*?preview=
Disallow: /*.pdf$

Create sample paths that should be allowed and blocked, then test both. Regression examples are more reliable than visually reviewing complex patterns.
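One way to keep such regression examples is a table of (pattern, path, expected) cases, as in this sketch. The helper names and sample paths are hypothetical. The translation treats * as "any sequence of characters" and a trailing $ as "end of URL", and it tests a single pattern in isolation rather than a whole file.

```python
import re


def pattern_to_regex(pattern):
    """Translate one robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """True if the pattern matches from the start of path."""
    return pattern_to_regex(pattern).match(path) is not None


# (pattern, sample path, expected match) -- sample paths are made up.
CASES = [
    ("/*?preview=", "/blog/post?preview=1", True),
    ("/*?preview=", "/blog/post", False),
    ("/*?preview=", "/blog/post?utm=x&preview=1", False),
    ("/*.pdf$", "/docs/guide.pdf", True),
    ("/*.pdf$", "/docs/guide.pdf?download=1", False),
]

failures = [case for case in CASES if matches(case[0], case[1]) != case[2]]
```

The third and fifth cases are the kind of surprise a visual review tends to miss: a preview parameter that is not first in the query string, and a PDF URL with a query string, both escape the rules.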

7. Rendering resources are blocked

Blocking paths containing required JavaScript, CSS, or images can make a public page harder for a search crawler to render correctly. If your framework serves critical assets under a path you disallow, test the relevant resources before deployment.

8. The sitemap line points to the wrong host

Check the sitemap value independently:

Sitemap: https://example.com/sitemap.xml

It should use the canonical host and expose indexable canonical URLs. A syntax-valid robots file cannot fix a broken sitemap.
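The host part of this check can be automated with a short sketch like the one below. The function name is hypothetical, and canonical_host is a value you supply. It inspects only the Sitemap lines in robots.txt; it does not fetch the sitemap or verify the URLs inside it.

```python
from urllib.parse import urlsplit


def sitemap_problems(robots_txt, canonical_host):
    """List Sitemap lines that are not absolute URLs on canonical_host."""
    problems = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "sitemap":
            continue
        url = value.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"not an absolute URL: {url}")
        elif parts.hostname != canonical_host.lower():
            problems.append(f"host {parts.hostname} is not {canonical_host}: {url}")
    return problems
```

An empty list means only that the sitemap lines point at the expected host; the sitemap itself still needs its own validation.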

9. robots.txt is used as an indexing removal tool

robots.txt controls crawling, not indexing. Google explicitly warns not to rely on it to hide a page from search results. If Google cannot fetch a blocked page, it also cannot observe page-level indexing directives on that page, such as a noindex robots meta tag.

Validation workflow

  1. Fetch the live root file for the correct host.
  2. Review it for unexpected global or crawler-specific blocks.
  3. Paste it into the robots.txt Tester.
  4. Test allowed and disallowed examples for Googlebot and managed AI crawlers.
  5. Confirm sitemap discovery.
  6. After publishing a change, review Google's report and request a recrawl when necessary.

References and tools