robots.txt Checker: How to Audit a Live File in 10 Minutes

A practical robots.txt checker workflow for fetch status, rule testing, sitemap discovery, and Google Search Console verification.

If you are searching for a robots.txt checker, you probably suspect that a live site is blocking crawlers or serving an outdated file. An audit needs two separate checks: confirm what the server actually returns, then evaluate whether important paths are allowed for the crawlers you care about.

Step 1: find the correct file

robots.txt applies per protocol and host. The file for https://www.example.com/products/1 is:

https://www.example.com/robots.txt

It is not inherited from https://example.com/robots.txt, and a staging subdomain has its own file. Audit the exact canonical host shown in your search data.
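The per-host rule above can be sketched in a few lines of Python. The helper name robots_url is illustrative, not part of any tool mentioned here; it keeps the protocol and host of a page URL and swaps in the /robots.txt path.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs page_url (same protocol, same host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/products/1"))
print(robots_url("https://staging.example.com/products/1?ref=nav"))
```

Each distinct host yields a distinct file, which is why the www and staging URLs above do not share rules.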

The live request should return:

  • HTTP 200 when a file is intentionally published.
  • A plain-text body with readable User-agent, Allow, Disallow, and optional Sitemap fields.
  • No HTML error page, application redirect loop, or authentication challenge.
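The three delivery checks above can be approximated in code. This is a rough sketch, not a full validator: the function name delivery_problems and its heuristics (a text/plain Content-Type, no leading HTML markup, at least one recognizable field) are assumptions chosen for illustration. The status, header, and body can come from any HTTP client.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap"}


def delivery_problems(status, content_type, body):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"expected text/plain, got {content_type or 'no Content-Type'}")
    head = body.lstrip().lower()
    if head.startswith("<!doctype") or head.startswith("<html"):
        problems.append("body looks like an HTML page, not robots.txt")
    # Collect the field names that appear before a colon on any line.
    fields = {
        line.split(":", 1)[0].strip().lower()
        for line in body.splitlines()
        if ":" in line
    }
    if not fields & KNOWN_FIELDS:
        problems.append("no recognizable robots.txt fields found")
    return problems


print(delivery_problems(200, "text/plain; charset=utf-8", "User-agent: *\nDisallow:\n"))
print(delivery_problems(404, "text/html", "<!DOCTYPE html><html></html>"))
```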

Step 2: copy the live body into a checker

The robots.txt Tester and Checker evaluates pasted rules locally. Copying the response body is deliberate: it lets you test exactly what crawlers receive without granting another service access to your site.

Start with a small matrix:

Crawler       | Path                      | Expected result
--------------|---------------------------|-------------------------------------
Googlebot     | /                         | Allowed for an indexable public site
Googlebot     | A key landing page        | Allowed
Googlebot     | An admin or internal path | Usually disallowed
GPTBot        | A public article          | Matches your AI training policy
PerplexityBot | A public article          | Matches your AI answer-engine policy

An unexpected result deserves investigation before any request for indexing.
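A matrix like this can also be written as data and run against the pasted body. The sketch below uses Python's standard urllib.robotparser; the sample body, paths, and expected values are illustrative assumptions, not a recommended policy. The standard-library parser may differ from Google's matching in edge cases such as wildcard patterns and rule precedence, so treat its verdicts as a first pass.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical live body, pasted from the server response.
ROBOTS_BODY = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

# (crawler, path, expected_allowed) rows; adjust to your own policy.
MATRIX = [
    ("Googlebot", "/", True),
    ("Googlebot", "/pricing", True),
    ("Googlebot", "/admin/users", False),
    ("GPTBot", "/blog/post", False),
    ("PerplexityBot", "/blog/post", True),
]


def run_matrix(body, matrix):
    """Return the rows whose actual verdict differs from the expected one."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    mismatches = []
    for agent, path, expected in matrix:
        actual = parser.can_fetch(agent, path)
        if actual != expected:
            mismatches.append((agent, path, expected, actual))
    return mismatches


for row in run_matrix(ROBOTS_BODY, MATRIX):
    print("unexpected verdict:", row)
```

An empty result means every row matched its expectation; any printed row is a candidate for investigation.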

Step 3: check the patterns that cause accidental blocks

These rules commonly suppress useful crawling:

User-agent: *
Disallow: /

That blocks every crawler unless a more specific group applies. Another subtle case is a broad folder rule:

User-agent: *
Disallow: /assets/

If indexable pages depend on CSS, JavaScript, or images under a blocked folder, search engines may render an incomplete version of the page.

Also inspect repeated crawler groups. Under Google's documented behavior, multiple groups for the same specific user agent are combined when evaluating its paths. A checker that reads only the first group can produce a false answer.
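The difference between reading one group and combining repeated groups can be shown with a small parser. This is a simplified sketch under stated assumptions: it matches plain path prefixes only (no * or $ wildcards), picks the longest matching rule, and lets Allow win a tie, which is how Google documents precedence. The function names and the sample body are hypothetical.

```python
def parse_groups(body):
    """Map each lowercased user-agent token to its merged list of (directive, path) rules."""
    groups = {}
    current_agents = []
    reading_agents = False
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])  # repeated groups share one list
            reading_agents = True
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups, agent, path):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    rules = groups.get(agent.lower())
    if rules is None:
        rules = groups.get("*", [])
    best = ("allow", "")
    for directive, pattern in rules:
        if not pattern or not path.startswith(pattern):
            continue  # an empty pattern matches nothing
        longer = len(pattern) > len(best[1])
        tie_allow = len(pattern) == len(best[1]) and directive == "allow"
        if longer or tie_allow:
            best = (directive, pattern)
    return best[0] == "allow"


BODY = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/

User-agent: Googlebot
Disallow: /drafts/
"""

groups = parse_groups(BODY)
print(is_allowed(groups, "Googlebot", "/drafts/post"))  # second Googlebot group applies
print(is_allowed(groups, "Googlebot", "/tmp/file"))     # the * group does not apply
```

A reader that stopped at the first Googlebot group would report /drafts/post as allowed, which is the false answer described above.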

Step 4: inspect sitemap discovery

A well-maintained file typically advertises the canonical sitemap:

Sitemap: https://www.example.com/sitemap.xml

Open that URL separately and ensure it lists canonical, crawlable pages. Use the sitemap.xml Validator if you are also troubleshooting discovery or stale lastmod values.
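Both halves of this step can be sketched with the standard library: pull the Sitemap lines out of the robots.txt body, then list the loc entries of a sitemap document you have downloaded. The function names and sample documents are illustrative assumptions; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_lines(robots_body):
    """Return every URL advertised on a Sitemap line of a robots.txt body."""
    urls = []
    for raw in robots_body.splitlines():
        field, sep, value = raw.strip().partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def listed_locations(sitemap_xml):
    """Return the <loc> values from a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


ROBOTS_SAMPLE = "User-agent: *\nDisallow:\nSitemap: https://www.example.com/sitemap.xml\n"
SITEMAP_SAMPLE = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/</loc></url>"
    "<url><loc>https://www.example.com/pricing</loc></url>"
    "</urlset>"
)

print(sitemap_lines(ROBOTS_SAMPLE))
print(listed_locations(SITEMAP_SAMPLE))
```

Each listed location can then be checked against the robots.txt rules to confirm the sitemap does not advertise blocked pages.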

Step 5: verify with Search Console

The Google Search Console robots.txt report shows files Google found for eligible properties, the most recent crawl, and reported warnings or errors. It also allows a recrawl request when an urgent robots.txt correction has been deployed.

Google states that robots.txt manages crawling; it should not be used to hide a page from Google Search. If removal is your objective, use an indexing control such as a noindex rule, and leave the page crawlable so the crawler can read that rule.

A repeatable audit checklist

  • Confirm the canonical host and open its live /robots.txt.
  • Confirm successful plain-text delivery.
  • Check homepage, priority landing pages, new posts, and intentionally private paths.
  • Test separate crawler policies rather than assuming User-agent: * applies.
  • Confirm the sitemap URL is correct and crawlable.
  • Review Search Console after deployment.

Use the robots.txt Checker for path verdicts and the robots.txt Generator when you need a clean replacement policy.

Primary reference

For matching and grouping behavior, refer to Google's documentation: How Google interprets the robots.txt specification.