Searching for a robots.txt checker usually means a live site may be blocking crawlers or serving an outdated file. An audit needs two separate checks: confirm what the server returns, then evaluate whether important paths are allowed for the crawlers you care about.
Step 1: find the correct file
robots.txt applies per protocol, host, and port. The file for https://www.example.com/products/1 is:
https://www.example.com/robots.txt
It is not inherited from https://example.com/robots.txt, and a staging subdomain has its own file. Audit the exact canonical host shown in your search data.
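The host rule above can be sketched in a few lines of Python. The function name `robots_url` is a hypothetical helper introduced here for illustration; it only uses the standard library's `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep the scheme and host (including any port) exactly as given;
    # replace the path and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/products/1"))
# prints https://www.example.com/robots.txt
```

Because the host is carried over unchanged, `www.example.com`, `example.com`, and a staging subdomain each map to a different file.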
The live request should return:
- HTTP 200 when a file is intentionally published.
- A plain-text body with readable User-agent, Allow, Disallow, and optional Sitemap fields.
- No HTML error page, application redirect loop, or authentication challenge.
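As a rough illustration, the delivery checks above could be scripted as follows. This is a minimal sketch: the function names, the `robots-audit-sketch` User-Agent string, and the choice to treat a `text/plain` Content-Type as the signal for plain-text delivery are all assumptions of this example, not requirements stated elsewhere in this guide.

```python
import urllib.error
import urllib.request


def delivery_problems(status: int, content_type: str, body: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if status != 200:
        problems.append(f"unexpected HTTP status {status}")
    # Assumption of this sketch: plain-text delivery means a text/plain Content-Type.
    if not content_type.lower().startswith("text/plain"):
        problems.append(f"unexpected Content-Type {content_type!r}")
    head = body.lstrip().lower()
    if head.startswith("<!doctype") or head.startswith("<html"):
        problems.append("body looks like an HTML page")
    fields = ("user-agent:", "allow:", "disallow:", "sitemap:")
    if not any(line.strip().lower().startswith(fields) for line in body.splitlines()):
        problems.append("no recognizable robots.txt fields")
    return problems


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str, str]:
    """Fetch url and return (status, Content-Type header, decoded body)."""
    request = urllib.request.Request(url, headers={"User-Agent": "robots-audit-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, response.headers.get("Content-Type", ""), body
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; report them like any other response.
        body = error.read().decode("utf-8", errors="replace")
        return error.code, error.headers.get("Content-Type", ""), body


# Offline example: a login page served where robots.txt should be.
for problem in delivery_problems(200, "text/html; charset=utf-8", "<!doctype html><html>Login</html>"):
    print(problem)
```

To audit a live site, pass the result of `fetch("https://www.example.com/robots.txt")` to `delivery_problems`. Note that `urlopen` follows redirects silently, so a 200 reported here may come from a different URL than the one requested.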
Step 2: copy the live body into a checker
The robots.txt Tester and Checker evaluates pasted rules locally. Copying the response body is deliberate: it lets you test exactly what crawlers receive without granting another service access to your site.
Start with a small matrix:
| Crawler | Path | Expected result |
|---|---|---|
| Googlebot | / | Allowed for an indexable public site |
| Googlebot | A key landing page | Allowed |
| Googlebot | An admin or internal path | Usually disallowed |
| GPTBot | A public article | Matches your AI training policy |
| PerplexityBot | A public article | Matches your AI answer-engine policy |
An unexpected result deserves investigation before any request for indexing.
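A matrix like the one above can also be checked offline with Python's standard `urllib.robotparser`. The sketch below is illustrative only: the rules in `LIVE_BODY`, the paths, and the expected verdicts are made-up sample data, and `run_matrix` is a hypothetical helper. Python's parser does not follow Google's matching rules in every respect (for example, it applies the first matching rule in a group and has no wildcard support), so treat its verdicts as a first pass for simple files rather than a replacement for a Google-compatible checker.

```python
from urllib.robotparser import RobotFileParser

# Sample rules standing in for the body copied from the live site.
LIVE_BODY = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

# (crawler, path, expected_allowed) rows, mirroring the audit matrix.
MATRIX = [
    ("Googlebot", "/", True),
    ("Googlebot", "/admin/settings", False),
    ("GPTBot", "/blog/post", False),
    ("PerplexityBot", "/blog/post", True),
]


def run_matrix(body, matrix):
    """Return (agent, path, expected, actual) for every row whose verdict differs."""
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    mismatches = []
    for agent, path, expected in matrix:
        actual = parser.can_fetch(agent, path)
        if actual != expected:
            mismatches.append((agent, path, expected, actual))
    return mismatches


for row in run_matrix(LIVE_BODY, MATRIX):
    print("unexpected verdict:", row)
```

With the sample data every row matches, so nothing is printed; any printed row is a verdict to investigate.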
Step 3: check the patterns that cause accidental blocks
These rules commonly suppress useful crawling:
User-agent: *
Disallow: /
That blocks every crawler unless a more specific group applies. Another subtle case is a broad folder rule:
User-agent: *
Disallow: /assets/
If indexable pages depend on blocked rendering resources, search engines may receive a poor representation of the page.
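A quick scan for the blanket rule can be sketched as below. The function `blanket_blocks` is a hypothetical helper with a deliberately simplified line parser; it only flags groups for human review and does not account for Allow rules that may reopen parts of the site.

```python
def blanket_blocks(robots_txt: str) -> list[str]:
    """Return the user agents whose group contains a bare 'Disallow: /'."""
    blocked = []
    agents = []        # user agents named by the group currently being read
    in_rules = False   # True once the current group has listed a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/":
                blocked.extend(a for a in agents if a not in blocked)
    return blocked


print(blanket_blocks("User-agent: *\nDisallow: /\n"))
# prints ['*']
```

A result containing `*` means every crawler without a more specific group is blocked from the whole site.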
Also inspect repeated crawler groups. Under Google's documented behavior, multiple groups for the same specific user agent are combined when evaluating its paths. A checker that reads only the first group can produce a false answer.
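To make the grouping behavior concrete, here is a minimal sketch that collects the rules from every group naming a given user agent instead of stopping at the first one. The function `merged_rules` and the sample file are hypothetical; the parser is simplified and does exact, case-insensitive name matching only.

```python
def merged_rules(robots_txt: str, agent: str) -> list[tuple[str, str]]:
    """Collect (field, path) rules from all groups that name agent exactly."""
    agent = agent.lower()
    merged = []
    agents = []        # user agents named by the group currently being read
    in_rules = False   # True once the current group has listed a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep:
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if agent in agents:
                merged.append((field, value))
    return merged


SAMPLE = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/

User-agent: Googlebot
Disallow: /drafts/
"""

print(merged_rules(SAMPLE, "Googlebot"))
# prints [('disallow', '/private/'), ('disallow', '/drafts/')]
```

A checker that stopped at the first Googlebot group would miss the `/drafts/` rule and report those paths as allowed.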
Step 4: inspect sitemap discovery
A well-maintained file typically advertises the canonical sitemap:
Sitemap: https://www.example.com/sitemap.xml
Open that URL separately and ensure it lists canonical, crawlable pages. Use the sitemap.xml Validator if you are also troubleshooting discovery or stale lastmod values.
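The sitemap check can be sketched in two small steps: pull the Sitemap lines out of the robots.txt body, then list the page URLs in the sitemap XML. Both function names and the sample documents are hypothetical, and the XML namespace is the one used by the sitemaps.org 0.9 schema. Fetching the sitemap itself is left out so the example runs offline.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org 0.9 documents.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return every URL advertised on a 'Sitemap:' line (field name is case-insensitive)."""
    pattern = r"(?im)^[ \t]*sitemap[ \t]*:[ \t]*(\S+)"
    return [match.group(1) for match in re.finditer(pattern, robots_txt)]


def listed_pages(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


ROBOTS = "User-agent: *\nDisallow: /admin/\nSitemap: https://www.example.com/sitemap.xml\n"
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/products/1</loc></url>
</urlset>
"""

print(sitemap_urls(ROBOTS))
print(listed_pages(SITEMAP))
```

For a sitemap index file, `listed_pages` returns the child sitemap URLs rather than page URLs, so each of those would need the same treatment. The listed pages can then be checked against the robots.txt rules to confirm that none of them is disallowed.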
Step 5: verify with Search Console
The Google Search Console robots.txt report shows files Google found for eligible properties, the most recent crawl, and reported warnings or errors. It also allows a recrawl request when an urgent robots.txt correction has been deployed.
Google states that robots.txt manages crawling; it should not be used to hide a page from Google Search. If removal is your objective, use an appropriate indexing control such as noindex, and keep the page crawlable so the crawler can actually read that control.
A repeatable audit checklist
- Confirm the canonical host and open its live /robots.txt.
- Confirm successful plain-text delivery.
- Check homepage, priority landing pages, new posts, and intentionally private paths.
- Test separate crawler policies rather than assuming User-agent: * applies.
- Confirm the sitemap URL is correct and crawlable.
- Review Search Console after deployment.
Use the robots.txt Checker for path verdicts and the robots.txt Generator when you need a clean replacement policy.
Primary reference
For matching and grouping behavior, refer to Google's documentation: How Google interprets the robots.txt specification.
