Robots.txt and Crawl Policy Reader

Author: FindPrompts

Interpret robots.txt, sitemaps, and crawl directives to build a compliant crawl plan.

6/11/2026

Prompt

## CONTEXT
Before crawling, the developer wants to correctly read and obey a site's robots.txt, sitemaps, and page-level robots directives. They need to translate these rules into an allow/deny list and a crawl plan that stays within the stated boundaries.

## ROLE
Act as a crawl-compliance expert who reads robots directives precisely, including wildcard matching, crawl-delay, and per-agent rules.

## RESPONSE GUIDELINES
- Explain how to fetch and parse robots.txt for the relevant user agent.
- Resolve allow versus disallow precedence correctly.
- Incorporate sitemap discovery into the plan.
- Honor meta robots and X-Robots-Tag directives.
- Output a concrete allowed-URL policy.
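
As a rough illustration of the fetch step, the sketch below derives the robots.txt location from any page URL and maps the HTTP status of that fetch to a coarse policy. The function names and the policy labels (`parse`, `allow_all`, `disallow_all`, `unresolved`) are hypothetical; the status handling follows the common convention that a 4xx response means no restrictions were declared and a 5xx response means the crawler should assume a complete disallow.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the scheme and authority of page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def policy_for_status(status: int) -> str:
    """Map the HTTP status of the robots.txt fetch to a coarse crawl policy."""
    if 200 <= status < 300:
        return "parse"          # read the rules and apply them
    if 400 <= status < 500:
        return "allow_all"      # file unavailable: no restrictions declared
    if 500 <= status < 600:
        return "disallow_all"   # server error: assume everything is off limits
    return "unresolved"         # e.g. a redirect the HTTP client did not follow


print(robots_url("https://example.com/blog/post?id=7"))  # https://example.com/robots.txt
print(policy_for_status(503))                            # disallow_all
```

The actual network request is left out on purpose so the logic can be checked offline; any HTTP client that follows redirects could supply the status code.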

## TASK CRITERIA

### Directive Parsing
- Match the most specific user-agent group.
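- Apply longest-match precedence between allow and disallow; when matching rules tie in length, allow wins.
- Handle * wildcards and $ end-of-path anchors.
- Read crawl-delay and request-rate hints, which are non-standard extensions that not every site declares.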
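
The matching rules above can be sketched roughly as follows. This is a minimal illustration, not a full parser: it assumes the rules for the most specific user-agent group have already been extracted into `(kind, pattern)` pairs, and the names `pattern_to_regex`, `is_allowed`, and `RULES` are made up for the example.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start.

    '*' matches any run of characters; a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence; on a tie in length, allow wins."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


# Hypothetical rule set for one user-agent group.
RULES = [("disallow", "/private/"), ("allow", "/private/press/"), ("disallow", "/*.pdf$")]
for path in ("/private/notes", "/private/press/2024", "/docs/a.pdf", "/docs/a.pdf?dl=1"):
    print(path, is_allowed(path, RULES))
```

With these rules, `/private/press/2024` is allowed because the allow pattern is longer than the disallow pattern, and `/docs/a.pdf?dl=1` is allowed because the `$` anchor requires the path to end in `.pdf`.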

### Sitemap Use
- Discover sitemaps listed in robots.txt.
- Parse sitemap index and nested sitemaps.
- Use lastmod to prioritize fresh URLs.
- Filter sitemap URLs against disallow rules.
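
One possible shape for the sitemap steps is sketched below using only the standard library. It assumes the caller supplies a `fetch` callable that returns sitemap XML as text and a `path_allowed` predicate built from the robots.txt rules; `iter_sitemap`, `crawlable`, `INDEX`, and `PAGES` are illustrative names, and the sample XML is invented. Sorting `lastmod` as a string is an approximation that works when all values share one date format.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Tuple
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_sitemap(xml_text: str, fetch: Callable[[str], str],
                 depth: int = 0, max_depth: int = 3) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (loc, lastmod) pairs, descending into sitemap index files."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("}sitemapindex"):
        if depth >= max_depth:  # guard against runaway nesting
            return
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                yield from iter_sitemap(fetch(loc.text.strip()), fetch, depth + 1, max_depth)
        return
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if loc:
            yield loc, (lastmod or None)


def crawlable(entries, path_allowed: Callable[[str], bool]):
    """Drop disallowed URLs, then put the most recently modified first."""
    kept = [(loc, lastmod) for loc, lastmod in entries
            if path_allowed(urlsplit(loc).path or "/")]
    return sorted(kept, key=lambda e: e[1] or "", reverse=True)


# Invented sample data standing in for real HTTP responses.
URLSET_OPEN = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
PAGES = {
    "https://example.com/sitemap-posts.xml": URLSET_OPEN + """
  <url><loc>https://example.com/posts/old</loc><lastmod>2023-05-01</lastmod></url>
  <url><loc>https://example.com/posts/new</loc><lastmod>2024-02-10</lastmod></url>
  <url><loc>https://example.com/private/draft</loc><lastmod>2024-03-01</lastmod></url>
</urlset>""",
}
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

plan = crawlable(iter_sitemap(INDEX, PAGES.__getitem__),
                 lambda path: not path.startswith("/private/"))
print(plan)
```

Passing `fetch` in as a parameter keeps the parsing logic separate from the network layer, so the same function can run against cached files or a live HTTP client.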

### Page-Level Directives
- Respect meta robots noindex and nofollow.
- Honor X-Robots-Tag response headers.
- Drop pages flagged noindex from the output, and do not queue links found on pages flagged nofollow.
- Treat rel=canonical links as hints when deduplicating.
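
A hedged sketch of how the page-level checks might be combined: it collects directives from `<meta name="robots">` tags and from `X-Robots-Tag` header values, then reduces them to index and follow decisions. The agent token `examplebot` and all function and class names are assumptions for illustration, and the header parsing only covers the simple comma-separated form with an optional `agent:` prefix.

```python
from html.parser import HTMLParser
from typing import Iterable, Set

# Directives that carry their own "name: value" form and are not agent prefixes.
VALUE_DIRECTIVES = {"unavailable_after", "max-snippet", "max-image-preview", "max-video-preview"}


def split_directives(value: str) -> Set[str]:
    return {d.strip().lower() for d in value.split(",") if d.strip()}


class MetaRobotsParser(HTMLParser):
    """Collect directives from meta tags named 'robots' or the crawler's own token."""

    def __init__(self, agent: str):
        super().__init__()
        self.names = {"robots", agent.lower()}
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").strip().lower() in self.names:
            self.directives |= split_directives(attr.get("content", ""))


def header_directives(values: Iterable[str], agent: str) -> Set[str]:
    """Parse X-Robots-Tag values, skipping those addressed to other agents."""
    found: Set[str] = set()
    for value in values:
        prefix, sep, rest = value.partition(":")
        name = prefix.strip().lower()
        if sep and "," not in name and name not in VALUE_DIRECTIVES:
            if name != agent.lower():
                continue  # addressed to a different crawler
            value = rest
        found |= split_directives(value)
    return found


def page_policy(html: str, header_values: Iterable[str], agent: str = "examplebot") -> dict:
    parser = MetaRobotsParser(agent)
    parser.feed(html)
    directives = parser.directives | header_directives(header_values, agent)
    blocked = "none" in directives  # 'none' is shorthand for noindex, nofollow
    return {"index": not (blocked or "noindex" in directives),
            "follow": not (blocked or "nofollow" in directives)}


html = '<html><head><meta name="robots" content="nofollow"></head></html>'
print(page_policy(html, ["noindex", "otherbot: nofollow"]))
```

Because these directives live in the response itself, they can only be evaluated after the page has been fetched; they govern what is kept and which links are queued, not whether the request is made.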

### Plan Construction
- Produce a clear allowed and disallowed URL list.
- Set crawl rate from declared delays.
- Order the queue by freshness and priority.
- Document the policy decisions made.
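
To make the rate and ordering steps concrete, the sketch below reads `Crawl-delay` and `Request-rate` with Python's `urllib.robotparser` and picks the most conservative interval, then orders a queue by `lastmod`. The agent name `examplebot`, the one-second floor, the sample robots.txt, and the helper names are all assumptions; `lastmod` values are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timezone
from typing import List, Optional, Tuple
from urllib.robotparser import RobotFileParser


def declared_delay(robots_txt: str, agent: str, floor: float = 1.0) -> float:
    """Return seconds to wait between requests: the largest of the declared hints and a floor."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    candidates = [floor]
    delay = rp.crawl_delay(agent)
    if delay is not None:
        candidates.append(float(delay))
    rate = rp.request_rate(agent)  # named tuple (requests, seconds) or None
    if rate is not None and rate.requests > 0:
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)


def order_queue(entries: List[Tuple[str, Optional[datetime]]]) -> List[str]:
    """Most recently modified first; URLs without lastmod go last."""
    oldest = datetime.min.replace(tzinfo=timezone.utc)
    ranked = sorted(entries, key=lambda e: e[1] or oldest, reverse=True)
    return [url for url, _ in ranked]


# Invented robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: examplebot
Crawl-delay: 2
Request-rate: 1/10
Disallow: /private/
"""

print(declared_delay(ROBOTS_TXT, "examplebot"))  # 10.0
print(order_queue([
    ("https://example.com/a", datetime(2023, 5, 1, tzinfo=timezone.utc)),
    ("https://example.com/b", None),
    ("https://example.com/c", datetime(2024, 2, 10, tzinfo=timezone.utc)),
]))
```

Taking the maximum of the hints errs on the side of the slower rate when a site declares both.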

### Ongoing Compliance
- Re-check robots.txt periodically.
- Stop fetching newly disallowed paths.
- Log any rule changes between runs.
- Keep an auditable compliance record.
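
One way to approach the re-check is to diff the allow and disallow rules between two snapshots of robots.txt and write the result as an audit record. The sketch below is a simplified illustration: `rules_by_agent`, `diff_rules`, and `audit_record` are hypothetical helpers, the record layout is an assumption, and the parser only tracks `User-agent`, `Allow`, and `Disallow` lines.

```python
import hashlib
import json


def rules_by_agent(robots_txt: str) -> set:
    """Return a set of (agent, kind, pattern) tuples from a robots.txt snapshot."""
    rules, agents, in_rules = set(), [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a user-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif key in ("allow", "disallow"):
            in_rules = True
            for agent in agents:
                rules.add((agent, key, value))
    return rules


def diff_rules(old: str, new: str) -> dict:
    before, after = rules_by_agent(old), rules_by_agent(new)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


def audit_record(old: str, new: str, checked_at: str) -> dict:
    """Build a JSON-serializable record of what changed between two runs."""
    change = diff_rules(old, new)
    return {
        "checked_at": checked_at,
        "sha256": hashlib.sha256(new.encode("utf-8")).hexdigest(),
        "added": change["added"],
        "removed": change["removed"],
        # Paths the crawler should stop fetching from now on.
        "newly_disallowed": [r for r in change["added"] if r[1] == "disallow"],
    }


OLD = "User-agent: *\nDisallow: /a\n"
NEW = "User-agent: *\nDisallow: /a\nDisallow: /b  # added this week\n"
print(json.dumps(audit_record(OLD, NEW, "2026-06-11T00:00:00Z"), indent=2))
```

Appending one such record per run to a log file gives a simple trail of when each rule appeared or disappeared.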

## ASK THE USER FOR
- The site domain and its robots.txt URL.
- The user-agent string they will crawl as.
- Which URL paths they want to target.
- How frequently they plan to re-crawl.
