sftrkr/sitepulse

sitepulse

sitepulse is a Rust-based CLI and MCP-enabled site intelligence tool for technical SEO, sitemap health checks, and AI agent readiness audits.

It discovers URLs from a sitemap.xml, checks each page's HTTP status, response time, redirect state, final URL, and optional metadata, then produces terminal, CSV, JSON, HTML, JUnit, and SARIF reports. It also includes an --agent-ready audit inspired by emerging agent-web standards such as llms.txt, AI crawler rules, discovery headers, protocol discovery, structured data, DNS-AID, and agentic commerce signals.

For AI-native workflows, sitepulse can run as a local Model Context Protocol (MCP) server via sitepulse mcp, allowing Codex-compatible apps and other MCP clients to call sitemap checks, agent readiness audits, and config validation as structured tools.

The project is designed for WordPress, WooCommerce, e-commerce, publisher, and SaaS websites that need to detect broken links, 404/500 errors, redirect issues, slow pages, metadata gaps, and whether the site is ready for AI agents and crawlers.

Status

The first working version has been implemented.

Current features:

  • sitepulse check <SITEMAP_URL> command
  • sitepulse mcp command for Model Context Protocol integrations
  • Standard sitemap parsing
  • Sitemap index support
  • Gzip sitemap support (.xml.gz)
  • Maximum sitemap index depth: 2
  • Extract URLs from <loc>...</loc> entries
  • Deduplicate repeated URLs
  • HTTP status code reporting
  • Response time measurement
  • Redirect following
  • Final URL reporting
  • Timeout support
  • Custom User-Agent support
  • Concurrency support
  • Per-request delay support for politeness/rate limiting
  • Option to show only errors
  • Summary-only output option
  • Retry support for network errors and 5xx responses
  • GET/HEAD check method selection
  • Optional title, meta description, and canonical URL extraction
  • Same-host filtering option
  • Optional robots.txt filtering
  • Initial agent readiness audit (--agent-ready)
  • CI-friendly agent readiness score threshold
  • Maximum URL limit option
  • Dry-run discovery mode
  • CSV export
  • JSON export
  • HTML report export
  • CI-friendly non-zero exit option
  • Summary report
  • Top 10 slowest URLs
  • Default User-Agent: sitepulse/0.1 (+https://example.local)
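
The two discovery features above, extracting URLs from <loc>...</loc> entries and deduplicating repeats, can be sketched in a few lines. This is a minimal illustration using only the standard library, not the code in sitemap.rs: the function name `extract_locs` is hypothetical, and the sketch relies on plain string scanning, so it does not decode XML entities, handle CDATA, or follow sitemap indexes.

```rust
use std::collections::HashSet;

/// Collect the text of every `<loc>...</loc>` element in document order,
/// dropping repeated URLs. Plain string scanning only: no entity decoding,
/// CDATA, or namespace handling.
fn extract_locs(xml: &str) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut urls = Vec::new();
    let mut rest = xml;
    while let Some(start) = rest.find("<loc>") {
        let after_open = &rest[start + "<loc>".len()..];
        // Stop at an unterminated element instead of guessing where it ends.
        let Some(end) = after_open.find("</loc>") else { break };
        let url = after_open[..end].trim();
        // `insert` returns false when the URL was already seen.
        if !url.is_empty() && seen.insert(url.to_string()) {
            urls.push(url.to_string());
        }
        rest = &after_open[end + "</loc>".len()..];
    }
    urls
}

fn main() {
    let xml = r#"<urlset>
      <url><loc>https://example.com/</loc></url>
      <url><loc> https://example.com/about </loc></url>
      <url><loc>https://example.com/</loc></url>
    </urlset>"#;
    for url in extract_locs(xml) {
        println!("{url}");
    }
}
```

Keeping a `HashSet` next to the output `Vec` preserves sitemap order while still rejecting duplicates in constant time per URL.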

Installation

Requirements:

  • Rust stable
  • Cargo

Build the project:

cargo build

Build a release binary:

cargo build --release

Generated binary:

./target/release/sitepulse

Usage

Basic usage:

cargo run -- check https://example.com/sitemap.xml

Using the compiled binary:

sitepulse check https://example.com/sitemap.xml

CLI options

sitepulse check <SITEMAP_URL> [OPTIONS]
sitepulse config validate <FILE>

Options:

| Option | Description | Default |
| --- | --- | --- |
| --config <FILE> | Load check options from a JSON config file | None |
| --concurrency <N> | Number of concurrent HTTP checks | 10 |
| --delay-ms <MS> | Delay before each URL check request in milliseconds | 0 |
| --timeout <SECONDS> | Request timeout in seconds | 10 |
| --user-agent <VALUE> | Custom User-Agent for all HTTP requests | sitepulse/0.1 (+https://example.local) |
| --method <METHOD> | HTTP method for URL checks: get or head | get |
| --analyze-meta | Extract page title, meta description, and canonical URL. Uses GET even with --method=head | Disabled |
| --only-errors | Show only network errors and 4xx/5xx responses | Disabled |
| --summary-only | Print only the summary, without the per-URL result table | Disabled |
| --export <FILE> | Write results to a CSV file | None |
| --export-json <FILE> | Write results to a JSON file | None |
| --export-html <FILE> | Write an HTML report | None |
| --export-junit <FILE> | Write URL check results as JUnit XML for CI systems | None |
| --export-sarif <FILE> | Write URL check findings as SARIF for code scanning systems | None |
| --fail-on-errors | Exit with code 2 if any 4xx, 5xx, timeout, or network error is found | Disabled |
| --retries <N> | Retry failed URL checks and 5xx responses | 0 |
| --sitemap-retries <N> | Retry sitemap downloads before failing | 2 |
| --max-urls <N> | Limit how many discovered URLs are checked | None |
| --dry-run | Discover and filter URLs without running HTTP checks | Disabled |
| --same-host-only | Only check URLs whose host matches the sitemap URL host | Disabled |
| --respect-robots | Filter out URLs disallowed by robots.txt | Disabled |
| --agent-ready | Run an agent readiness audit for the sitemap host | Disabled |
| --agent-ready-export-json <FILE> | Write agent readiness results to a JSON file | None |
| --agent-ready-export-html <FILE> | Write agent readiness results to an HTML file | None |
| --agent-ready-fail-under <PERCENT> | Exit with code 3 if agent readiness score is below the threshold | None |
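
For CI pipelines, the two exit-code options above can be read as a small decision function. The sketch below is illustrative only: the type and field names are hypothetical, and the README does not say which code is returned when both conditions trigger, so checking URL errors first is an assumption. "Below the threshold" is taken as a strict comparison.

```rust
/// Facts about a finished run. Hypothetical names, not from the sitepulse source.
struct RunOutcome {
    /// Number of 4xx, 5xx, timeout, and network-error results.
    failed_urls: usize,
    /// Agent readiness score in percent, present only when --agent-ready ran.
    agent_score: Option<f64>,
}

/// The CI-related options from the table above.
struct ExitPolicy {
    /// --fail-on-errors
    fail_on_errors: bool,
    /// --agent-ready-fail-under <PERCENT>
    agent_ready_fail_under: Option<f64>,
}

fn exit_code(outcome: &RunOutcome, policy: &ExitPolicy) -> i32 {
    // Assumption: URL errors are evaluated before the readiness threshold.
    if policy.fail_on_errors && outcome.failed_urls > 0 {
        return 2;
    }
    if let (Some(score), Some(threshold)) = (outcome.agent_score, policy.agent_ready_fail_under) {
        // Strictly below the threshold fails; an equal score passes.
        if score < threshold {
            return 3;
        }
    }
    0
}

fn main() {
    let outcome = RunOutcome { failed_urls: 0, agent_score: Some(72.5) };
    let policy = ExitPolicy { fail_on_errors: true, agent_ready_fail_under: Some(80.0) };
    println!("exit code: {}", exit_code(&outcome, &policy));
}
```

Distinct codes let a CI job tell a broken page (2) apart from a readiness regression (3) without parsing the report.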

Examples:

cargo run -- check https://example.com/sitemap.xml --concurrency 20
cargo run -- check https://example.com/sitemap.xml --timeout 15
cargo run -- check https://example.com/sitemap.xml --method head
cargo run -- check https://example.com/sitemap.xml --analyze-meta
cargo run -- check https://example.com/sitemap.xml --only-errors
cargo run -- check https://example.com/sitemap.xml --export report.csv
cargo run -- check https://example.com/sitemap.xml --retries 2
cargo run -- check https://example.com/sitemap.xml --max-urls 100
cargo run -- check https://example.com/sitemap.xml --same-host-only
cargo run -- check https://example.com/sitemap.xml --respect-robots
cargo run -- check https://example.com/sitemap.xml --agent-ready
cargo run -- check https://example.com/sitemap.xml --sitemap-retries 3
cargo run -- check https://example.com/sitemap.xml \
  --agent-ready \
  --agent-ready-export-json agent-ready.json \
  --agent-ready-export-html agent-ready.html \
  --agent-ready-fail-under 80

Multiple options can be used together:

cargo run -- check https://example.com/sitemap.xml \
  --concurrency 20 \
  --timeout 10 \
  --method head \
  --analyze-meta \
  --retries 2 \
  --sitemap-retries 3 \
  --max-urls 1000 \
  --same-host-only \
  --respect-robots \
  --only-errors \
  --export report.csv \
  --export-json report.json \
  --export-html report.html \
  --agent-ready \
  --agent-ready-export-json agent-ready.json \
  --agent-ready-export-html agent-ready.html

Example terminal output

Checking sitemap: https://example.com/sitemap.xml
Concurrency: 20
Timeout: 10s
User-Agent: sitepulse/0.1 (+https://example.local)
Method: HEAD
Analyze meta: yes
Retries: 2
Sitemap retries: 2

Discovered URLs: 1240

STATUS      TIME ATTEMPTS  METHOD  REDIRECT    ERROR URL
------------------------------------------------------------------------------------------
200        184ms        1     HEAD        no       no https://example.com/
301         96ms        1     HEAD       yes       no https://example.com/old -> https://example.com/new
404        121ms        1     HEAD        no       no https://example.com/missing-page
500        430ms        3     HEAD        no       no https://example.com/broken

Summary:
Total: 1240
2xx: 1190
3xx: 22
4xx: 20
5xx: 4
Errors: 4
Average response time: 218ms

Slowest URLs:
1. 3820ms https://example.com/category/electronics
2. 2910ms https://example.com/product/example
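
The summary and slowest-URL list above can be derived from the per-URL results in a single pass. The following is a rough sketch, not the code in report.rs: the struct and function names are hypothetical. It assumes that "Errors" counts requests that produced no HTTP response, which is consistent with the sample totals (1190 + 22 + 20 + 4 + 4 = 1240), and that the average covers every result.

```rust
/// One checked URL. `status` is None when the request failed before any
/// HTTP response arrived (timeout, DNS failure, connection error).
struct UrlResult {
    url: String,
    status: Option<u16>,
    time_ms: u64,
}

#[derive(Debug, Default, PartialEq)]
struct Summary {
    total: usize,
    ok_2xx: usize,
    redirect_3xx: usize,
    client_4xx: usize,
    server_5xx: usize,
    errors: usize,
    average_ms: u64,
}

fn summarize(results: &[UrlResult]) -> Summary {
    let mut s = Summary { total: results.len(), ..Summary::default() };
    for r in results {
        match r.status {
            Some(200..=299) => s.ok_2xx += 1,
            Some(300..=399) => s.redirect_3xx += 1,
            Some(400..=499) => s.client_4xx += 1,
            Some(500..=599) => s.server_5xx += 1,
            // No response, or a status outside 200-599, is counted as an error here.
            _ => s.errors += 1,
        }
    }
    if !results.is_empty() {
        let total_ms: u64 = results.iter().map(|r| r.time_ms).sum();
        s.average_ms = total_ms / results.len() as u64; // integer division
    }
    s
}

/// Return up to `n` results, slowest first.
fn slowest(results: &[UrlResult], n: usize) -> Vec<&UrlResult> {
    let mut sorted: Vec<&UrlResult> = results.iter().collect();
    sorted.sort_by(|a, b| b.time_ms.cmp(&a.time_ms));
    sorted.truncate(n);
    sorted
}

fn main() {
    let results = vec![
        UrlResult { url: "https://example.com/".into(), status: Some(200), time_ms: 184 },
        UrlResult { url: "https://example.com/old".into(), status: Some(301), time_ms: 96 },
        UrlResult { url: "https://example.com/missing-page".into(), status: Some(404), time_ms: 121 },
        UrlResult { url: "https://example.com/broken".into(), status: Some(500), time_ms: 430 },
    ];
    println!("{:?}", summarize(&results));
    for (rank, r) in slowest(&results, 10).iter().enumerate() {
        println!("{}. {}ms {}", rank + 1, r.time_ms, r.url);
    }
}
```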

Export

Export to CSV:

cargo run -- check https://example.com/sitemap.xml --export report.csv

Export to JSON:

cargo run -- check https://example.com/sitemap.xml --export-json report.json

Export to HTML:

cargo run -- check https://example.com/sitemap.xml --export-html report.html

CSV, JSON, and HTML result fields include:

  • url
  • status
  • time_ms
  • redirected
  • final_url
  • error
  • attempts
  • method
  • title
  • meta_description
  • canonical_url
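
To make the field list concrete, here is one way such a record could be modeled and flattened into a CSV row. This is a sketch under assumptions: the struct name `CheckRecord`, the Rust types chosen for each field, and the use of empty cells for missing values are illustrative and may differ from models.rs and export.rs. Quoting follows RFC 4180.

```rust
/// Hypothetical record mirroring the exported field list; the types are assumptions.
struct CheckRecord {
    url: String,
    status: Option<u16>,
    time_ms: u64,
    redirected: bool,
    final_url: Option<String>,
    error: Option<String>,
    attempts: u32,
    method: String,
    title: Option<String>,
    meta_description: Option<String>,
    canonical_url: Option<String>,
}

const CSV_HEADER: &str =
    "url,status,time_ms,redirected,final_url,error,attempts,method,title,meta_description,canonical_url";

/// Quote a field per RFC 4180 when it contains a comma, quote, or line break.
fn csv_field(value: &str) -> String {
    if value.contains(&[',', '"', '\n', '\r'][..]) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

impl CheckRecord {
    /// Render the record in the same column order as `CSV_HEADER`.
    fn to_csv_row(&self) -> String {
        // Missing optional values become empty cells.
        let opt = |v: &Option<String>| csv_field(v.as_deref().unwrap_or(""));
        [
            csv_field(&self.url),
            self.status.map(|s| s.to_string()).unwrap_or_default(),
            self.time_ms.to_string(),
            self.redirected.to_string(),
            opt(&self.final_url),
            opt(&self.error),
            self.attempts.to_string(),
            csv_field(&self.method),
            opt(&self.title),
            opt(&self.meta_description),
            opt(&self.canonical_url),
        ]
        .join(",")
    }
}

fn main() {
    let record = CheckRecord {
        url: "https://example.com/".into(),
        status: Some(200),
        time_ms: 184,
        redirected: false,
        final_url: None,
        error: None,
        attempts: 1,
        method: "GET".into(),
        title: Some("Home, sweet home".into()),
        meta_description: None,
        canonical_url: Some("https://example.com/".into()),
    };
    println!("{CSV_HEADER}");
    println!("{}", record.to_csv_row());
}
```

Page titles and meta descriptions often contain commas or quotes, which is why the text fields go through `csv_field` rather than being written as-is.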

Project structure

src/
  main.rs      # Application entry point
  cli.rs       # CLI arguments and command definitions
  sitemap.rs   # Sitemap download, parsing, and discovery
  checker.rs   # URL HTTP checks
  report.rs    # Terminal output and summary report
  export.rs    # CSV, JSON, and HTML export
  models.rs    # Shared data models

examples/
  sitemap.xml  # Example sitemap for testing

Configuration file

--config accepts a JSON file with check options. Example:

{
  "concurrency": 5,
  "timeout": 15,
  "method": "head",
  "analyze_meta": true,
  "same_host_only": true,
  "respect_robots": true,
  "agent_ready": true,
  "agent_ready_fail_under": 70
}

Command-line options are parsed first, then config values are applied. For repeated audits, keep shared defaults in a config file and pass target-specific values such as the sitemap URL on the command line.

Development

Format code:

cargo fmt

Run compile checks:

cargo check

Run tests:

cargo test

Roadmap

Completed:

  • Project skeleton
  • Cargo.toml
  • CLI command
  • Sitemap download
  • URL parsing
  • HTTP checks
  • Concurrency
  • Per-request delay support for politeness/rate limiting
  • Timeout
  • Custom User-Agent support
  • --only-errors
  • --summary-only
  • Retry support
  • Sitemap download retry support
  • GET/HEAD check method selection
  • Optional title, meta description, and canonical URL extraction
  • Same-host filtering option
  • Optional robots.txt filtering
  • Initial agent readiness audit (--agent-ready)
  • CI-friendly agent readiness score threshold
  • Maximum URL limit option
  • Dry-run discovery mode
  • CSV export
  • JSON export
  • HTML report export
  • CI-friendly --fail-on-errors option
  • Sitemap index support
  • Gzip sitemap support
  • Slow URL list
  • README
  • Integration tests with a local HTTP server
  • Expanded agent readiness audit (--agent-ready)
    • Discoverability checks: robots.txt, sitemap directives, Link headers, DNS-AID
    • Content accessibility checks: llms.txt, llms-full.txt, Markdown negotiation
    • Bot access control checks: AI bot rules, allow/block detection, Content Signals, Web Bot Auth
    • Protocol discovery checks: MCP, Agent Skills, WebMCP, A2A, API catalog, OAuth, auth.md
    • Page intelligence checks: title, meta description, canonical URL, OpenGraph, JSON-LD, semantic HTML
    • Commerce readiness checks: x402, MPP, UCP, ACP
    • Scoring/reporting: score, PASS/WARN/FAIL checklist, JSON/HTML exports
  • Add GitHub release workflow for tagged binary releases
  • Automated versioning with release-plz
  • Add configuration file support for repeated audits
  • Add basic per-request politeness delay
  • Add JUnit and SARIF CI exports
  • Richer structured data validation for JSON-LD schema types
  • Per-host concurrency controls
  • Add Homebrew tap formula draft
  • Add v0.1.0 release notes draft
  • Add advanced per-host rate window controls
  • Publish GitHub release notes and binaries for v0.1.0
  • Publish prebuilt release binaries

Potential next improvements:

Notes

  • HTTP errors do not crash the program; they are reported per URL.
  • If the sitemap cannot be downloaded or the XML is invalid, the program returns a clear error.
  • Redirects are followed and the final URL is recorded.
  • Duplicate URLs are deduplicated.
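
The first note, together with the retry behavior described earlier (network errors and 5xx responses are retried), suggests a per-URL outcome type rather than a hard failure. The sketch below is an assumption-laden illustration, not the code in checker.rs: `Attempt`, `is_retryable`, and `check_with_retries` are hypothetical names, and the HTTP request is replaced by a closure so the example runs without a network.

```rust
/// Result of a single request attempt. Hypothetical type for illustration.
#[derive(Debug, Clone, PartialEq)]
enum Attempt {
    /// An HTTP response arrived with this status code.
    Response(u16),
    /// Timeout, DNS failure, refused connection, and similar.
    NetworkError(String),
}

/// Network errors and 5xx responses are retried; 2xx, 3xx, and 4xx are final.
fn is_retryable(attempt: &Attempt) -> bool {
    match attempt {
        Attempt::NetworkError(_) => true,
        Attempt::Response(status) => (500..=599).contains(status),
    }
}

/// Run `fetch` once, then up to `retries` more times while the outcome is
/// retryable. Returns the last outcome and the number of attempts made, so a
/// failing URL ends up as a reported row instead of aborting the run.
fn check_with_retries<F>(retries: u32, mut fetch: F) -> (Attempt, u32)
where
    F: FnMut() -> Attempt,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        let outcome = fetch();
        if !is_retryable(&outcome) || attempts > retries {
            return (outcome, attempts);
        }
    }
}

fn main() {
    // Scripted outcomes, consumed from the end: network error, then 500, then 200.
    let mut scripted = vec![
        Attempt::Response(200),
        Attempt::Response(500),
        Attempt::NetworkError("timeout".into()),
    ];
    let (outcome, attempts) = check_with_retries(2, || scripted.pop().unwrap());
    println!("{outcome:?} after {attempts} attempt(s)");
}
```

With --retries 2 this yields at most three attempts per URL, which matches the ATTEMPTS value of 3 shown for the 500 row in the example terminal output.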

License

This project is licensed under the MIT License. See LICENSE for details.

Contributing

Please see CONTRIBUTING.md for development setup, validation commands, and pull request guidelines.

Security

Please see SECURITY.md for vulnerability reporting guidelines.

Changelog

Please see CHANGELOG.md for release history.

Additional documentation

Versioning automation

Versioning is automated with release-plz: https://release-plz.ieni.dev/. On pushes to main, the Release PR workflow analyzes conventional commits, updates Cargo.toml and CHANGELOG.md, and opens or updates a release pull request.

Recommended commit prefixes:

  • feat: for new features
  • fix: for bug fixes
  • perf: for performance improvements
  • docs: for documentation-only changes
  • refactor: for internal changes
  • ci: for CI changes

When the release PR is merged, release-plz can create the Git tag and GitHub Release. The existing Release workflow then builds and uploads prebuilt binaries for that tag.