A comprehensive documentation indexing service for AI and ops knowledge sources. This service crawls, parses, and indexes documentation from multiple sources including OpenAI, Anthropic, Kubernetes, and custom documentation sites.
- Multiple Documentation Sources:
  - OpenAI API documentation
  - Anthropic Claude documentation
  - Kubernetes official docs
  - Custom markdown and HTML sources
- Intelligent Crawling:
  - Respects robots.txt
  - Sitemap support
  - Rate limiting per domain
  - URL deduplication
  - ETag and Last-Modified support
- Content Processing:
  - HTML and Markdown parsing
  - Code block extraction with language detection
  - Header hierarchy preservation
  - Smart chunking with overlap
  - Link preservation
- Change Detection:
  - Content hash comparison
  - ETag support
  - Incremental sync capability
- Database Tracking:
  - PostgreSQL storage
  - Source metadata
  - Last indexed timestamps

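
The change-detection features above (content hash comparison plus ETag support) can be illustrated with a small decision function. This is a minimal sketch, not the service's actual code: `StoredPage`, `Change`, and `detect_change` are hypothetical names, and the hash is assumed to be computed elsewhere and passed in as a string.

```rust
/// What the indexer remembers about a page from the previous sync.
/// Hypothetical struct; the fields mirror the content hash and ETag features.
#[derive(Debug, Clone)]
pub struct StoredPage {
    pub content_hash: String,
    pub etag: Option<String>,
}

/// Outcome of comparing a fresh fetch with the stored record.
#[derive(Debug, PartialEq, Eq)]
pub enum Change {
    New,
    Unchanged,
    Modified,
}

/// Decide whether a fetched page needs to be re-indexed.
pub fn detect_change(
    stored: Option<&StoredPage>,
    fetched_etag: Option<&str>,
    fetched_hash: &str,
) -> Change {
    // No stored record means the page has never been indexed.
    let Some(prev) = stored else {
        return Change::New;
    };
    // Cheap check first: identical ETags mean the server reports no change.
    if let (Some(old), Some(new)) = (prev.etag.as_deref(), fetched_etag) {
        if old == new {
            return Change::Unchanged;
        }
    }
    // Otherwise fall back to comparing content hashes.
    if prev.content_hash == fetched_hash {
        Change::Unchanged
    } else {
        Change::Modified
    }
}
```

An incremental sync would then re-parse and store only pages reported as `New` or `Modified`.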
- Rust 1.70 or later
- PostgreSQL 14 or later
- DATABASE_URL environment variable set

    cargo build --release

Edit config.yaml to configure sources and settings:

    sources:
      openai:
        enabled: true
        base_url: "https://platform.openai.com/docs"
        rate_limit_per_second: 2
        max_pages: 500
        categories:
          - api-reference
          - guides
      anthropic:
        enabled: true
        base_url: "https://docs.anthropic.com"
        rate_limit_per_second: 2
        max_pages: 200
      kubernetes:
        enabled: true
        base_url: "https://kubernetes.io/docs"
        rate_limit_per_second: 1
        max_pages: 1000
        include_sections:
          - concepts
          - tasks
          - reference
    database:
      url: "${DATABASE_URL}"

    docs-indexer init-db

Sync all enabled sources:

    docs-indexer full-sync

    docs-indexer sync-source openai
    docs-indexer sync-source anthropic
    docs-indexer sync-source kubernetes

Only sync changed documents:

    docs-indexer incremental-sync

    docs-indexer stats

    docs-indexer add-custom internal-docs https://docs.company.com --source-type url

    docs-indexer delete-source openai

- Sources (src/sources/):
  - DocumentationSource trait
  - Source-specific implementations (OpenAI, Anthropic, Kubernetes)
  - Custom source support
- Crawler (src/crawler/):
  - HTTP fetcher with retry logic
  - Rate limiting
  - Robots.txt checking
  - Sitemap parsing
- Parser (src/parser/):
  - HTML parser
  - Markdown parser
  - Code block extraction
  - Content chunking
- Database (src/db/):
  - PostgreSQL client
  - Schema management
  - CRUD operations

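
To make the parser's content chunking more concrete, here is a minimal sketch of chunking with overlap. It assumes word-based chunks, which this README does not specify; `chunk_with_overlap` and its parameters are hypothetical names, and the real parser may split on headers, characters, or tokens instead.

```rust
/// Split `text` into chunks of at most `chunk_size` words. Each chunk after
/// the first starts with the last `overlap` words of the previous chunk.
/// Hypothetical helper; word-based splitting is an assumption.
pub fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    let words: Vec<&str> = text.split_whitespace().collect();
    // Each new chunk starts this many words after the previous one.
    let step = chunk_size - overlap;

    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        // Stop once the final word has been emitted, so no chunk is a pure repeat.
        if end == words.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.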
    CREATE TABLE documentation_pages (
        id UUID PRIMARY KEY,
        source VARCHAR(50) NOT NULL,
        url TEXT NOT NULL,
        title TEXT NOT NULL,
        content_hash VARCHAR(64) NOT NULL,
        last_modified TIMESTAMPTZ,
        etag VARCHAR(255),
        last_indexed_at TIMESTAMPTZ NOT NULL,
        metadata JSONB,
        UNIQUE(source, url)
    );

Add to config.yaml:

    custom_sources:
      - name: "internal-runbooks"
        type: "markdown"
        location: "/data/runbooks"
        pattern: "**/*.md"

Add to config.yaml:

    custom_sources:
      - name: "terraform-docs"
        type: "url"
        base_url: "https://docs.company.com/terraform"

See helm/docs-indexer/ for the Helm chart, which includes:
- CronJob for scheduled syncs
- ConfigMap for configuration
- NetworkPolicy for external access
- Service for health checks

    FROM rust:1.70 AS builder
    WORKDIR /app
    COPY . .
    RUN cargo build --release

    FROM debian:bookworm-slim
    COPY --from=builder /app/target/release/docs-indexer /usr/local/bin/
    CMD ["docs-indexer", "full-sync"]

- Rate Limiting: Respects per-domain rate limits
- Robots.txt: Checks and obeys robots.txt
- User-Agent: Identifies as "docs-indexer/1.0"
- HTML Sanitization: Prevents XSS in parsed content
- URL Validation: Validates URLs before fetching
- Network Policies: Restricts egress in Kubernetes

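
As an illustration of per-domain rate limiting, the sketch below tracks the next permitted request time for each domain. It is an assumption about how such a limiter could work, not the crawler's actual implementation: `DomainRateLimiter` and `reserve` are hypothetical names, and the constructor argument is meant to correspond to `rate_limit_per_second` in config.yaml.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks the earliest time each domain may be requested again.
/// Hypothetical type, not taken from the docs-indexer source.
pub struct DomainRateLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    /// `requests_per_second` plays the role of `rate_limit_per_second`.
    pub fn new(requests_per_second: u32) -> Self {
        assert!(requests_per_second > 0, "rate must be positive");
        Self {
            min_interval: Duration::from_secs(1) / requests_per_second,
            next_allowed: HashMap::new(),
        }
    }

    /// Reserve the next request slot for `domain` and return how long the
    /// caller should wait before sending. `now` is passed in so the logic
    /// stays deterministic and easy to test.
    pub fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        // The slot is either right now or the pending reserved time, whichever is later.
        let slot = match self.next_allowed.get(domain) {
            Some(&t) if t > now => t,
            _ => now,
        };
        self.next_allowed
            .insert(domain.to_string(), slot + self.min_interval);
        slot.duration_since(now)
    }
}
```

A caller would sleep for the returned duration before issuing the request. Because slots are tracked per domain, a slow limit for one site does not delay requests to another.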

    cargo test

    RUST_LOG=docs_indexer=debug cargo run -- full-sync

- Create source file in src/sources/
- Implement DocumentationSource trait
- Add configuration to config.yaml
- Register in main.rs

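
The steps above can be sketched roughly as follows. The real `DocumentationSource` trait lives in src/sources/ and its methods are not shown in this README, so the trait body here is purely hypothetical, as are `TerraformDocs` and every method name. Treat it as an outline of the shape such an implementation might take.

```rust
/// HYPOTHETICAL trait shape; check src/sources/ for the actual definition.
pub trait DocumentationSource {
    /// Short identifier for the source, e.g. "openai".
    fn name(&self) -> &str;
    /// Root URL the crawler starts from.
    fn base_url(&self) -> &str;
    /// Whether a discovered URL belongs to this source and should be crawled.
    fn should_crawl(&self, url: &str) -> bool;
}

/// Example source for the "terraform-docs" entry used elsewhere in this README.
pub struct TerraformDocs {
    base_url: String,
}

impl TerraformDocs {
    pub fn new(base_url: &str) -> Self {
        // Drop trailing slashes so prefix checks behave the same either way.
        Self {
            base_url: base_url.trim_end_matches('/').to_string(),
        }
    }
}

impl DocumentationSource for TerraformDocs {
    fn name(&self) -> &str {
        "terraform-docs"
    }

    fn base_url(&self) -> &str {
        &self.base_url
    }

    fn should_crawl(&self, url: &str) -> bool {
        // Accept the base URL itself or any path beneath it, but not
        // siblings that merely share the prefix (e.g. ".../terraform-old").
        match url.strip_prefix(self.base_url.as_str()) {
            Some(rest) => rest.is_empty() || rest.starts_with('/'),
            None => false,
        }
    }
}
```

Registering the source in main.rs would then amount to constructing it from the matching config.yaml entry and handing it to the sync loop, in whatever form the existing sources use.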
Key metrics to track:
- Pages indexed per source
- Sync duration
- Failed fetches
- Rate limit hits
- Database errors
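
One way to collect the metrics listed above is a set of per-source counters. This is a sketch under assumptions: `SourceMetrics` and `SyncMetrics` are hypothetical types, and how the values are exported (logs, Prometheus, or the `stats` command) is not covered by this README.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Counters for one source during a sync run (hypothetical struct).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct SourceMetrics {
    pub pages_indexed: u64,
    pub failed_fetches: u64,
    pub rate_limit_hits: u64,
    pub database_errors: u64,
    pub sync_duration: Duration,
}

/// Metrics for a whole run, keyed by source name such as "openai".
#[derive(Debug, Default)]
pub struct SyncMetrics {
    per_source: HashMap<String, SourceMetrics>,
}

impl SyncMetrics {
    /// Mutable access to one source's counters, created on first use.
    pub fn source(&mut self, name: &str) -> &mut SourceMetrics {
        self.per_source.entry(name.to_string()).or_default()
    }

    /// Total pages indexed across all sources.
    pub fn total_pages_indexed(&self) -> u64 {
        self.per_source.values().map(|m| m.pages_indexed).sum()
    }

    /// Sorted names of sources that recorded a failed fetch or database error.
    pub fn sources_with_errors(&self) -> Vec<&str> {
        let mut names: Vec<&str> = self
            .per_source
            .iter()
            .filter(|(_, m)| m.failed_fetches > 0 || m.database_errors > 0)
            .map(|(name, _)| name.as_str())
            .collect();
        names.sort_unstable();
        names
    }
}
```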

Increase the interval between requests by lowering rate_limit_per_second in config:

    rate_limit_per_second: 1  # Slower but safer

Reduce max_pages per source:

    max_pages: 100  # Index fewer pages

Check DATABASE_URL environment variable:

    export DATABASE_URL="postgresql://user:pass@localhost/docs_indexer"

Proprietary - Neoza Labs
Internal project - see CONTRIBUTING.md for guidelines.