Documentation Indexer

A documentation indexing service for AI and operations knowledge sources. It crawls, parses, and indexes documentation from multiple sources, including OpenAI, Anthropic, Kubernetes, and custom documentation sites.

Features

Multiple Documentation Sources:
- OpenAI API documentation
- Anthropic Claude documentation
- Kubernetes official docs
- Custom markdown and HTML sources
Intelligent Crawling:
- Respects robots.txt
- Sitemap support
- Rate limiting per domain
- URL deduplication
- ETag and Last-Modified support
Content Processing:
- HTML and Markdown parsing
- Code block extraction with language detection
- Header hierarchy preservation
- Smart chunking with overlap
- Link preservation
Change Detection:
- Content hash comparison
- ETag support
- Incremental sync capability
Database Tracking:
- PostgreSQL storage
- Source metadata
- Last indexed timestamps
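
As an illustration of "smart chunking with overlap", here is a minimal sketch that splits text into fixed-size windows where each chunk repeats the tail of the previous one. The function name chunk_with_overlap and the character-based sizing are assumptions made for this example; the indexer's own chunker may split on headers or tokens instead.

```rust
/// Split `text` into chunks of at most `chunk_size` characters, where each
/// chunk after the first repeats the last `overlap` characters of the
/// previous one. Sizes are counted in `char`s, not bytes, so multi-byte
/// UTF-8 text is never split in the middle of a character.
/// (Hypothetical helper, not the indexer's actual API.)
fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap; // how far the window advances each time
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        let chunk: String = chars[start..end].iter().collect();
        chunks.push(chunk);
        if end == chars.len() {
            break; // the final chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With a chunk size of 4 and an overlap of 1, the text "abcdefghij" becomes "abcd", "defg", "ghij": each chunk starts with the last character of the one before it, so context that straddles a boundary appears in both chunks.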

Installation

Prerequisites

- Rust 1.70 or later
- PostgreSQL 14 or later
- DATABASE_URL environment variable set

Build

cargo build --release

Configuration

Edit config.yaml to configure sources and settings:

sources:
  openai:
    enabled: true
    base_url: "https://platform.openai.com/docs"
    rate_limit_per_second: 2
    max_pages: 500
    categories:
      - api-reference
      - guides

  anthropic:
    enabled: true
    base_url: "https://docs.anthropic.com"
    rate_limit_per_second: 2
    max_pages: 200

  kubernetes:
    enabled: true
    base_url: "https://kubernetes.io/docs"
    rate_limit_per_second: 1
    max_pages: 1000
    include_sections:
      - concepts
      - tasks
      - reference

database:
  url: "${DATABASE_URL}"
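
The database url above uses a ${DATABASE_URL} placeholder. How the indexer resolves it is not documented here; the following is only a sketch of one way such placeholders could be expanded, with expand_placeholders as a hypothetical helper name. The lookup is passed in as a closure so the function can be used with the process environment or with fixed values.

```rust
/// Replace every `${NAME}` placeholder in `input` using `lookup`.
/// Returns an error naming the first variable that is not set.
/// (Illustrative only; not the indexer's actual config loader.)
fn expand_placeholders<F>(input: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut output = String::new();
    let mut rest = input;

    while let Some(start) = rest.find("${") {
        output.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                let value = lookup(name).ok_or_else(|| format!("{} is not set", name))?;
                output.push_str(&value);
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: keep the remaining text verbatim.
                output.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    output.push_str(rest);
    Ok(output)
}
```

To read from the real environment, the closure can be `|name| std::env::var(name).ok()`. Failing loudly on an unset variable makes a missing DATABASE_URL visible at startup rather than at the first query.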

Usage

Initialize Database

docs-indexer init-db

Full Sync

Sync all enabled sources:

docs-indexer full-sync

Sync Specific Source

docs-indexer sync-source openai
docs-indexer sync-source anthropic
docs-indexer sync-source kubernetes

Incremental Sync

Sync only documents that have changed:

docs-indexer incremental-sync
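
A rough sketch of how change detection for an incremental sync could be decided, combining HTTP conditional requests (If-None-Match, If-Modified-Since) with the stored content hash. The StoredPage struct and both function names are assumptions for illustration; the fields simply mirror the etag, last_modified, and content_hash columns of the schema.

```rust
/// What the indexer remembers about a page from the previous sync.
/// (Hypothetical type; field names mirror the database columns.)
struct StoredPage {
    etag: Option<String>,
    last_modified: Option<String>, // kept as the raw HTTP date string
    content_hash: String,
}

/// Build conditional request headers so the server can answer
/// 304 Not Modified instead of resending an unchanged page.
fn conditional_headers(page: &StoredPage) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &page.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(date) = &page.last_modified {
        headers.push(("If-Modified-Since", date.clone()));
    }
    headers
}

/// Decide whether a page needs re-indexing from the response status and,
/// when a body was returned, the hash of the newly fetched content.
fn needs_reindex(page: &StoredPage, status: u16, new_hash: Option<&str>) -> bool {
    match (status, new_hash) {
        (304, _) => false,                              // server reports no change
        (200, Some(hash)) => hash != page.content_hash, // compare content hashes
        _ => false,                                     // errors: keep the old copy
    }
}
```

The hash comparison acts as a fallback for servers that send no ETag or Last-Modified header, or that return 200 with identical content.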

View Statistics

docs-indexer stats

Add Custom Source

docs-indexer add-custom internal-docs https://docs.company.com --source-type url

Delete Source

docs-indexer delete-source openai

Architecture

Components

Sources (src/sources/):
- DocumentationSource trait
- Source-specific implementations (OpenAI, Anthropic, Kubernetes)
- Custom source support
Crawler (src/crawler/):
- HTTP fetcher with retry logic
- Rate limiting
- Robots.txt checking
- Sitemap parsing
Parser (src/parser/):
- HTML parser
- Markdown parser
- Code block extraction
- Content chunking
Database (src/db/):
- PostgreSQL client
- Schema management
- CRUD operations
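
The crawler's "retry logic" is not specified further in this README. As one common approach, the sketch below retries a fallible operation with exponential backoff; the policy, the function names backoff_delay and retry, and all parameters are assumptions rather than the fetcher's actual behavior.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// capped at `max`. Overflow also falls back to `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}

/// Call `op` until it succeeds or `max_attempts` calls have failed,
/// sleeping between tries. At least one attempt is always made.
fn retry<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a base of 100 ms and a cap of 5 s, the waits grow as 100 ms, 200 ms, 400 ms, and so on until they reach the cap. A production fetcher would typically retry only on transient failures such as timeouts or 5xx responses.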

Database Schema

CREATE TABLE documentation_pages (
    id UUID PRIMARY KEY,
    source VARCHAR(50) NOT NULL,
    url TEXT NOT NULL,
    title TEXT NOT NULL,
    content_hash VARCHAR(64) NOT NULL,
    last_modified TIMESTAMPTZ,
    etag VARCHAR(255),
    last_indexed_at TIMESTAMPTZ NOT NULL,
    metadata JSONB,
    UNIQUE(source, url)
);

Adding Custom Sources

File-based (Markdown)

Add to config.yaml:

custom_sources:
  - name: "internal-runbooks"
    type: "markdown"
    location: "/data/runbooks"
    pattern: "**/*.md"
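
To make the location and pattern fields concrete, this sketch walks a directory tree and collects every file with an md extension, which approximates what the glob **/*.md selects. It uses only the standard library; collect_markdown_files is an illustrative name, and the indexer itself may rely on a glob library with different matching rules.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every file under `root` whose extension is `md`.
/// Results are sorted so the output order is stable across runs.
/// Note: `is_dir` follows symlinks, so a symlink cycle would loop forever.
fn collect_markdown_files(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut pending = vec![root.to_path_buf()];

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path); // visit this subdirectory later
            } else if path.extension().map_or(false, |ext| ext == "md") {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}
```

For the example configuration, the call would be `collect_markdown_files(Path::new("/data/runbooks"))`.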

URL-based

Add to config.yaml:

custom_sources:
  - name: "terraform-docs"
    type: "url"
    base_url: "https://docs.company.com/terraform"

Deployment

Kubernetes CronJob

See helm/docs-indexer/ for a Helm chart that includes:

- CronJob for scheduled syncs
- ConfigMap for configuration
- NetworkPolicy for external access
- Service for health checks

Docker

FROM rust:1.70 as builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/docs-indexer /usr/local/bin/
CMD ["docs-indexer", "full-sync"]

Security Considerations

- Rate Limiting: Respects per-domain rate limits
- Robots.txt: Checks and obeys robots.txt
- User-Agent: Identifies as "docs-indexer/1.0"
- HTML Sanitization: Prevents XSS in parsed content
- URL Validation: Validates URLs before fetching
- Network Policies: Restricts egress in Kubernetes

Development

Run Tests

cargo test

Run with Logging

RUST_LOG=docs_indexer=debug cargo run -- full-sync

Add New Source

- Create a source file in src/sources/
- Implement the DocumentationSource trait
- Add configuration to config.yaml
- Register the source in main.rs
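
The steps above can be sketched as follows. The actual DocumentationSource trait lives in src/sources/ and is not shown in this README, so the trait methods, the TerraformDocs type, and the register function below are all hypothetical stand-ins meant only to show the overall shape.

```rust
use std::collections::HashMap;

/// Hypothetical shape of the trait; the real definition may differ.
trait DocumentationSource {
    /// Name used in config.yaml and on the command line.
    fn name(&self) -> &str;
    /// URLs the crawler should start from.
    fn seed_urls(&self) -> Vec<String>;
    /// Whether a discovered URL belongs to this source.
    fn accepts(&self, url: &str) -> bool;
}

/// Example source for an internal documentation site.
struct TerraformDocs {
    base_url: String,
}

impl DocumentationSource for TerraformDocs {
    fn name(&self) -> &str {
        "terraform-docs"
    }

    fn seed_urls(&self) -> Vec<String> {
        vec![self.base_url.clone()]
    }

    fn accepts(&self, url: &str) -> bool {
        url.starts_with(self.base_url.as_str())
    }
}

/// Registry keyed by source name, standing in for the registration step.
fn register(
    registry: &mut HashMap<String, Box<dyn DocumentationSource>>,
    source: Box<dyn DocumentationSource>,
) {
    registry.insert(source.name().to_string(), source);
}
```

Keying the registry by name would let a command such as sync-source look up the implementation from its argument.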

Monitoring

Key metrics to track:

- Pages indexed per source
- Sync duration
- Failed fetches
- Rate limit hits
- Database errors

Troubleshooting

Rate Limiting Issues

Lower rate_limit_per_second in the config to increase the interval between requests:

rate_limit_per_second: 1  # Slower but safer
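
The setting translates to a minimum gap between requests to the same domain: 2 per second means 500 ms, 1 per second means 1 s. Below is a minimal sketch of a per-domain limiter built on that idea. DomainRateLimiter and reserve are illustrative names, not the indexer's actual types, and the current time is passed in explicitly to keep the logic easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks the earliest time each domain may be contacted again.
/// (Hypothetical type for illustration.)
struct DomainRateLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    fn new(rate_limit_per_second: u32) -> Self {
        assert!(rate_limit_per_second > 0, "rate must be positive");
        Self {
            // 1 second divided by the rate gives the gap between requests.
            min_interval: Duration::from_secs(1) / rate_limit_per_second,
            next_allowed: HashMap::new(),
        }
    }

    /// Returns how long the caller should wait before requesting `domain`
    /// at time `now`, and reserves that slot for the caller.
    fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        let slot = match self.next_allowed.get(domain) {
            Some(&next) if next > now => next, // domain is still cooling down
            _ => now,                          // domain is free right away
        };
        self.next_allowed
            .insert(domain.to_string(), slot + self.min_interval);
        slot - now
    }
}
```

A caller would sleep for the returned duration before sending the request. Because slots are tracked per domain, slowing down one source does not delay the others.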

Memory Issues

Reduce max_pages per source:

max_pages: 100  # Index fewer pages

Database Connection Issues

Check DATABASE_URL environment variable:

export DATABASE_URL="postgresql://user:pass@localhost/docs_indexer"

License

Proprietary - Neoza Labs

Contributing

Internal project - see CONTRIBUTING.md for guidelines.
