neoza-labs/docs-indexer

Documentation Indexer

A documentation indexing service for AI and ops knowledge sources. It crawls, parses, and indexes documentation from multiple sources, including OpenAI, Anthropic, Kubernetes, and custom documentation sites.

Features

  • Multiple Documentation Sources:

    • OpenAI API documentation
    • Anthropic Claude documentation
    • Kubernetes official docs
    • Custom markdown and HTML sources
  • Intelligent Crawling:

    • Respects robots.txt
    • Sitemap support
    • Rate limiting per domain
    • URL deduplication
    • ETag and Last-Modified support
  • Content Processing:

    • HTML and Markdown parsing
    • Code block extraction with language detection
    • Header hierarchy preservation
    • Smart chunking with overlap
    • Link preservation
  • Change Detection:

    • Content hash comparison
    • ETag support
    • Incremental sync capability
  • Database Tracking:

    • PostgreSQL storage
    • Source metadata
    • Last indexed timestamps
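
As an illustration of the "Smart chunking with overlap" feature listed above, here is a minimal sketch of overlap chunking. The function name `chunk_with_overlap` and the character-based sizing are assumptions for illustration only; the actual chunker in src/parser/ may split on headers, tokens, or sentences instead.

```rust
/// Split `text` into chunks of at most `size` characters. Every chunk after
/// the first starts `overlap` characters before the previous chunk ended, so
/// context that straddles a boundary appears in both neighbours.
/// (Hypothetical helper, not the project's real API.)
pub fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<String> {
    assert!(size > 0, "chunk size must be positive");
    assert!(overlap < size, "overlap must be smaller than chunk size");

    // Work on chars so multi-byte UTF-8 sequences are never split.
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks: Vec<String> = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // the last chunk reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With `size = 4` and `overlap = 1`, the input `"abcdefghij"` yields `abcd`, `defg`, `ghij`: each chunk repeats the last character of the one before it.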

Installation

Prerequisites

  • Rust 1.70 or later
  • PostgreSQL 14 or later
  • DATABASE_URL environment variable set

Build

cargo build --release

Configuration

Edit config.yaml to configure sources and settings:

sources:
  openai:
    enabled: true
    base_url: "https://platform.openai.com/docs"
    rate_limit_per_second: 2
    max_pages: 500
    categories:
      - api-reference
      - guides

  anthropic:
    enabled: true
    base_url: "https://docs.anthropic.com"
    rate_limit_per_second: 2
    max_pages: 200

  kubernetes:
    enabled: true
    base_url: "https://kubernetes.io/docs"
    rate_limit_per_second: 1
    max_pages: 1000
    include_sections:
      - concepts
      - tasks
      - reference

database:
  url: "${DATABASE_URL}"
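
The `${DATABASE_URL}` value above suggests that placeholders are resolved from the environment when the config is loaded. How the service actually does this is not shown here; the following is a minimal sketch of one way to expand such placeholders. The function names `expand_placeholders` and `expand_from_env` are hypothetical.

```rust
/// Replace every `${NAME}` placeholder in `input` with the value returned by
/// `lookup`. Returns `Err(name)` for the first placeholder without a value.
/// (Hypothetical helper; the real config loader may behave differently.)
pub fn expand_placeholders<F>(input: &str, lookup: F) -> Result<String, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut out = String::with_capacity(input.len());
    let mut rest = input;

    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        match after.find('}') {
            Some(end) => {
                let name = &after[..end];
                let value = lookup(name).ok_or_else(|| name.to_string())?;
                out.push_str(&value);
                rest = &after[end + 1..];
            }
            None => {
                // Unterminated placeholder: keep the remaining text verbatim.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    Ok(out)
}

/// Convenience wrapper that reads values from the process environment.
pub fn expand_from_env(input: &str) -> Result<String, String> {
    expand_placeholders(input, |name| std::env::var(name).ok())
}
```

Taking the lookup as a closure keeps the expansion testable without touching the real environment. Returning an error for a missing variable, rather than substituting an empty string, surfaces an unset DATABASE_URL at startup.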

Usage

Initialize Database

docs-indexer init-db

Full Sync

Sync all enabled sources:

docs-indexer full-sync

Sync Specific Source

docs-indexer sync-source openai
docs-indexer sync-source anthropic
docs-indexer sync-source kubernetes

Incremental Sync

Only sync changed documents:

docs-indexer incremental-sync
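
To make the change-detection idea concrete, the sketch below shows one way an incremental sync could decide whether a page needs re-indexing, using the stored ETag first and falling back to a content hash. All type and function names here are hypothetical. The hash is FNV-1a, chosen only because it needs no external crate; it is not cryptographic, and the `VARCHAR(64)` `content_hash` column suggests the real service stores a longer digest.

```rust
/// The subset of a stored page record needed for change detection.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, PartialEq)]
pub struct StoredPage {
    pub etag: Option<String>,
    pub content_hash: String,
}

#[derive(Debug, PartialEq)]
pub enum SyncDecision {
    New,       // never indexed before
    Unchanged, // skip re-indexing
    Changed,   // re-parse and update the row
}

/// 64-bit FNV-1a rendered as 16 hex characters. Stand-in for the real digest.
pub fn fnv1a_hex(data: &[u8]) -> String {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for byte in data {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    format!("{:016x}", hash)
}

/// Decide what to do with a freshly fetched page.
pub fn decide(stored: Option<&StoredPage>, remote_etag: Option<&str>, body: &[u8]) -> SyncDecision {
    let Some(page) = stored else {
        return SyncDecision::New;
    };
    // A matching ETag is the cheapest signal that nothing changed.
    if let (Some(old), Some(new)) = (page.etag.as_deref(), remote_etag) {
        if old == new {
            return SyncDecision::Unchanged;
        }
    }
    // Otherwise compare content hashes.
    if page.content_hash == fnv1a_hex(body) {
        SyncDecision::Unchanged
    } else {
        SyncDecision::Changed
    }
}
```

In a real crawler the stored ETag would normally be sent as an `If-None-Match` request header, so that an unchanged page comes back as `304 Not Modified` without a body. The hash comparison covers servers that send no ETag.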

View Statistics

docs-indexer stats

Add Custom Source

docs-indexer add-custom internal-docs https://docs.company.com --source-type url

Delete Source

docs-indexer delete-source openai

Architecture

Components

  1. Sources (src/sources/):

    • DocumentationSource trait
    • Source-specific implementations (OpenAI, Anthropic, Kubernetes)
    • Custom source support
  2. Crawler (src/crawler/):

    • HTTP fetcher with retry logic
    • Rate limiting
    • Robots.txt checking
    • Sitemap parsing
  3. Parser (src/parser/):

    • HTML parser
    • Markdown parser
    • Code block extraction
    • Content chunking
  4. Database (src/db/):

    • PostgreSQL client
    • Schema management
    • CRUD operations
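
The crawler's per-domain rate limiting can be pictured with a small sketch like the one below. `DomainRateLimiter` and its methods are hypothetical names, and the real implementation in src/crawler/ may use a different algorithm, such as a token bucket. The caller passes in the current time, which keeps the logic deterministic and easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Spaces out requests so that each domain sees at most
/// `requests_per_second` fetches. (Hypothetical type for illustration.)
pub struct DomainRateLimiter {
    min_interval: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainRateLimiter {
    pub fn new(requests_per_second: u32) -> Self {
        assert!(requests_per_second > 0, "rate must be positive");
        Self {
            min_interval: Duration::from_secs(1) / requests_per_second,
            next_allowed: HashMap::new(),
        }
    }

    /// Reserve the next free slot for `domain` and return how long the caller
    /// should wait (relative to `now`) before sending the request.
    pub fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        let slot = match self.next_allowed.get(domain) {
            Some(&next) if next > now => next, // domain is still cooling down
            _ => now,                          // free to fetch immediately
        };
        self.next_allowed
            .insert(domain.to_string(), slot + self.min_interval);
        slot - now
    }
}
```

A fetcher would call `reserve(host, Instant::now())`, sleep for the returned duration, then issue the request. With `rate_limit_per_second: 2`, back-to-back calls for the same domain return waits of 0 ms, 500 ms, and 1 s, while other domains are unaffected.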

Database Schema

CREATE TABLE documentation_pages (
    id UUID PRIMARY KEY,
    source VARCHAR(50) NOT NULL,
    url TEXT NOT NULL,
    title TEXT NOT NULL,
    content_hash VARCHAR(64) NOT NULL,
    last_modified TIMESTAMPTZ,
    etag VARCHAR(255),
    last_indexed_at TIMESTAMPTZ NOT NULL,
    metadata JSONB,
    UNIQUE(source, url)
);

Adding Custom Sources

File-based (Markdown)

Add to config.yaml:

custom_sources:
  - name: "internal-runbooks"
    type: "markdown"
    location: "/data/runbooks"
    pattern: "**/*.md"

URL-based

Add to config.yaml:

custom_sources:
  - name: "terraform-docs"
    type: "url"
    base_url: "https://docs.company.com/terraform"

Deployment

Kubernetes CronJob

See helm/docs-indexer/ for a Helm chart that includes:

  • CronJob for scheduled syncs
  • ConfigMap for configuration
  • NetworkPolicy for external access
  • Service for health checks

Docker

FROM rust:1.70 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
COPY --from=builder /app/target/release/docs-indexer /usr/local/bin/
CMD ["docs-indexer", "full-sync"]

Security Considerations

  • Rate Limiting: Throttles requests per domain (rate_limit_per_second)
  • Robots.txt: Checks and obeys robots.txt
  • User-Agent: Identifies as "docs-indexer/1.0"
  • HTML Sanitization: Prevents XSS in parsed content
  • URL Validation: Validates URLs before fetching
  • Network Policies: Restricts egress in Kubernetes
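
The list above does not say what "URL Validation" checks. One plausible reading is a pre-fetch guard against non-HTTP schemes and internal addresses; the sketch below illustrates that idea only. `is_fetchable` is a hypothetical function, and its rules are assumptions rather than the service's actual policy.

```rust
use std::net::IpAddr;

/// Return true if `url` looks safe to fetch: http(s) scheme, a non-empty
/// host, no embedded credentials, and no loopback/private/link-local IP
/// literal. (Hypothetical check; scheme matching is case-sensitive here.)
pub fn is_fetchable(url: &str) -> bool {
    let rest = match url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))
    {
        Some(rest) => rest,
        None => return false,
    };

    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    if authority.is_empty() || authority.contains('@') {
        return false;
    }

    // Separate the host from an optional port; IPv6 literals use brackets.
    let host = if authority.starts_with('[') {
        match authority.find(']') {
            Some(end) => &authority[1..end],
            None => return false,
        }
    } else {
        authority.rsplit_once(':').map_or(authority, |(h, _)| h)
    };
    if host.is_empty() {
        return false;
    }

    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => {
            !(ip.is_loopback() || ip.is_private() || ip.is_link_local() || ip.is_unspecified())
        }
        Ok(IpAddr::V6(ip)) => !(ip.is_loopback() || ip.is_unspecified()),
        Err(_) => !host.eq_ignore_ascii_case("localhost"),
    }
}
```

A string check like this cannot catch hostnames that resolve to internal addresses, so it complements rather than replaces the egress restrictions enforced by the NetworkPolicy.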

Development

Run Tests

cargo test

Run with Logging

RUST_LOG=docs_indexer=debug cargo run -- full-sync

Add New Source

  1. Create source file in src/sources/
  2. Implement DocumentationSource trait
  3. Add configuration to config.yaml
  4. Register in main.rs
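
The steps above might look roughly like the following. The actual `DocumentationSource` trait is defined in src/sources/ and its methods are not shown in this README, so every method, type, and function in this sketch (`name`, `base_url`, `should_index`, `SectionedSource`, `find_source`) is a hypothetical stand-in.

```rust
/// Hypothetical shape of the trait; the real one lives in src/sources/.
pub trait DocumentationSource {
    fn name(&self) -> &str;
    fn base_url(&self) -> &str;
    /// Whether `url` belongs to this source and should be indexed.
    fn should_index(&self, url: &str) -> bool;
}

/// A source that indexes only selected top-level sections under its base
/// URL, similar in spirit to `include_sections` in config.yaml.
pub struct SectionedSource {
    name: String,
    base_url: String,
    include_sections: Vec<String>,
}

impl SectionedSource {
    pub fn new(name: &str, base_url: &str, include_sections: &[&str]) -> Self {
        Self {
            name: name.to_string(),
            base_url: base_url.trim_end_matches('/').to_string(),
            include_sections: include_sections.iter().map(|s| s.to_string()).collect(),
        }
    }
}

impl DocumentationSource for SectionedSource {
    fn name(&self) -> &str {
        &self.name
    }

    fn base_url(&self) -> &str {
        &self.base_url
    }

    fn should_index(&self, url: &str) -> bool {
        // The URL must sit below the base URL, separated by a '/'.
        let Some(path) = url.strip_prefix(self.base_url.as_str()) else {
            return false;
        };
        let Some(path) = path.strip_prefix('/') else {
            return false;
        };
        let section = path.split('/').next().unwrap_or("");
        // An empty include list means "index every section".
        self.include_sections.is_empty() || self.include_sections.iter().any(|s| s == section)
    }
}

/// Look up a registered source by name, as a CLI subcommand might do.
pub fn find_source<'a>(
    sources: &'a [Box<dyn DocumentationSource>],
    name: &str,
) -> Option<&'a dyn DocumentationSource> {
    for source in sources {
        if source.name() == name {
            return Some(&**source);
        }
    }
    None
}
```

Registration in main.rs would then amount to adding the new source to the collection that commands such as `sync-source` search by name.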

Monitoring

Key metrics to track:

  • Pages indexed per source
  • Sync duration
  • Failed fetches
  • Rate limit hits
  • Database errors

Troubleshooting

Rate Limiting Issues

Lower rate_limit_per_second for the affected source so that requests are spaced further apart:

rate_limit_per_second: 1  # Slower but safer

Memory Issues

Reduce max_pages per source:

max_pages: 100  # Index fewer pages

Database Connection Issues

Check DATABASE_URL environment variable:

export DATABASE_URL="postgresql://user:pass@localhost/docs_indexer"

License

Proprietary - Neoza Labs

Contributing

Internal project - see CONTRIBUTING.md for guidelines.
