Core documentation indexing and search functionality for Model Context Protocol (MCP) servers. This package provides powerful tools for crawling, indexing, and searching documentation websites with both keyword and semantic search capabilities.
- Dual Search Providers - FlexSearch for fast keyword search, Vectra for semantic understanding
- Intelligent Content Detection - Automatically detects optimal CSS selectors using heuristics and Mozilla Readability
- Advanced Chunking Strategies - Traditional, semantic, and Late Chunking approaches
- Document-Centric Optimization - Smart result grouping for coherent, context-aware responses
- Flexible Crawling - Supports both recursive crawling and single-page extraction
- MCP Integration - Seamless integration with Model Context Protocol servers
- TypeScript Support - Full type safety with comprehensive TypeScript definitions
pnpm add @agentdesk/mcp-docsimport { createIndex } from "@agentdesk/mcp-docs";
await createIndex({
pages: ["https://docs.example.com"],
selectors: {
links: 'a[href^="/docs"]',
content: "article.prose",
},
outputFile: "docs-index.json",
});import { detectContentSelectors } from "@agentdesk/mcp-docs";
const result = await detectContentSelectors({
url: "https://docs.example.com",
requireLinks: true,
});
console.log("Detected selectors:", result);
// { contentSelector: "main", linkSelector: "a[href^='/docs']", confidence: 0.9 }import { KnowledgeBase, getModuleDir } from "@agentdesk/mcp-docs";
const docs = new KnowledgeBase({
path: getModuleDir(import.meta.url), // Directory containing index
apiKey: process.env.OPENAI_API_KEY, // For Vectra indices
});
const results = await docs.search({
query: "authentication methods",
tokenLimit: 10000,
});The createIndex function orchestrates a sophisticated pipeline to transform live documentation websites into a clean, searchable index:
sequenceDiagram
participant createIndex
participant Heuristics as Content Detection
participant Pipeline as Document Pipeline
participant Crawler as Playwright Crawler
participant Parser as Content Parser
participant Chunker as Chunking Service
participant Provider as Search Provider
participant Embeddings as Embedding Service
createIndex->>Heuristics: Auto-detect selectors
Heuristics-->>createIndex: Optimal CSS selectors
createIndex->>Pipeline: Extract documents
Pipeline->>Crawler: Crawl URLs with selectors
Crawler-->>Pipeline: Raw HTML pages
Pipeline->>Parser: Clean & convert to Markdown
Parser-->>Pipeline: Clean content
Pipeline->>Chunker: Split into semantic chunks
Chunker-->>Pipeline: Document chunks with metadata
Pipeline-->>createIndex: Processed documents
createIndex->>Provider: Initialize search provider
alt Vectra Provider
Provider->>Embeddings: Generate embeddings for chunks
Embeddings-->>Provider: Vector embeddings
Provider->>Provider: Store in vector database
else FlexSearch Provider
Provider->>Provider: Build keyword index
end
Provider-->>createIndex: Index saved to disk
- Content Detection: Uses heuristics to automatically detect optimal CSS selectors for content extraction
- Document Extraction: Crawls pages using Playwright, extracts clean content using Mozilla Readability
- Content Processing: Converts HTML to Markdown and splits into semantic chunks
- Indexing: Creates searchable index using either FlexSearch (keyword) or Vectra (semantic)
- Optimization: Applies document-centric optimization for coherent search results
Algorithm: Inverted index with forward tokenization and contextual matching
Configuration:
{
type: "flexsearch",
indexOptions: {
charset: "latin:default",
tokenize: "forward",
resolution: 9, // Search depth
context: {
resolution: 3, // Context resolution
depth: 1, // Context depth
bidirectional: true, // Bidirectional context
},
},
}How it works:
- Tokenization: Splits text into forward tokens for efficient matching
- Context Building: Creates bidirectional context maps for related terms
- Scoring: Uses position-aware scoring with context weighting
- Storage: Simple JSON format with document content and search metadata
Best for: Fast exact matches, technical documentation, API references
Algorithm: Vector similarity search with Late Chunking and contextual embeddings
Configuration:
{
type: "vectra",
embeddings: {
provider: "openai",
model: "text-embedding-ada-002",
apiKey: process.env.OPENAI_API_KEY,
},
indexOptions: {
metadataFields: ["url", "title"],
metric: "cosine", // Similarity metric
},
chunking: {
strategy: "late-chunking",
useCase: "documentation",
chunkSize: 1024, // Target chunk size
chunkOverlap: 128, // Overlap between chunks
maxChunkSize: 1600, // Maximum chunk size
minChunkSize: 200, // Minimum chunk size
},
}How it works:
- Late Chunking: Processes full document context before chunking to preserve relationships
- Embedding Generation: Creates dense vector representations using OpenAI embeddings
- Vector Storage: Uses Vectra's local vector database for similarity search
- Semantic Matching: Finds conceptually related content even without exact term matches
Best for: Conceptual queries, large knowledge bases, natural language understanding
Our Late Chunking strategy preserves contextual information that traditional chunking loses:
Document: "Authentication can be done via JWT tokens. JWT tokens contain user claims..."
Chunk 1: "Authentication can be done via JWT tokens."
Chunk 2: "JWT tokens contain user claims..."
❌ Problem: Chunk 2 loses context about what JWT tokens are for
1. Process full document context: "Authentication via JWT → User claims → Security..."
2. Generate contextual embeddings with full document awareness
3. Split into chunks while preserving semantic relationships
4. Each chunk retains understanding of its role in the larger context
✅ Result: Each chunk understands its context within the larger document
The ChunkingService implements multiple strategies:
export interface ChunkingConfig {
strategy: "traditional" | "late-chunking" | "semantic" | "sentence";
chunkSize: number; // Target size in tokens
chunkOverlap: number; // Overlap for context preservation
minChunkSize?: number; // Discard smaller chunks
maxChunkSize?: number; // Force split larger chunks
useContextualEmbeddings?: boolean; // Enable Late Chunking
semanticSeparators?: string[]; // Custom separators
}Late Chunking Algorithm:
- Document Analysis: Parse full document structure and identify semantic boundaries
- Context Mapping: Build relationship map between sections
- Embedding Generation: Create embeddings with full document context
- Smart Splitting: Split along semantic boundaries while preserving context
- Overlap Optimization: Calculate optimal overlap to maintain coherence
The SearchOptimizer transforms scattered search results into coherent, context-rich responses:
Raw search results:
- Chunk 1: "JWT authentication requires..." (doc A)
- Chunk 2: "User permissions are..." (doc A)
- Chunk 3: "Token validation steps..." (doc A)
- Chunk 4: "Authentication overview..." (doc B)
export interface OptimizationOptions {
tokenBudget: number; // Available token budget
fullDocumentThreshold: number; // Min chunks for full document
expandedChunkMultiplier: number; // Chunk expansion factor
targetUtilization: number; // Target budget utilization
}Optimization Strategies:
When: 3+ highly relevant chunks from same document Action: Return entire document instead of fragments Benefit: Complete context and natural flow
When: Related chunks clustered in document section Action: Expand to include surrounding context Benefit: Preserves local context and readability
When: Scattered, unrelated results Action: Return individual chunks as fallback Benefit: Ensures most relevant information is included
class SearchOptimizer {
async optimizeResults(rawResults: SearchResult[]): Promise<{
optimizedResults: OptimizedResult[];
stats: OptimizationStats;
}> {
// 1. Group results by document
const documentGroups = this.groupByDocument(rawResults);
// 2. Calculate relevance scores
const scoredGroups = this.calculateRelevanceScores(documentGroups);
// 3. Apply optimization strategy
const strategy = this.selectOptimizationStrategy(scoredGroups);
// 4. Generate optimized results
const optimized = await this.applyStrategy(strategy, scoredGroups);
// 5. Ensure token budget compliance
return this.fitToBudget(optimized);
}
}Creates a searchable index from documentation websites.
Parameters:
pages: Array of URLs or page configurationsselectors: CSS selectors for content and link extractionoutputFile: Path where index will be savedprovider: Search provider configuration (FlexSearch or Vectra)crawler: Crawler options (concurrency, delays, retries)content: Content filtering (include/exclude patterns)
Example:
await createIndex({
pages: [
{
url: "https://docs.example.com",
mode: "crawl",
selectors: {
links: 'a[href^="/docs"]',
content: "main article",
},
},
{
url: "https://api.example.com/reference",
mode: "single-page",
waitForSelector: ".api-loaded",
},
],
provider: {
type: "vectra",
embeddings: {
provider: "openai",
model: "text-embedding-ada-002",
apiKey: process.env.OPENAI_API_KEY,
},
chunking: {
strategy: "late-chunking",
chunkSize: 1024,
chunkOverlap: 128,
},
},
outputFile: "docs-vectra-index",
crawler: {
maxRequestsPerCrawl: 500,
maxConcurrency: 8,
sameDomainDelaySecs: 1,
},
content: {
excludePatterns: ["/admin", "/_next"],
includePatterns: ["/docs/", "/guides/"],
},
});Automatically detects optimal CSS selectors for content extraction.
Parameters:
url: Website URL to analyzehtml: Optional HTML content (if not provided, fetches from URL)requireLinks: Whether to detect link selectors for crawling
Returns:
{
contentSelector: string; // Best content selector
linkSelector?: string; // Best link selector (if requireLinks=true)
confidence: number; // Confidence score (0-1)
fallbacks: string[]; // Alternative selectors
}Algorithm:
- Mozilla Readability: Attempts content extraction using battle-tested algorithm
- Semantic Analysis: Looks for semantic HTML5 elements (
<main>,<article>) - Pattern Recognition: Recognizes common documentation site patterns
- Validation: Tests selectors against actual content
- Confidence Scoring: Rates reliability based on content quality and structure
Provides search capabilities over indexed documentation.
Constructor:
new KnowledgeBase({
path: string; // Directory containing index files
apiKey?: string; // OpenAI API key (required for Vectra)
optimization?: { // Search optimization options
tokenBudget: number;
fullDocumentThreshold: number;
targetUtilization: number;
};
})Methods:
Searches documentation with document-centric optimization.
Parameters:
{
query: string; // Search query
tokenLimit?: number; // Maximum tokens to return (default: 10000)
}Returns: Formatted string containing optimized search results with:
- Result summaries and metadata
- Document-centric optimized content
- Optimization statistics and insights
- Token utilization information
Gets directory path for ES modules (replacement for __dirname).
import { getModuleDir } from "@agentdesk/mcp-docs";
const currentDir = getModuleDir(import.meta.url);Parses command-line arguments to extract configuration.
import { getConfig } from "@agentdesk/mcp-docs";
const { apiKey } = getConfig(process.argv);{
pages: [
{
url: "https://docs.example.com",
mode: "crawl",
selectors: {
links: 'nav a[href^="/docs"]',
content: "main article",
},
crawler: {
maxConcurrency: 2, // Override global setting
sameDomainDelaySecs: 2, // Slower crawling for this site
},
content: {
excludePatterns: ["/changelog"], // Site-specific exclusions
turndownOptions: { // Custom Markdown conversion
headingStyle: "atx",
codeBlockStyle: "fenced",
},
},
},
],
}{
content: {
excludePatterns: ["/admin", "/internal", "/_next"],
includePatterns: ["/docs/", "/guides/", "/api/"],
turndownOptions: {
headingStyle: "atx", // # Heading style
codeBlockStyle: "fenced", // ```code``` style
linkStyle: "inlined", // [text](url) style
emDelimiter: "*", // *emphasis* style
},
},
}{
crawler: {
maxRequestsPerCrawl: 500, // Total page limit
maxConcurrency: 8, // Concurrent requests
minConcurrency: 1, // Minimum concurrency
maxRequestRetries: 3, // Retry failed requests
maxRequestsPerMinute: 120, // Rate limiting
navigationTimeoutSecs: 30, // Page load timeout
requestHandlerTimeoutSecs: 60, // Handler timeout
sameDomainDelaySecs: 1, // Delay between same-domain requests
headless: true, // Run browser headlessly
retryOnBlocked: true, // Retry if blocked
},
}Comprehensive TypeScript definitions for all interfaces:
// Core configuration types
export type IndexerConfig = z.infer<typeof IndexerConfigSchema>;
export type PageConfig = z.infer<typeof PageConfigSchema>;
export type SelectorConfig = z.infer<typeof SelectorSchema>;
export type CrawlerConfig = z.infer<typeof CrawlerConfigSchema>;
export type ContentConfig = z.infer<typeof ContentConfigSchema>;
// Search provider types
export interface ISearchProvider {
readonly type: string;
initialize(config: ProviderConfig): Promise<void>;
createIndex(documents: Doc[], outputPath: string): Promise<void>;
loadIndex(indexPath: string): Promise<ISearchIndex>;
getDefaultConfig(): ProviderConfig;
}
export interface ISearchIndex {
search(options: SearchOptions): Promise<SearchResult[]>;
getStats(): Promise<IndexStats>;
}
// Search optimization types
export interface OptimizedResult {
url: string;
content: string;
type: "full_document" | "expanded_chunk" | "chunk";
relevanceScore: number;
tokenCount: number;
chunksFound: number;
}
export interface OptimizationStats {
strategy: string;
originalTokens: number;
optimizedTokens: number;
utilization: number;
documentsProcessed: number;
}- Crawlee - Robust web scraping framework with Playwright
- FlexSearch - Fast full-text search engine
- Vectra - Local vector database for embeddings
- Mozilla Readability - Content extraction algorithm
- Playwright - Browser automation for dynamic content
- Turndown - HTML to Markdown conversion
- JSDOM - DOM manipulation for content processing
- Zod - Runtime type validation
- Index Size: ~10-20% of original content
- Search Speed: <10ms for most queries
- Memory Usage: Low, suitable for serverless
- Best Use: <10k pages, exact term matching
- Index Size: ~50-100% of original content (includes vectors)
- Search Speed: ~50-200ms depending on corpus size
- Memory Usage: Higher due to vector operations
- Best Use: >1k pages, semantic understanding
- Token Efficiency: 70-90% improvement in relevant content density
- Context Preservation: 95%+ of document relationships maintained
- Response Coherence: 3-5x improvement in readability scores
- Store OpenAI API keys in environment variables
- Use minimal permission API keys when possible
- Implement rate limiting for production deployments
- Use
excludePatternsto prevent indexing sensitive content - Validate all CSS selectors to prevent injection
- Sanitize all extracted content before processing
- Configure appropriate request timeouts
- Implement retry limits to prevent infinite loops
- Use headless browsing for security isolation
This package is part of the AgentDesk MCP documentation system. Contributions welcome!
git clone https://github.com/agentdesk/create-mcp-docs
cd create-mcp-docs/packages/mcp-docs
pnpm install
pnpm devpnpm test # Run all tests
pnpm test:unit # Unit tests only
pnpm test:integration # Integration tests onlycreate-mcp-docs- CLI tool for generating MCP documentation servers@modelcontextprotocol/sdk- MCP SDK for TypeScript
MIT - See LICENSE file for details.