An MCP server that gives an AI agent direct search and retrieval access to a library of PDF textbooks. No vector database, no embeddings, no preprocessing — just raw PDFs and intelligent search.
The agent uses query expansion (generating multiple synonym/related search phrases) to overcome keyword search limitations, then retrieves and synthesizes relevant passages with proper citations.
User Question
      │
      ▼
┌─────────────┐
│  AI Agent   │  ← Generates ~5 search phrases from user's question
└─────┬───────┘
      │  MCP tool calls
      ▼
┌─────────────────────┐
│  MCP Server         │
│                     │
│  Tools:             │
│  - search_books()   │
│  - list_books()     │
│  - list_chapters()  │
│  - get_chapter()    │
│  - get_section()    │
│  - get_page_range() │
│                     │
│  Data: /books/*.pdf │
└─────────────────────┘
- `list_books()`: Returns all available PDF filenames/titles.
- `list_chapters()`: Returns the chapter titles and page ranges for a given book (extracted from PDF bookmarks/TOC).
- `search_books()`: Keyword search across all PDFs using multiple queries at once. Returns snippets (a few sentences of surrounding context) with book name, page number, and location for each hit. This is the primary entry point: the agent generates several synonym/related phrases to cast a wide net.
- `get_chapter()`: Returns the full text of a chapter. Use when the agent needs broad context on a topic.
- `get_section()`: Returns a specific section within a chapter. Use when the agent knows exactly where to look.
- `get_page_range()`: Returns the text of a specific page range. Use for grabbing context around a search hit.
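As a rough illustration of the multi-query keyword search described above, the sketch below runs several phrases over page text that has already been extracted. The `Hit` dataclass, the `search_pages` name, and the `context_chars` parameter are assumptions made for illustration, not the project's actual API; PDF extraction is left out so the matching logic stands alone.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    book: str     # PDF filename the match came from
    page: int     # 1-based page number
    query: str    # the search phrase that matched
    snippet: str  # surrounding context, whitespace-normalized


def search_pages(pages_by_book, queries, context_chars=200):
    """Case-insensitive keyword search over already-extracted page text.

    pages_by_book maps a book name to a list of page strings,
    where index 0 holds the text of page 1 (hypothetical layout).
    """
    hits = []
    for book, pages in pages_by_book.items():
        for page_number, text in enumerate(pages, start=1):
            for query in queries:
                pattern = re.compile(re.escape(query), re.IGNORECASE)
                for match in pattern.finditer(text):
                    # Take a fixed window of characters around the match.
                    start = max(0, match.start() - context_chars)
                    end = min(len(text), match.end() + context_chars)
                    snippet = " ".join(text[start:end].split())
                    hits.append(Hit(book, page_number, query, snippet))
    return hits


# Example with a tiny in-memory "library" (made-up text).
library = {
    "deep_learning.pdf": [
        "Introductory material.",
        "Backpropagation applies the chain rule to compute gradients.",
    ]
}
example_hits = search_pages(library, ["backpropagation", "chain rule"])
```

A character window is used here instead of sentence boundaries because it needs no language-specific splitting; a real implementation might prefer whole sentences.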
- User asks a question
- Query expansion: Agent generates ~5 search terms (synonyms, related phrases, alternate terminology) to handle vocabulary mismatch
- Search: Agent calls `search_books()` with all terms
- Drill down: Based on snippet results, agent uses `get_section()` or `get_page_range()` to pull full context where needed
- Synthesize: Agent combines the retrieved text into an answer
- Cite: Agent quotes relevant passages with book name and page number
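The drill-down and cite steps can be sketched with two small helpers: one picks a page range around a search hit, the other formats a quoted passage with its source. Both function names, the `radius` default, and the citation layout are hypothetical choices for this example rather than anything the project prescribes.

```python
def page_window(hit_page, page_count, radius=1):
    """Return an inclusive 1-based (start, end) page range around a hit,
    clamped to the bounds of the book."""
    start = max(1, hit_page - radius)
    end = min(page_count, hit_page + radius)
    return start, end


def format_citation(book, page, quote, max_chars=300):
    """Format a quoted passage with book name and page number."""
    quote = " ".join(quote.split())  # collapse line breaks from PDF text
    if len(quote) > max_chars:
        quote = quote[:max_chars].rstrip() + "..."
    return f'"{quote}" ({book}, p. {page})'


# A hit on page 204 of a 500-page book: fetch pages 203 to 205.
window = page_window(204, 500)
citation = format_citation(
    "deep_learning.pdf", 204, "The  chain rule\napplies."
)
```

The window returned by `page_window` would be passed to a tool such as `get_page_range()` to pull the surrounding text.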
User: "How does backpropagation work?"
Agent generates queries:
- "backpropagation"
- "backward pass gradient"
- "chain rule neural network"
- "error propagation"
- "gradient computation layers"
Agent searches → finds hits → reads relevant sections → answers with quotes and page citations.
| Decision | Rationale |
|---|---|
| No vector database | Eliminates infrastructure, embedding costs, and preprocessing. PDF search is computationally cheap. |
| No pre-generated summaries | Book titles are sufficient for the agent to know where to search. Keeps the system zero-config. |
| Query expansion over embeddings | Multiple keyword searches cover synonym/terminology gaps without needing semantic similarity. |
| Hierarchical retrieval tools | Mirrors how textbooks are organized (book → chapter → section → page). Lets the agent choose the right granularity. |
| MCP protocol | Any MCP-compatible client (Claude Desktop, custom apps, Claude Code) can connect without custom UI. |
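To make the hierarchical retrieval row concrete, here is a minimal sketch of deriving chapter page ranges from a PDF outline. It assumes the outline is a list of `[level, title, page]` entries with 1-based pages, which is the shape PyMuPDF's `Document.get_toc()` returns; the `chapter_ranges` function and its output keys are hypothetical.

```python
def chapter_ranges(toc, page_count):
    """Turn outline entries into chapter page ranges.

    toc: list of [level, title, page] entries (1-based pages).
    Only level-1 entries are treated as chapters; each chapter ends
    on the page before the next chapter starts.
    """
    chapters = [(entry[1], entry[2]) for entry in toc if entry[0] == 1]
    ranges = []
    for index, (title, start) in enumerate(chapters):
        if index + 1 < len(chapters):
            end = chapters[index + 1][1] - 1
        else:
            end = page_count  # last chapter runs to the end of the book
        # Guard against two chapters that start on the same page.
        ranges.append(
            {"title": title, "start_page": start, "end_page": max(start, end)}
        )
    return ranges


# Made-up outline for a 40-page book.
example_toc = [
    [1, "Introduction", 1],
    [2, "Notation", 3],
    [1, "Linear Algebra", 10],
    [1, "Probability", 25],
]
example_ranges = chapter_ranges(example_toc, 40)
```

PDFs without bookmarks yield an empty outline, in which case this returns an empty list and the agent would have to fall back to search and page ranges.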
- MCP SDK: Python (`mcp`) or TypeScript (`@modelcontextprotocol/sdk`)
- PDF extraction: PyMuPDF (`fitz`) for text extraction and keyword search
- No other dependencies: no database, no embedding model, no preprocessing pipeline
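For the PDF extraction piece, a page-range reader built on PyMuPDF might look roughly like the following. The function names, the return shape, and the choice to clamp the end page to the document length are assumptions for this sketch; only `fitz.open`, `page_count`, `load_page`, and `get_text` are PyMuPDF calls.

```python
def validate_page_range(start_page, end_page):
    """Reject ranges that cannot refer to real pages (1-based, inclusive)."""
    if start_page < 1 or end_page < start_page:
        raise ValueError(
            f"invalid page range: {start_page}-{end_page}"
        )


def get_page_range(pdf_path, start_page, end_page):
    """Return the text of pages start_page..end_page as a list of dicts."""
    validate_page_range(start_page, end_page)

    # PyMuPDF is imported here so the validation above works even
    # where the library is not installed (a choice made for this sketch).
    import fitz

    with fitz.open(pdf_path) as doc:
        last_page = min(end_page, doc.page_count)  # clamp to the book
        return [
            # load_page() takes a 0-based index, so subtract 1.
            {"page": number, "text": doc.load_page(number - 1).get_text()}
            for number in range(start_page, last_page + 1)
        ]
```

Calling `get_page_range("books/deep_learning.pdf", 203, 205)` would return up to three entries, one per page, each tagged with its page number so the agent can cite it.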
second-brain/
├── server.py # MCP server with tool definitions
├── pdf_tools.py # PDF search and extraction logic
├── books/ # Drop PDF textbooks here
│ ├── deep_learning.pdf
│ ├── linear_algebra.pdf
│ └── ...
├── requirements.txt # pymupdf, mcp
└── README.md
- Drop PDFs into `books/`
- Run the MCP server
- Connect any MCP client
- Ask questions
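A minimal `server.py` along these lines could expose the first tool using the `FastMCP` helper from the Python MCP SDK. This is a sketch under assumptions: the `build_server` and `list_book_files` names, the `second-brain` server name, and the `books` default directory are illustrative, and only `list_books` is wired up.

```python
from pathlib import Path

BOOKS_DIR = Path("books")  # assumed location, matching the layout above


def list_book_files(books_dir=BOOKS_DIR):
    """Return the PDF filenames in books_dir, sorted alphabetically."""
    return sorted(path.name for path in Path(books_dir).glob("*.pdf"))


def build_server(books_dir=BOOKS_DIR):
    """Create an MCP server exposing list_books as a tool."""
    # Imported here so the helper above can be used without the SDK.
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("second-brain")

    @server.tool()
    def list_books() -> list[str]:
        """Return all available PDF filenames."""
        return list_book_files(books_dir)

    return server


# To serve over stdio (the SDK's default transport), call:
#     build_server().run()
```

The remaining tools would be registered the same way, each as a decorated function that delegates to the logic in `pdf_tools.py`.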