MiguelMasc/PDF-RAG

Second Brain — PDF Textbook Search via MCP Server

Overview

An MCP server that gives an AI agent direct search and retrieval access to a library of PDF textbooks. It needs no vector database, no embeddings, and no preprocessing: keyword search runs directly against the raw PDFs.

The agent uses query expansion (generating multiple synonym/related search phrases) to overcome keyword search limitations, then retrieves and synthesizes relevant passages with proper citations.


Architecture

User Question
      │
      ▼
┌─────────────┐
│   AI Agent  │  ← Generates ~5 search phrases from user's question
└─────┬───────┘
      │ MCP tool calls
      ▼
┌─────────────────────┐
│   MCP Server        │
│                     │
│  Tools:             │
│  - search_books()   │
│  - list_books()     │
│  - list_chapters()  │
│  - get_chapter()    │
│  - get_section()    │
│  - get_page_range() │
│                     │
│  Data: /books/*.pdf │
└─────────────────────┘

MCP Tools

list_books()

Returns all available PDF filenames/titles.

list_chapters(book: string)

Returns the chapter titles and page ranges for a given book (extracted from PDF bookmarks/TOC).
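A minimal sketch of how chapter page ranges could be derived from PDF bookmarks, assuming PyMuPDF's `get_toc()` output of `[level, title, page]` entries with 1-based pages. The helper names `chapters_from_toc` and the returned dictionary keys are illustrative, not taken from the repository. A PDF without bookmarks yields an empty list here, so a real implementation would probably need a fallback.

```python
from __future__ import annotations


def chapters_from_toc(toc: list[list], page_count: int) -> list[dict]:
    """Turn [level, title, page] TOC entries into chapters with inclusive 1-based page ranges."""
    # Assumption: level-1 bookmarks correspond to chapters.
    top = [(title, page) for level, title, page, *_ in toc if level == 1]
    chapters = []
    for i, (title, start) in enumerate(top):
        # A chapter ends on the page before the next chapter starts;
        # the last chapter runs to the end of the document.
        end = top[i + 1][1] - 1 if i + 1 < len(top) else page_count
        chapters.append({"title": title, "start_page": start, "end_page": max(start, end)})
    return chapters


def list_chapters(pdf_path: str) -> list[dict]:
    """Read the bookmarks of one PDF and return its chapters."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependencies

    with fitz.open(pdf_path) as doc:
        return chapters_from_toc(doc.get_toc(), doc.page_count)
```

Keeping the range arithmetic separate from the PDF access makes it easy to check without a real book on disk.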

search_books(queries: string[])

Keyword search across all PDFs using multiple queries simultaneously. Returns snippets (a few sentences of surrounding context) with book name, page number, and location for each hit. This is the primary entry point — the agent generates several synonym/related phrases to cast a wide net.
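The search step might look roughly like the sketch below. It assumes case-insensitive exact-phrase matching and a fixed character window for the snippet; the README does not specify either, and the function names and hit fields are hypothetical.

```python
from __future__ import annotations

import re


def find_snippets(page_text: str, query: str, context: int = 200) -> list[str]:
    """Return a window of up to `context` characters on each side of every match."""
    flat = " ".join(page_text.split())  # collapse line breaks left by PDF extraction
    snippets = []
    for match in re.finditer(re.escape(query), flat, flags=re.IGNORECASE):
        start = max(0, match.start() - context)
        end = min(len(flat), match.end() + context)
        snippets.append(flat[start:end])
    return snippets


def search_pages(book: str, pages: list[str], queries: list[str]) -> list[dict]:
    """Search one book's page texts (index 0 is page 1) for every query."""
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for query in queries:
            for snippet in find_snippets(text, query):
                hits.append({"book": book, "page": page_number,
                             "query": query, "snippet": snippet})
    return hits


def load_pages(pdf_path: str) -> list[str]:
    """Extract plain text per page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A `search_books()` tool could then call `load_pages()` for each file in `books/` and concatenate the results of `search_pages()`.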

get_chapter(book: string, chapter: string | int)

Returns the full text of a chapter. Use when the agent needs broad context on a topic.
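Because `chapter` accepts either a string or an integer, the server has to resolve it to one chapter entry first. The sketch below assumes an integer is a 1-based position and a string is a case-insensitive title fragment; both conventions are guesses, as is the `resolve_chapter` name.

```python
from __future__ import annotations


def resolve_chapter(chapters: list[dict], chapter: str | int) -> dict:
    """Find a chapter by 1-based position (int) or by title fragment (str)."""
    if isinstance(chapter, int):
        if 1 <= chapter <= len(chapters):
            return chapters[chapter - 1]
        raise LookupError(f"chapter {chapter} is out of range")
    wanted = chapter.casefold().strip()
    matches = [c for c in chapters if wanted in c["title"].casefold()]
    if len(matches) != 1:
        # Refuse to guess when the fragment matches nothing or several chapters.
        raise LookupError(f"{len(matches)} chapters match {chapter!r}")
    return matches[0]
```

Raising on ambiguous titles lets the agent retry with a more specific name instead of silently receiving the wrong chapter.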

get_section(book: string, chapter: string | int, section: string | int)

Returns a specific section within a chapter. Use when the agent knows exactly where to look.

get_page_range(book: string, start_page: int, end_page: int)

Returns the text of a specific page range. Use for grabbing context around a search hit.
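A possible shape for the page-range tool, assuming page numbers are 1-based and inclusive (the README does not say) and that out-of-range requests are clamped rather than rejected. `page_indices` is a hypothetical helper.

```python
def page_indices(start_page: int, end_page: int, page_count: int) -> range:
    """Map an inclusive 1-based page range to 0-based indices, clamped to the document."""
    if start_page > end_page:
        raise ValueError("start_page must not exceed end_page")
    first = max(1, start_page)
    last = min(page_count, end_page)
    return range(first - 1, last)  # empty when the range lies outside the document


def get_page_range(pdf_path: str, start_page: int, end_page: int) -> str:
    """Return the text of the requested pages, each prefixed with its page number."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        parts = [f"[page {i + 1}]\n{doc[i].get_text()}"
                 for i in page_indices(start_page, end_page, doc.page_count)]
    return "\n\n".join(parts)
```

Prefixing each page with its number keeps the page citation recoverable after the text has been concatenated.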


Agent Workflow

  1. User asks a question
  2. Query expansion — Agent generates ~5 search terms (synonyms, related phrases, alternate terminology) to handle vocabulary mismatch
  3. Search — Agent calls search_books() with all terms
  4. Drill down — Based on snippet results, agent uses get_section() or get_page_range() to pull full context where needed
  5. Synthesize — Agent combines the retrieved text into an answer
  6. Cite — Agent quotes relevant passages with book name and page number
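For the drill-down step, one way to turn search hits into `get_page_range()` calls is to widen each hit page by a small margin and merge overlapping ranges per book. This is an illustrative heuristic, not something the README prescribes; the hit fields `book` and `page` and the `margin` parameter are assumptions.

```python
from __future__ import annotations

from collections import defaultdict


def drill_down_ranges(hits: list[dict], margin: int = 1) -> dict[str, list[tuple[int, int]]]:
    """Group hit pages per book and merge them into inclusive (start, end) page ranges."""
    pages_by_book: dict[str, set[int]] = defaultdict(set)
    for hit in hits:
        pages_by_book[hit["book"]].add(hit["page"])

    ranges: dict[str, list[tuple[int, int]]] = {}
    for book, pages in pages_by_book.items():
        merged: list[tuple[int, int]] = []
        for page in sorted(pages):
            start, end = max(1, page - margin), page + margin
            if merged and start <= merged[-1][1] + 1:
                # Overlapping or adjacent to the previous range: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        ranges[book] = merged
    return ranges
```

Merging avoids fetching the same pages several times when multiple queries hit neighbouring pages.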

Example

User: "How does backpropagation work?"

Agent generates queries:

  • "backpropagation"
  • "backward pass gradient"
  • "chain rule neural network"
  • "error propagation"
  • "gradient computation layers"

Agent searches → finds hits → reads relevant sections → answers with quotes and page citations.


Design Decisions

Decision | Rationale
--- | ---
No vector database | Eliminates infrastructure, embedding costs, and preprocessing. PDF search is computationally cheap.
No pre-generated summaries | Book titles are sufficient for the agent to know where to search. Keeps the system zero-config.
Query expansion over embeddings | Multiple keyword searches cover synonym/terminology gaps without needing semantic similarity.
Hierarchical retrieval tools | Mirrors how textbooks are organized (book → chapter → section → page). Lets the agent choose the right granularity.
MCP protocol | Any MCP-compatible client (Claude Desktop, custom apps, Claude Code) can connect without custom UI.

Tech Stack

  • MCP SDK — Python (mcp) or TypeScript (@modelcontextprotocol/sdk)
  • PDF extraction — PyMuPDF (fitz) for text extraction and keyword search
  • No other dependencies — no database, no embedding model, no preprocessing pipeline
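As a rough illustration of how the Python stack fits together, the sketch below registers one tool with `FastMCP` from the `mcp` package. The server name `"second-brain"`, the `TOOLS` list, and `build_server` are assumptions made for this example, and only `list_books` is shown; the other tools would be registered the same way.

```python
from __future__ import annotations

from pathlib import Path

BOOKS_DIR = Path("books")  # assumed to match the books/ folder of the project


def list_books() -> list[str]:
    """Return the filenames of all PDFs available for search."""
    return sorted(p.name for p in BOOKS_DIR.glob("*.pdf"))


# Plain functions to expose as MCP tools.
TOOLS = [list_books]


def build_server():
    """Create the MCP server and register every function in TOOLS."""
    from mcp.server.fastmcp import FastMCP  # Python MCP SDK, imported lazily

    server = FastMCP("second-brain")
    for fn in TOOLS:
        # FastMCP derives the tool name, input schema, and description
        # from the function's name, type hints, and docstring.
        server.tool()(fn)
    return server
```

Calling `build_server().run()` would then start the server on the default stdio transport, which is what desktop MCP clients typically launch.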

File Structure

second-brain/
├── server.py          # MCP server with tool definitions
├── pdf_tools.py       # PDF search and extraction logic
├── books/             # Drop PDF textbooks here
│   ├── deep_learning.pdf
│   ├── linear_algebra.pdf
│   └── ...
├── requirements.txt   # pymupdf, mcp
└── README.md

Getting Started

  1. Drop PDFs into books/
  2. Run the MCP server
  3. Connect any MCP client
  4. Ask questions
