adityadmore2000/Omni-Doc

🚀 Omni-Doc: High-Density PDF Explorer

Omni-Doc is a robust local RAG (Retrieval-Augmented Generation) system engineered to handle massive, high-complexity technical documentation.

While standard AI loaders often fail on large-scale PDFs (3,000+ pages) due to dense tables, JSON schemas, and non-standard formatting, Omni-Doc uses a "Mechanical Split" architecture to guarantee 100% stability.


🛠️ Tech Stack

  • Orchestration: LangGraph (State-machine reasoning)
  • Data Framework: LlamaIndex (Data indexing & retrieval)
  • LLM: Gemma 3:4B via Ollama
  • Embeddings: Nomic-Embed-Text via Ollama
  • Schema: Pydantic for structured, cited responses
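Pydantic's job in this stack is to force the model's output into a structured, citable shape rather than free text. A minimal sketch of what such a schema could look like (the field names here are illustrative assumptions, not the project's actual schema):

```python
from pydantic import BaseModel, Field


class CitedAnswer(BaseModel):
    """Illustrative response schema: an answer plus the pages that back it."""

    answer: str = Field(description="The answer, grounded in the documentation")
    source_pages: list[int] = Field(
        description="Page numbers in the PDF that support the answer"
    )


# Hypothetical example of a validated, structured response
resp = CitedAnswer(answer="Use the /v2/orders endpoint.", source_pages=[412, 413])
print(resp.model_dump())
```

Because the schema is validated, a response that lacks citations simply fails parsing, which is what makes the "prove it exists in the docs" workflow enforceable.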

🏗️ The Engineering Edge

Omni-Doc is built to bypass the hard constraints of local embedding servers:

  1. Deterministic Splitting: We ignore unreliable "semantic" breaks in favor of strict, fixed-length character chunks. This ensures no single payload ever exceeds the API context limit.
  2. Purified Nodes: By stripping all metadata during the embedding phase, we maximize the available token space for actual content.
  3. Stateful Verification: Using a graph-based workflow, the agent must prove its answers exist within the documentation before responding.
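The "Mechanical Split" in step 1 can be sketched in a few lines. This is a hypothetical stand-alone helper, not the project's actual code: strict fixed-length character windows with a small overlap, so no single payload can ever exceed the embedding server's context limit.

```python
def mechanical_split(text: str, chunk_size: int = 1024, overlap: int = 64) -> list[str]:
    """Split text into deterministic, fixed-length character chunks.

    No chunk can exceed chunk_size, regardless of tables, JSON schemas,
    or other formatting in the source text.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


chunks = mechanical_split("x" * 3000, chunk_size=1024, overlap=64)
print(len(chunks), max(len(c) for c in chunks))  # → 4 1024
```

The trade-off is deliberate: chunks may cut mid-sentence, but the upper bound on payload size is guaranteed, which is what keeps a 3,000+ page ingest from crashing the local embedding server.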

🚀 Getting Started

1. Prerequisites

Ensure Ollama is running (start its Docker container if you run it that way) and pull the required models:

# Start the Ollama container (only if running Ollama via Docker)
sudo docker container start {name of container}

# Pull the LLM and the embedding model
ollama pull gemma3:4b
ollama pull nomic-embed-text
