Multi-Agent RAG: DocSight

DocSight is a multi-agent Retrieval-Augmented Generation (RAG) system built to answer complex, natural language questions across two highly specialized domains: Financial and Medical.

Instead of relying on a single general-purpose language model, the system routes each user query to one or both domain-specific expert agents. Each expert uses hybrid retrieval (BM25 combined with dense FAISS vector search) to fetch context and, on the local Ollama backend, a custom fine-tuned LLM. A central synthesizer then merges the expert answers and appends citations so that every claim can be traced to its source.

Datasets

The system is powered by two distinct, real-world data sources to ground the models' responses in factual literature:

Financial Dataset (SEC EDGAR)
- Content: Comprehensive Annual 10-K reports.
- Scope: Ten large S&P 500 companies (AAPL, MSFT, GOOG, AMZN, JPM, GS, JNJ, PFE, META, TSLA), covering fiscal years 2022 to 2024.
- Purpose: Allows the Finance Agent to accurately answer queries about company revenue, risk factors, R&D expenditures, and financial performance.

Medical Dataset (PubMed/Entrez)
- Content: Peer-reviewed medical abstracts.
- Scope: Up to 8,000 abstracts retrieved from targeted queries involving Type 2 diabetes treatment clinical trials, cardiovascular disease risk factors, lung cancer immunotherapy outcomes, and antihypertensive therapies.
- Purpose: Lets the Medical Agent retrieve and cite peer-reviewed evidence on drug mechanisms, clinical outcomes, and first-line treatments, which reduces the risk of hallucinated medical facts.

AI Models Used

The architecture is highly modular, supporting cloud inference and fully local execution.

Retrieval Models (Embeddings)

Finance: BAAI/bge-large-en-v1.5 (1024-dimensional space; a general-purpose English embedding model, used here for financial text).
Medical: NeuML/pubmedbert-base-embeddings (768-dimensional space, purpose-built for PubMed medical literature).

Generation Models (LLM Backends)

You can switch between three LLM backends using the LLM_BACKEND variable in your .env file. Per-agent overrides (ROUTER_BACKEND, EXPERT_BACKEND, SYNTH_BACKEND, WEB_BACKEND) are available, but the default setup keeps everything on Groq.
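
The override behaviour described above can be sketched as follows. This is a minimal illustration, not the project's actual config code: the `resolve_backend` helper, the role names, the fallback from a per-agent variable to `LLM_BACKEND`, and the `groq` default are all assumptions made for the example.

```python
import os

# Backend identifiers that appear in the example .env values.
VALID_BACKENDS = {"groq", "ollama", "hf"}

# Hypothetical mapping from an agent role to its override variable.
OVERRIDE_VARS = {
    "router": "ROUTER_BACKEND",
    "expert": "EXPERT_BACKEND",
    "synth": "SYNTH_BACKEND",
    "web": "WEB_BACKEND",
}


def resolve_backend(role, env=None):
    """Return the backend for a role, preferring its per-agent override."""
    env = os.environ if env is None else env
    # Assumption: fall back to Groq when LLM_BACKEND is not set.
    default = env.get("LLM_BACKEND", "groq").strip().lower()
    override = env.get(OVERRIDE_VARS[role], "").strip().lower()
    backend = override or default
    if backend not in VALID_BACKENDS:
        raise ValueError(f"Unknown backend {backend!r} for role {role!r}")
    return backend


if __name__ == "__main__":
    example_env = {"LLM_BACKEND": "ollama", "ROUTER_BACKEND": "groq"}
    for role in OVERRIDE_VARS:
        print(role, "->", resolve_backend(role, example_env))
```

With `LLM_BACKEND=ollama` and `ROUTER_BACKEND=groq`, only the router would use Groq in this sketch, while the experts, synthesizer, and web agent stay on Ollama.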

1. Groq API (Cloud)

Used for low-latency hosted inference, which avoids the generation bottleneck of running models locally.

Router Agent: llama-3.1-8b-instant
Finance & Medical Experts: llama-3.1-8b-instant
Synthesizer: llama-3.1-8b-instant

2. Ollama (Local Fine-Tuned GGUFs)

Runs the fine-tuned, domain-specific models entirely on local hardware.

Router Agent: phi4-mini-router (Fine-tuned from Microsoft Phi-4-mini).
Finance Expert: llama31-finance-expert (Fine-tuned from Llama-3.1-8B-Instruct).
Medical Expert: biomistral-medical-expert (Fine-tuned from BioMistral-7B).
Synthesizer: Reuses the llama31-finance-expert.

3. HuggingFace Transformers (Local PyTorch)

Loads local or Hub models directly via Transformers. Each role reads its model path or Hub ID from an environment variable:

Router: HF_ROUTER_MODEL
Finance: HF_FINANCE_MODEL
Medical: HF_MEDICAL_MODEL
Synthesizer: HF_SYNTH_MODEL
Web: HF_WEB_MODEL

Architecture

graph TD
    A[User Query] --> B[Router Agent]
    B -->|finance| C[Finance Agent]
    B -->|medical| D[Medical Agent]
    B -->|both| C & D

    C --> E{Sources found?}
    D --> E

    E -->|No| F[Web Agent]
    E -->|Yes| G[Synthesizer]
    F --> G

    G --> H[Final Answer + Sources + Domain Tag]

    B -.- B1["Classifies query: finance / medical / both"]
    C -.- C1["Hybrid BM25 + FAISS retrieval\n+ fine-tuned LLM (Llama-3.1 / BioMistral)"]
    D -.- C1
    F -.- F1["DuckDuckGo search fallback"]
    G -.- G1["Merges domain answers,\nadds citations + confidence"]
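
The flow in the diagram can be sketched in plain Python. This is an illustration only: the project orchestrates these steps with LangGraph, and the `State` fields, the function names, and the rule that the web fallback fires only when no expert returns any source are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical shared state passed between the agents."""
    query: str
    route: str = ""
    answers: dict = field(default_factory=dict)
    sources: list = field(default_factory=list)
    used_web: bool = False


def run_pipeline(query, router, experts, web_agent, synthesizer):
    """Route the query, call the experts, fall back to web, then synthesize."""
    state = State(query=query)
    state.route = router(query)  # expected: "finance", "medical", or "both"
    domains = ["finance", "medical"] if state.route == "both" else [state.route]

    for domain in domains:
        answer, sources = experts[domain](query)
        state.answers[domain] = answer
        state.sources.extend(sources)

    # Assumption: the web agent runs only if no expert found any source.
    if not state.sources:
        answer, sources = web_agent(query)
        state.answers["web"] = answer
        state.sources.extend(sources)
        state.used_web = True

    return synthesizer(state)


if __name__ == "__main__":
    # Stub agents standing in for the real LLM-backed ones.
    stub_experts = {
        "finance": lambda q: ("finance answer", ["AAPL 10-K 2023"]),
        "medical": lambda q: ("medical answer", []),
    }
    result = run_pipeline(
        "example question",
        router=lambda q: "both",
        experts=stub_experts,
        web_agent=lambda q: ("web answer", ["https://example.com"]),
        synthesizer=lambda s: s,
    )
    print(result.route, result.used_web, result.sources)
```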

Orchestration: LangGraph stateful graph
Vector DB: FAISS (local, no infra needed)
Retrieval Strategy: Hybrid BM25 (40% weight) + Dense FAISS (60% weight)
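
One way to combine the two retrievers with the 40/60 weighting is a weighted sum of normalized scores, sketched below. The README does not say how scores are normalized or fused, so the min-max normalization, the `hybrid_rank` helper, and the assumption that higher dense scores mean better matches are illustrative choices, not the project's confirmed method.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(bm25_scores, dense_scores, w_bm25=0.4, w_dense=0.6, top_k=5):
    """Fuse BM25 and dense scores with a weighted sum and return the top_k."""
    bm25_n = min_max(bm25_scores)
    dense_n = min_max(dense_scores)
    fused = {
        doc: w_bm25 * bm25_n.get(doc, 0.0) + w_dense * dense_n.get(doc, 0.0)
        for doc in set(bm25_n) | set(dense_n)
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    bm25 = {"chunk_a": 10.0, "chunk_b": 5.0, "chunk_c": 0.0}
    dense = {"chunk_a": 0.2, "chunk_b": 0.9, "chunk_d": 0.5}
    for doc, score in hybrid_rank(bm25, dense):
        print(f"{doc}: {score:.3f}")
```

A document found by only one retriever contributes zero from the other, so chunks that both retrievers agree on tend to rank highest.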

Quickstart

1. Prerequisites

Docker Desktop installed and running
Optional: A Groq API key if you want Groq-based routing or inference

2. Configure Environment

cp .env.example .env

Edit .env and fill in at minimum:

LLM_BACKEND=ollama
OLLAMA_BASE_URL=http://localhost:11434

# Optional if you want Groq for the router only
# GROQ_API_KEY=gsk_your_groq_key_here
# ROUTER_BACKEND=groq

# Optional if you want local Transformers
# LLM_BACKEND=hf
# HF_FINANCE_MODEL=path_or_hf_id
# HF_MEDICAL_MODEL=path_or_hf_id
# HF_ROUTER_MODEL=path_or_hf_id
# HF_SYNTH_MODEL=path_or_hf_id
# HF_WEB_MODEL=path_or_hf_id

Optional (for LangSmith tracing):

LANGCHAIN_API_KEY=ls__your_key_here
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT=docsight-rag

3. Run with Docker

docker compose up -d --build

The UI will be available at http://localhost:8501

# View logs
docker compose logs -f

# Stop
docker compose down

Running Locally (without Docker)

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the Streamlit app
streamlit run app/streamlit_app.py

Building FAISS Indexes

The indexes/ directory is pre-built and mounted into the Docker container. To rebuild from scratch (e.g., after adding new documents to the datasets):

# Step 1: Download raw data
python -m src.data_ingestion.sec_downloader
python -m src.data_ingestion.pubmed_downloader

# Step 2: Chunk documents
python -m src.data_ingestion.chunker

# Step 3: Build FAISS indexes
python build_indexes.py
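
The chunking step is not described in detail here. As a rough idea of what such a step typically does, the sketch below splits text into overlapping word windows. The `chunk_words` function and its `chunk_size` and `overlap` values are hypothetical and may differ from what `src.data_ingestion.chunker` actually implements.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of chunk_size that share overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of a somewhat larger index.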

Evaluation

We used the RAGAS framework to measure the faithfulness and relevance of our agents against a synthetic dataset of 100 QA pairs per domain.

Run the RAGAS evaluation on a small sample to stay within API rate limits:

python3 -m src.evaluation.run_ragas --sample 5

Run retrieval ablation testing (BM25 vs dense vs hybrid):

python3 -m src.evaluation.ablation
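
To make the ablation idea concrete, the sketch below compares three retriever outputs using simple ID-based precision@k and recall@k. It is not the code in `src.evaluation.ablation`, and these metrics are not the LLM-judged RAGAS context precision and recall reported in the results. The function names and the toy document IDs are assumptions.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a ranked list of document ids."""
    top = retrieved[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def compare_retrievers(runs, relevant, k=5):
    """Score each named retriever run against the same relevant set."""
    return {
        name: precision_recall_at_k(docs, relevant, k)
        for name, docs in runs.items()
    }


if __name__ == "__main__":
    # Toy ranked results for one query; ids are made up for illustration.
    runs = {
        "bm25": ["a", "x", "y"],
        "dense": ["b", "a", "z"],
        "hybrid": ["a", "b", "x"],
    }
    for name, (p, r) in compare_retrievers(runs, {"a", "b"}, k=2).items():
        print(f"{name}: precision@2={p:.2f} recall@2={r:.2f}")
```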

Final Evaluation Results

Summary of the final successful evaluation run:

Domain	Faithfulness	Answer Relevancy	Context Precision	Context Recall
Finance	0.9080	0.2970	0.5000	0.5000
Medical	0.3058	0.7630	0.6000	0.6000

Key Takeaways:

Finance Faithfulness (~91%): About 91% of the Finance agent's claims are supported by the retrieved SEC 10-K context, so its answers are well grounded. Its answer relevancy is low (~30%), however, which suggests that responses often do not address the question directly.
Medical Relevancy (~76%): The Medical agent's answers usually address the user's question, but its faithfulness is low (~31%), so many of its claims are not supported by the retrieved abstracts.
Retrieval Performance (50-60%): Context precision and recall are both 0.50 for Finance and 0.60 for Medical. The hybrid BM25 + FAISS retrievers surface relevant chunks only part of the time, which leaves clear room for improvement.

