🔍 RAG Auditor

Know if your RAG is production-ready before your users find out it isn't.

RAG Ecosystem

This repo is part of a broader RAG toolkit:

Repo	What it covers
rag-auditor ← you are here	Evaluate your RAG pipeline
multi-llm-rag-agent-chat	Build a production RAG chatbot with multi-LLM routing
rag-system-design-guide	Design reference — architecture patterns and trade-offs

Start with the design guide, build with the chatbot, evaluate with this.

Table of Contents

RAG Ecosystem
The Problem
What It Does
Demo
Key Features
Built On
Quickstart
Architecture
Step-by-Step Testing Guide
- Step 1 — Set Up Your Environment
- Step 2 — Start the Backend
- Step 3 — Generate a Synthetic Golden Dataset
- Step 4 — Evaluate a Single RAG Response
- Step 5 — Evaluate a Batch
- Step 6 — Compare Two Evaluations
- Step 7 — Run the Automated Test Suite
Understanding Verdicts
RAGAS Metrics Explained
Interpreting Recommendations
How LLM-as-Judge Works
Integration: Evaluating multi-llm-rag-agent-chat
Key Design Decisions
Extending the System
API Reference
Project Structure
Configuration
Contributing

The Problem

Most RAG systems ship broken.

Not broken in obvious ways — broken in the ways that matter: answers that sound confident but contradict the source documents, retrieved chunks that miss the point entirely, prompts that quietly hallucinate under edge cases.

The teams building these systems usually know something's off. They just have no way to measure it systematically — so they eyeball outputs, cross their fingers, and ship.

RAG Auditor fixes that.

What It Does

RAG Auditor is an open-source evaluation platform that automatically scores your RAG pipeline on four RAGAS metrics plus a hallucination-risk classification, then tells you what to fix and how.

Input:  Your question  +  Retrieved context chunks  +  RAG-generated answer
Output: A production-readiness verdict with scored diagnostics and fix recommendations

Evaluation Dimensions

Metric	What It Measures	Why It Matters
Faithfulness	Is the answer grounded in the retrieved context?	Catches hallucinations
Answer Relevancy	Does the answer actually address the question?	Catches non-answers
Context Precision	Are retrieved chunks signal or noise?	Catches retrieval bloat
Context Recall	Did retrieval surface the right information?	Catches retrieval gaps
Hallucination Risk	`LOW` / `MEDIUM` / `HIGH` classification	Human-readable safety check

Each score comes with a plain-English explanation and a specific, actionable recommendation.

Demo

⚡ Paste a question, context, and answer. Get a verdict in ~10 seconds.

Question:    "What is our refund policy for digital products?"
Context:     [3 retrieved chunks from your knowledge base]
Answer:      "You can get a refund within 30 days, no questions asked."

──────────────────────────────────────────────
  Overall Score      0.84    ✓ READY
──────────────────────────────────────────────
  Faithfulness       0.91    ✓ Strong
  Answer Relevancy   0.88    ✓ Strong
  Context Precision  0.67    ⚠ Warning
  Context Recall     0.79    ● Review
  Hallucination Risk  LOW    ✓ Safe
──────────────────────────────────────────────
  Top Issue: Context precision is low — your retriever is pulling in
  irrelevant chunks alongside the relevant ones. Try reducing top-k
  from 5 to 3, or add a reranking step before generation.
──────────────────────────────────────────────

Key Features

Single-sample evaluator — paste one Q/context/answer, get instant scores
Batch evaluation — upload CSV/JSON for bulk pipeline testing
Golden dataset generator — paste your source docs, auto-generate synthetic Q&A test pairs (the #1 reason teams skip evals is they have no dataset — this removes that blocker entirely)
Compare mode — run before/after evals when you change chunking, top-k, or prompts; see exact delta per metric
Evaluation history — save results in the current browser session and restore them into the evaluator to compare runs or keep iterating
RAG trace visualization — see scores annotated at each stage: Query → Retrieval → Prompt Construction → Generation → Answer
LLM-as-judge — Claude evaluates hallucination risk with reasoning, not just a number
Fix recommendations — every low score maps to a specific, actionable suggestion

Built On

RAG Auditor is a product layer built on top of battle-tested open source infrastructure:

RAGAS — the leading RAG evaluation framework, providing the core faithfulness, relevancy, precision, and recall metrics
Claude (claude-sonnet-4-6) — LLM-as-judge for hallucination detection and plain-English explanations
FastAPI — async Python backend with SSE streaming
React + Recharts — dashboard UI

RAGAS provides the scoring science. Claude provides the reasoning layer. RAG Auditor provides the product experience that makes both usable without a PhD.

Quickstart

Docker (recommended):

git clone https://github.com/yourusername/rag-auditor
cd rag-auditor
cp .env.example .env          # Add your ANTHROPIC_API_KEY
docker-compose up

Open http://localhost:3000 — that's it.

Local development:

# Backend
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
uvicorn main:app --reload --port 8000

# Frontend (new terminal)
cd frontend && npm install && npm run dev

Architecture

┌─────────────────────────────────────────────┐
│                  React UI                   │
│  Evaluator · Batch · Generator · Compare    │
└──────────────────┬──────────────────────────┘
                   │ REST + SSE
┌──────────────────▼──────────────────────────┐
│               FastAPI Backend               │
│                                             │
│  ┌─────────────┐    ┌────────────────────┐  │
│  │    RAGAS    │    │   Claude (Judge)   │  │
│  │  Evaluator  │    │  Hallucination     │  │
│  │             │    │  Detection +       │  │
│  │ Faithfulness│    │  Recommendations   │  │
│  │ Relevancy   │    │  Plain-English     │  │
│  │ Precision   │    │  Explanations      │  │
│  │ Recall      │    └────────────────────┘  │
│  └─────────────┘                            │
│         └──────────── asyncio.gather() ─────┘
│                                             │
│  ┌──────────────────────────────────────┐   │
│  │     Recommendation Engine            │   │
│  │  Score → Root Cause → Specific Fix   │   │
│  └──────────────────────────────────────┘   │
└─────────────────────────────────────────────┘

Step-by-Step Testing Guide

This section walks you through the full workflow: generating a dataset, evaluating it, and running automated tests.

Prerequisites

Python 3.11+
Node.js 18+
An Anthropic API key

Step 1 — Set Up Your Environment

cd backend
python -m venv venv
source venv/bin/activate          # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env

Edit .env and set:

ANTHROPIC_API_KEY=sk-ant-...
CORS_ORIGINS=http://localhost:3000

Step 2 — Start the Backend

cd backend
uvicorn main:app --reload --port 8000

Verify it's running:

curl http://localhost:8000/health
# → {"status": "ok"}

Interactive API docs are available at http://localhost:8000/docs

Step 3 — Generate a Synthetic Golden Dataset

Before you can evaluate your RAG system, you need Q&A pairs with ground truth. Use the Dataset Generator to create them from your own source documents.

Via the UI

Open http://localhost:3000
Click Dataset Generator in the nav
Paste 1–5 source documents (e.g., product docs, knowledge-base articles)
Set the number of Q&A pairs (1–100)
Click Generate
Download as JSON or CSV

CSV exports store contexts as a JSON array string so they round-trip cleanly into the Batch Evaluator. The Batch Evaluator also accepts the older pipe-delimited contexts format for compatibility.
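As a rough illustration of the two accepted `contexts` encodings, the sketch below parses a single CSV cell. `parse_contexts_field` is a hypothetical helper written for this README, not the Batch Evaluator's actual parser, and the handling of edge cases (empty cells, stray whitespace) is an assumption.

```python
import json


def parse_contexts_field(raw: str) -> list[str]:
    """Parse a CSV `contexts` cell into a list of context strings.

    Hypothetical helper: tries the JSON-array form first, then falls back
    to the legacy pipe-delimited form.
    """
    text = raw.strip()
    if not text:
        return []
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, list):
        return [str(item) for item in parsed]
    # Legacy format: chunks separated by "|"
    return [part.strip() for part in text.split("|") if part.strip()]
```

The JSON form is tried first because a chunk may legitimately contain a `|` character, which the legacy format cannot represent.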

Via curl

curl -X POST http://localhost:8000/generate-dataset \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [
      "RAG stands for Retrieval-Augmented Generation. It is a technique that combines information retrieval with large language models. The retrieval step fetches relevant documents from a knowledge base. The generation step uses the retrieved documents as context to produce an answer.",
      "Faithfulness measures whether every claim in the generated answer can be traced back to the retrieved context. A faithfulness score of 1.0 means the answer is fully grounded in the provided documents."
    ],
    "num_questions": 5
  }'

Response structure:

{
  "pairs": [
    {
      "question": "What does RAG stand for?",
      "answer": "RAG stands for Retrieval-Augmented Generation.",
      "ground_truth": "RAG stands for Retrieval-Augmented Generation.",
      "contexts": ["RAG stands for Retrieval-Augmented Generation..."],
      "evolution_type": "simple"
    }
  ],
  "total": 5,
  "source_documents": 2
}

The generator first attempts RAGAS TestsetGenerator (which creates diverse question types: simple, reasoning, multi-context). If RAGAS is unavailable or fails, it falls back to a direct Claude prompt. Either path produces the same output format.

Save the dataset for reuse

curl -X POST http://localhost:8000/generate-dataset \
  -H "Content-Type: application/json" \
  -d '{"documents": ["...your text..."], "num_questions": 10}' \
  -o my_dataset.json

Dataset evolution types

Type	Description	Distribution
`simple`	Direct factual questions from one document	50%
`reasoning`	Questions requiring inference or multi-step thinking	30%
`multi_context`	Questions that require combining multiple documents	20%

Each pair contains:

question — the user query to pose to your RAG system
answer — a reference answer (useful for comparison)
ground_truth — the canonical correct answer (used for context recall)
contexts — the source passages (use these as your "retrieved chunks" in evaluation)
evolution_type — question category

Tip: To evaluate your own RAG system with the generated dataset, replace contexts with the chunks your retriever actually returns, and replace answer with what your LLM generates. Keep ground_truth as-is.

Step 4 — Evaluate a Single RAG Response

Use a Q&A pair from your dataset (or write one manually) and evaluate it.

Via the UI

Open http://localhost:3000
Fill in Question, Answer, Retrieved Contexts (one per line), and optionally Ground Truth
Choose mode: Full (all metrics) or Quick (faster; skips context recall even if Ground Truth is provided)
Click Evaluate
View per-metric scores, hallucination badge, trace visualization, and recommendations

Via curl (non-streaming)

curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is RAG?",
    "answer": "RAG stands for Retrieval-Augmented Generation. It retrieves relevant documents and uses them to generate answers.",
    "contexts": [
      "RAG stands for Retrieval-Augmented Generation. It is a technique that combines information retrieval with large language models.",
      "The retrieval step fetches relevant documents. The generation step uses them as context."
    ],
    "ground_truth": "RAG retrieves relevant documents from a knowledge base and uses them as context to generate accurate answers.",
    "mode": "full"
  }'

Response structure:

{
  "overall_score": 0.87,
  "scores": {
    "faithfulness": 0.95,
    "answer_relevancy": 0.88,
    "context_precision": 0.80,
    "context_recall": 0.75,
    "hallucination_risk": "low"
  },
  "trace": {
    "retrieval_stage": {"score": 0.775, "issues": []},
    "generation_stage": {"score": 0.915, "issues": []}
  },
  "recommendations": [...],
  "verdict": "READY",
  "explanation": "Your RAG pipeline is well-grounded and relevant..."
}

Via curl (streaming SSE)

The streaming endpoint yields real-time progress events followed by the final result:

curl -N -X POST http://localhost:8000/evaluate/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is RAG?",
    "answer": "RAG stands for Retrieval-Augmented Generation.",
    "contexts": ["RAG stands for Retrieval-Augmented Generation..."],
    "ground_truth": "RAG retrieves documents and generates answers.",
    "mode": "full"
  }'

Events emitted (one per line, data: {...}):

{"type": "progress", "message": "Initializing evaluation engine...", "step": 0, "total": 5}
{"type": "progress", "message": "Checking answer faithfulness...", "step": 1, "total": 5}
{"type": "progress", "message": "Running hallucination check..."}
{"type": "scores", "scores": {...}}
{"type": "result", "data": {...full EvaluationResponse...}}
[DONE]
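If you would rather consume the stream from Python than from curl, a minimal sketch could look like the following. It assumes each event arrives on its own line, optionally prefixed with `data:`, and that the stream ends with `[DONE]`, as in the listing above. The function names are illustrative and not part of the project.

```python
import json
from typing import Iterable, Iterator, Optional


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON events from SSE lines, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines separate SSE events
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        if line == "[DONE]":
            return
        yield json.loads(line)


def final_result(lines: Iterable[str]) -> Optional[dict]:
    """Return the payload of the `result` event, or None if none arrives."""
    for event in parse_sse_events(lines):
        if event.get("type") == "result":
            return event["data"]
    return None


# Possible usage with httpx (not executed here):
#   with httpx.stream("POST", "http://localhost:8000/evaluate/stream",
#                     json=payload, timeout=None) as resp:
#       result = final_result(resp.iter_lines())
```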

Step 5 — Evaluate a Batch

Use this when you have a full dataset and want aggregate statistics.

In the UI, batch upload accepts JSON files directly and CSV files with contexts stored either as a JSON array string or as a legacy pipe-delimited field.

curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d '{
    "samples": [
      {
        "question": "What is RAG?",
        "answer": "RAG is Retrieval-Augmented Generation.",
        "contexts": ["RAG combines retrieval with generation."],
        "ground_truth": "RAG retrieves documents to generate answers.",
        "mode": "full"
      },
      {
        "question": "What does faithfulness measure?",
        "answer": "Faithfulness measures if claims trace back to context.",
        "contexts": ["Faithfulness measures whether every claim in the answer can be traced back to context."],
        "ground_truth": "Faithfulness measures grounding of the answer in retrieved context.",
        "mode": "full"
      }
    ]
  }'

Response includes:

{
  "aggregate": {
    "faithfulness": 0.91,
    "answer_relevancy": 0.85,
    "context_precision": 0.78,
    "context_recall": 0.72,
    "overall_score": 0.84
  },
  "verdict_distribution": {"READY": 2},
  "total_samples": 2,
  "successful": 2,
  "failed": 0,
  "results": [...],
  "errors": []
}
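To make the aggregate fields concrete, here is a sketch of how per-sample results could be rolled up. It assumes each entry in `results` has the single-evaluation shape shown in Step 4 (`scores`, `overall_score`, `verdict`) and that aggregates are plain means rounded to two decimals. The backend's real aggregation may differ.

```python
from collections import Counter
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def aggregate_results(results: list[dict]) -> dict:
    """Average each metric over the samples that report it and count verdicts.

    Assumes `results` is non-empty; statistics.mean raises on empty input.
    """
    aggregate = {}
    for metric in METRICS:
        values = [
            r["scores"][metric]
            for r in results
            if r["scores"].get(metric) is not None
        ]
        if values:  # e.g. context_recall is absent without ground truth
            aggregate[metric] = round(mean(values), 2)
    aggregate["overall_score"] = round(mean(r["overall_score"] for r in results), 2)
    return {
        "aggregate": aggregate,
        "verdict_distribution": dict(Counter(r["verdict"] for r in results)),
        "total_samples": len(results),
    }
```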

Using a generated dataset directly

# Generate
curl -X POST http://localhost:8000/generate-dataset \
  -H "Content-Type: application/json" \
  -d '{"documents": ["..."], "num_questions": 5}' \
  -o dataset.json

# Transform to batch format with jq
cat dataset.json | jq '{samples: [.pairs[] | {question, answer, contexts, ground_truth, mode: "full"}]}' \
  > batch_input.json

# Evaluate
curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d @batch_input.json
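If jq is not available, the same transformation is a few lines of Python. This is a sketch of the jq filter above and nothing more; `to_batch_input` is a name made up for this example.

```python
def to_batch_input(dataset: dict, mode: str = "full") -> dict:
    """Convert a /generate-dataset response into an /evaluate/batch request body."""
    keys = ("question", "answer", "contexts", "ground_truth")
    return {
        "samples": [
            {**{k: pair[k] for k in keys}, "mode": mode}
            for pair in dataset["pairs"]
        ]
    }
```

Load `dataset.json` with `json.load`, pass the result through `to_batch_input`, and write the output to `batch_input.json` with `json.dump`.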

Step 6 — Compare Two Evaluations

After changing your RAG pipeline, compare before and after:

# Save your baseline result
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"question": "...", "answer": "old answer", "contexts": [...], "mode": "full"}' \
  -o baseline.json

# Save your candidate result
curl -X POST http://localhost:8000/evaluate \
  -H "Content-Type: application/json" \
  -d '{"question": "...", "answer": "improved answer", "contexts": [...], "mode": "full"}' \
  -o candidate.json

# Compare
curl -X POST http://localhost:8000/evaluate/compare \
  -H "Content-Type: application/json" \
  -d "{\"baseline\": $(cat baseline.json), \"candidate\": $(cat candidate.json)}"

Response:

{
  "deltas": [
    {"metric": "faithfulness", "baseline": 0.70, "candidate": 0.90, "delta": 0.20, "direction": "improved"},
    {"metric": "answer_relevancy", "baseline": 0.85, "candidate": 0.82, "delta": -0.03, "direction": "regressed"}
  ],
  "summary": "Mixed results: faithfulness improved by 20.0% but answer_relevancy regressed by 3.0%.",
  "overall_direction": "mixed"
}
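The delta logic is simple enough to reproduce locally, for example to diff two saved `scores` objects without calling the API. The sketch below is an approximation: the `tolerance` for treating a change as noise and the `unchanged` label are assumptions, not documented behavior of `/evaluate/compare`.

```python
METRIC_ORDER = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def compare_scores(baseline: dict, candidate: dict, tolerance: float = 0.005) -> dict:
    """Compute per-metric deltas between two `scores` dicts."""
    deltas = []
    for metric in METRIC_ORDER:
        b, c = baseline.get(metric), candidate.get(metric)
        if b is None or c is None:
            continue  # metric missing on one side (e.g. no ground truth)
        delta = round(c - b, 4)
        if delta > tolerance:
            direction = "improved"
        elif delta < -tolerance:
            direction = "regressed"
        else:
            direction = "unchanged"
        deltas.append({
            "metric": metric,
            "baseline": b,
            "candidate": c,
            "delta": delta,
            "direction": direction,
        })

    # Collapse per-metric directions into one overall label
    directions = {d["direction"] for d in deltas} - {"unchanged"}
    if not directions:
        overall = "unchanged"
    elif len(directions) == 1:
        overall = directions.pop()
    else:
        overall = "mixed"
    return {"deltas": deltas, "overall_direction": overall}
```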

Step 7 — Run the Automated Test Suite

Unit tests cover formatters, trace analysis, and recommendation logic — no API key required.

cd backend
source venv/bin/activate
pytest tests/ -v

Expected output:

tests/test_evaluate.py::TestFormatters::test_clamp_score_valid PASSED
tests/test_evaluate.py::TestFormatters::test_compute_overall_score_all_metrics PASSED
tests/test_evaluate.py::TestFormatters::test_verdict_ready PASSED
tests/test_evaluate.py::TestFormatters::test_verdict_not_ready_high_hallucination PASSED
tests/test_evaluate.py::TestRecommendations::test_critical_faithfulness PASSED
tests/test_evaluate.py::TestRecommendations::test_recommendations_sorted_by_severity PASSED
tests/test_evaluate.py::TestTraceAnalyzer::test_analyze_trace_with_all_scores PASSED
...

Run frontend tests:

cd frontend
npm test

Understanding Verdicts

Verdict	Condition
READY	Overall score ≥ 0.80 AND hallucination risk is not `high`
NEEDS_WORK	Overall score ≥ 0.60 and < 0.80, AND hallucination risk is not `high`
NOT_READY	Overall score < 0.60, OR hallucination risk is `high`

The overall score is a weighted average:

Metric	Weight
Faithfulness	35%
Answer Relevancy	30%
Context Precision	20%
Context Recall	15%
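As a worked example of the weighting, the sketch below computes the overall score from the table above. It also re-normalizes the weights when a metric is missing, as described under Change the Score Weights. The function name is illustrative; the project's own implementation is `compute_overall_score()` in `backend/utils/formatters.py` and may differ in detail.

```python
WEIGHTS = {
    "faithfulness": 0.35,
    "answer_relevancy": 0.30,
    "context_precision": 0.20,
    "context_recall": 0.15,
}


def weighted_overall(scores: dict) -> float:
    """Weighted mean over the metrics present, with weights re-normalized."""
    present = {m: w for m, w in WEIGHTS.items() if scores.get(m) is not None}
    if not present:
        raise ValueError("no scored metrics")
    total_weight = sum(present.values())
    return sum(scores[m] * w for m, w in present.items()) / total_weight
```

With the Demo scores (0.91, 0.88, 0.67, 0.79) this gives 0.835, which rounds to the 0.84 shown there. Without context recall, the remaining three weights are divided by 0.85 so they still sum to 1.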

RAGAS Metrics Explained

Metric	What it measures	Requires ground truth?
Faithfulness	Does every claim in the answer trace back to the retrieved context? A score of 1.0 means the answer is fully grounded.	No
Answer Relevancy	Does the answer actually address the question asked? Low scores mean the answer is off-topic.	No
Context Precision	What fraction of the retrieved chunks are actually relevant? Low scores mean your retriever is returning noise.	No
Context Recall	Was all the relevant information present in the retrieved chunks? Low scores mean your retriever missed important content.	Yes

Hallucination Risk is an additional LLM-as-judge assessment (not from RAGAS) that classifies whether the answer introduces information not in the context: low, medium, or high.

Interpreting Recommendations

Every evaluation returns prioritized recommendations sorted by severity:

Severity	Score range	Example fix
critical	< 0.50	Rewrite your generation prompt to forbid external knowledge
warning	0.50–0.70	Tighten top-k, add reranking
info	≥ 0.80	Monitor production metrics

How LLM-as-Judge Works

RAG Auditor uses Claude (claude-sonnet-4-6) for three purposes:

RAGAS metrics — Claude is the judge LLM for all RAGAS computations (faithfulness, answer relevancy, context precision, context recall)
Hallucination detection — A custom Claude prompt analyzes whether the answer introduces unsupported claims, returning risk_level, confidence, unsupported_claims, and rationale
Plain-English explanation — Claude synthesizes scores into a 2–3 sentence summary of what to fix
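The hallucination check returns four fields, so a caller will usually want to validate them before trusting the verdict. The sketch below assumes the judge replies with a JSON object carrying exactly those field names and that `confidence` is meant to lie in [0, 1]. The dataclass and parser are hypothetical and not taken from `llm_judge.py`.

```python
import json
from dataclasses import dataclass, field

RISK_LEVELS = ("low", "medium", "high")


@dataclass
class HallucinationVerdict:
    risk_level: str
    confidence: float
    unsupported_claims: list[str] = field(default_factory=list)
    rationale: str = ""


def parse_judge_output(raw: str) -> HallucinationVerdict:
    """Validate the judge's JSON reply and normalize its fields."""
    data = json.loads(raw)
    risk = str(data["risk_level"]).lower()
    if risk not in RISK_LEVELS:
        raise ValueError(f"unexpected risk_level: {risk!r}")
    # Clamp confidence into [0, 1] (assumed range)
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return HallucinationVerdict(
        risk_level=risk,
        confidence=confidence,
        unsupported_claims=[str(c) for c in data.get("unsupported_claims", [])],
        rationale=str(data.get("rationale", "")),
    )
```

Rejecting unknown risk levels matters because `high` overrides the numeric score when the verdict is computed.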

Integration: Evaluating multi-llm-rag-agent-chat

multi-llm-rag-agent-chat is a production RAG chatbot with dual-LLM routing (GPT-4o / Gemini Flash), ChromaDB vector storage, HuggingFace embeddings (all-MiniLM-L6-v2), and an RLHF feedback loop. RAG Auditor is the ideal complement — it gives you objective metric scores for every dimension of that pipeline.

multi-llm-rag-agent-chat          RAG Auditor
─────────────────────────         ─────────────────────────────────
Upload documents          ──►     Generate golden dataset from same docs
Ask question              ──►     Capture question + answer + contexts
ChromaDB retrieval (top 6)──►     Evaluate context_precision / context_recall
GPT-4o or Gemini answer   ──►     Evaluate faithfulness / answer_relevancy
RLHF re-ranking active    ──►     Re-run batch eval, compare delta scores

Step 1 — Generate a Golden Dataset from Your Documents

Use the same documents you uploaded to the chatbot to generate ground-truth Q&A pairs:

curl -X POST http://localhost:8000/generate-dataset \
  -H "Content-Type: application/json" \
  -d '{
    "documents": [
      "paste the text content of one of your uploaded PDFs or web pages here",
      "paste a second document here"
    ],
    "num_questions": 20
  }' \
  -o golden_dataset.json

These Q&A pairs become your evaluation harness. The ground_truth field is what you use to score context recall.

Step 2 — Capture Live Responses from the Chatbot

For each question in your golden dataset, query the chatbot and capture the full response including the retrieved source chunks. The chatbot's chat endpoint accepts { "query": "...", "session_id": "..." } and returns answer + sources.

Note on content truncation: The chatbot's sources[].content field is currently truncated to 300 characters (doc.page_content[:300] in chat.py). This is fine for UI display but too short for RAGAS to compute accurate faithfulness and recall scores. See Changes needed to multi-llm-rag-agent-chat below for the one-line fix.

import json
import uuid

import httpx

with open("golden_dataset.json") as f:
    golden = json.load(f)

samples = []
session_id = str(uuid.uuid4())  # one chat session for the whole run

for pair in golden["pairs"]:
    # Query the chatbot backend; the request field is "query", not "message"
    resp = httpx.post(
        "http://localhost:8001/chat",
        json={"query": pair["question"], "session_id": session_id},
        timeout=60,
    )
    resp.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    data = resp.json()

    samples.append({
        "question": pair["question"],
        "answer": data["answer"],
        # sources[].content is truncated to 300 chars by default; apply the
        # full_content fix (see below) to get meaningful RAGAS scores
        "contexts": [s["content"] for s in data["sources"]],
        "ground_truth": pair["ground_truth"],
        "mode": "full",
    })

with open("batch_input.json", "w") as f:
    json.dump({"samples": samples}, f)

Changes needed to multi-llm-rag-agent-chat

Only one change is required in the chatbot to make it fully compatible with RAG Auditor evaluation.

Problem: backend/routers/chat.py truncates source content to 300 chars:

# current — too short for RAGAS
Source(content=doc.page_content[:300], ...)

Fix: Return the full chunk content (or add a full_content field alongside the truncated preview):

# option A — return full content (evaluation-friendly, slightly larger payload)
Source(content=doc.page_content, ...)

# option B — keep the 300-char preview for UI, add full content for eval
Source(
    content=doc.page_content[:300],   # UI display
    full_content=doc.page_content,    # evaluation use
    ...
)

If you go with option B, update the integration script to use s["full_content"] instead of s["content"].

No other changes are required — the chatbot's API shape (query, answer, sources, chunk_ids, llm_used, complexity_score) maps cleanly to RAG Auditor's evaluation input.
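So that the integration script works with either option, the context extraction can prefer `full_content` and fall back to `content`. A small sketch, assuming the field names proposed above (`full_content` exists only if you apply option B):

```python
def contexts_from_sources(sources: list[dict]) -> list[str]:
    """Prefer the untruncated `full_content` field, fall back to `content`."""
    return [s.get("full_content") or s["content"] for s in sources]
```

In the capture script, this would replace the list comprehension that builds `contexts`.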

Step 3 — Run Batch Evaluation

curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d @batch_input.json \
  -o batch_results.json

# Quick summary
cat batch_results.json | jq '.aggregate'

This gives you aggregate scores across your whole document corpus — exactly what you need to decide if the pipeline is production-ready.

Step 4 — Compare GPT-4o vs Gemini Routing

The chatbot routes queries above complexity threshold 0.4 to GPT-4o and below to Gemini Flash. Use RAG Auditor's compare mode to measure whether the routing decision actually improves quality:

import httpx

EVAL_URL = "http://localhost:8000/evaluate"
TIMEOUT = 120  # evaluations can take 10–30 s; httpx's default timeout is 5 s

question = "What are the key architectural trade-offs in microservices?"  # high complexity
contexts = ["...retrieved chunks..."]
ground_truth = "..."

# Force GPT-4o answer (or capture from a high-complexity query)
gpt4o_eval = httpx.post(EVAL_URL, json={
    "question": question,
    "answer": "GPT-4o generated answer here",
    "contexts": contexts,
    "ground_truth": ground_truth,
    "mode": "full"
}, timeout=TIMEOUT).json()

# Capture Gemini answer (low-complexity routing)
gemini_eval = httpx.post(EVAL_URL, json={
    "question": question,
    "answer": "Gemini Flash generated answer here",
    "contexts": contexts,
    "ground_truth": ground_truth,
    "mode": "full"
}, timeout=TIMEOUT).json()

# Compare: Gemini is the baseline, GPT-4o the candidate
compare = httpx.post(f"{EVAL_URL}/compare", json={
    "baseline": gemini_eval,
    "candidate": gpt4o_eval
}, timeout=TIMEOUT).json()

print(compare["summary"])
# Prints a one-line summary of which metrics improved or regressed for GPT-4o

Repeated over a set of high-complexity questions, this shows whether the routing threshold (0.4) is well placed: if GPT-4o isn't consistently outscoring Gemini on hard questions, you may need to adjust the threshold.

Step 5 — Measure RLHF Improvement Over Time

The chatbot's RLHF loop re-ranks ChromaDB results based on user thumbs up/down. To measure whether feedback is actually improving retrieval quality:

# Baseline: evaluate before users have given feedback
curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d @batch_input.json \
  -o before_rlhf.json

# ... let users interact with the chatbot and submit feedback ...

# Re-run: same questions, with answers and contexts re-captured from the chatbot
# (saved as batch_input_after.json)
curl -X POST http://localhost:8000/evaluate/batch \
  -H "Content-Type: application/json" \
  -d @batch_input_after.json \
  -o after_rlhf.json

# Compare aggregate context_precision scores
cat before_rlhf.json | jq '.aggregate.context_precision'
cat after_rlhf.json  | jq '.aggregate.context_precision'

An increase in context_precision after RLHF feedback suggests that re-ranking is surfacing higher-signal chunks. An increase in context_recall suggests that fewer relevant chunks are being missed.

What to Watch For

Metric	What it reveals about the chatbot
context_precision	Whether ChromaDB's cosine similarity retrieval (top 6 → top 3) is pulling in noise
context_recall	Whether `all-MiniLM-L6-v2` embeddings capture the semantic meaning of your domain
faithfulness	Whether GPT-4o / Gemini is staying grounded or hallucinating beyond retrieved chunks
answer_relevancy	Whether the complexity router is selecting the right LLM for each query type
hallucination_risk	Claude's independent assessment — useful as a cross-check on the routing decision

Expected baseline: all-MiniLM-L6-v2 is a lightweight embedding model optimized for speed, not domain accuracy. If context_recall scores below 0.70 consistently, consider upgrading to a larger embedding model (e.g. BAAI/bge-large-en-v1.5) and re-running the batch eval to measure the improvement.

Key Design Decisions

1. RAGAS + Claude in combination, not either/or

RAGAS provides standardized, well-defined metrics that can be compared across runs. Claude adds contextual reasoning that RAGAS does not provide, specifically hallucination detection and plain-English explanations. Running both concurrently via asyncio.gather() means the total wait is roughly that of the slower task rather than the sum of the two.
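The concurrency pattern can be sketched with two stand-in coroutines. Both functions below are stubs that sleep instead of calling RAGAS or Claude, and their names are invented for this example. Only the use of `asyncio.gather()` reflects the design described here.

```python
import asyncio


async def run_ragas_metrics(sample: dict) -> dict:
    """Stand-in for the RAGAS evaluation (hypothetical stub)."""
    await asyncio.sleep(0.2)
    return {"faithfulness": 0.9}


async def run_hallucination_check(sample: dict) -> str:
    """Stand-in for the Claude judge call (hypothetical stub)."""
    await asyncio.sleep(0.2)
    return "low"


async def evaluate(sample: dict) -> dict:
    # Both coroutines are scheduled together; gather returns results
    # in the same order as its arguments.
    scores, risk = await asyncio.gather(
        run_ragas_metrics(sample),
        run_hallucination_check(sample),
    )
    return {**scores, "hallucination_risk": risk}
```

Calling `asyncio.run(evaluate({}))` takes about 0.2 s rather than 0.4 s, because the two sleeps overlap.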

2. Weighted overall score, not a simple average

Faithfulness (35%) is weighted highest because hallucinated content is the most damaging RAG failure mode. Answer relevancy (30%) comes second because an off-topic answer is useless no matter how well it is grounded. The context metrics are weighted lower (20% and 15%) because they diagnose the retriever, which can be fixed without touching the LLM.

3. SSE streaming over polling

The /evaluate/stream endpoint emits progress events per metric so the UI can update in real time as each RAGAS metric completes. This avoids a blank "loading" state during what can be a 10–30 second evaluation.

4. Three-tier verdict, not a score

READY / NEEDS_WORK / NOT_READY gives developers and stakeholders a clear go/no-go signal without needing to interpret a float. Hallucination risk overrides the score: even a 0.95 overall score is NOT_READY if Claude classifies hallucination as high.

5. RAGAS → Claude fallback for dataset generation

The dataset generator first attempts RAGAS TestsetGenerator (which produces richer, more diverse question types using multi-hop reasoning). If RAGAS is unavailable or fails, it falls back to a direct Claude prompt that produces the same JSON schema. The caller never needs to know which path ran.

6. Recommendations sorted by severity, not metric

Critical issues (score < 0.50) surface first regardless of which metric produced them. This matches how a developer would triage — fix the worst thing first, then warnings, then informational.
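A severity-first ordering can be expressed as a single sort key. In the sketch below, the `severity` and `score` field names on each recommendation are assumptions, and the tie-break on ascending score is one reasonable reading of "fix the worst thing first", not confirmed behavior.

```python
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def sort_recommendations(recommendations: list[dict]) -> list[dict]:
    """Order by severity first, then by ascending score within a severity."""
    return sorted(
        recommendations,
        key=lambda r: (SEVERITY_ORDER[r["severity"]], r["score"]),
    )
```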

Extending the System

Swap the LLM Judge

All Claude calls are isolated in backend/services/llm_judge.py. To use a different model, change the model parameter:

# llm_judge.py
response = await client.messages.create(
    model="claude-sonnet-4-6",   # change this
    ...
)

To use a different provider entirely, replace the anthropic.AsyncAnthropic client with any async client that accepts the same prompt structure.

Add a New RAGAS Metric

In backend/services/ragas_evaluator.py, add your metric to metrics_config in stream_ragas_evaluation() and to _run_ragas_sync():

from ragas.metrics import answer_correctness   # example new metric

metrics_config = [
    ...
    ("answer_correctness", "Checking answer correctness..."),
]

Then add it to the Scores model in backend/models/evaluation.py and the weighting dict in backend/utils/formatters.py.

Change the Verdict Thresholds

Edit score_to_verdict() in backend/utils/formatters.py:

def score_to_verdict(overall_score: float, hallucination_risk) -> str:
    if hallucination_risk == "high":
        return "NOT_READY"
    if overall_score >= 0.85:   # raise the bar
        return "READY"
    if overall_score >= 0.65:
        return "NEEDS_WORK"
    return "NOT_READY"

Change the Score Weights

Edit the weights dict in compute_overall_score() in backend/utils/formatters.py. Weights are automatically re-normalized if a metric is absent, so you can adjust without breaking missing-metric cases.

Add a New Evaluation Endpoint

Add a router file in backend/routers/ and register it in backend/main.py:

from routers.my_endpoint import router as my_router
app.include_router(my_router)

API Reference

Interactive docs available at http://localhost:8000/docs

Method	Path	Description
`GET`	`/health`	Health check
`POST`	`/evaluate`	Single evaluation (blocking)
`POST`	`/evaluate/stream`	Single evaluation with SSE progress
`POST`	`/evaluate/batch`	Batch evaluation with aggregates
`POST`	`/evaluate/compare`	Compare baseline vs candidate
`POST`	`/generate-dataset`	Generate synthetic golden dataset

Project Structure

rag-auditor/
│
├── .env.example                          # Template — copy to .env and add ANTHROPIC_API_KEY
├── docker-compose.yml                    # Orchestrates backend + frontend
│
├── backend/
│   ├── Dockerfile
│   ├── requirements.txt                  # ragas==0.1.21, anthropic, langchain-anthropic, fastapi
│   ├── main.py                           # FastAPI app, CORS middleware, router registration
│   │
│   ├── models/
│   │   ├── evaluation.py                 # EvaluationRequest/Response, Scores, Trace, Recommendations
│   │   └── dataset.py                    # GenerateDatasetRequest/Response, QAPair
│   │
│   ├── routers/
│   │   ├── evaluate.py                   # POST /evaluate, /evaluate/stream, /evaluate/batch, /evaluate/compare
│   │   ├── generate_dataset.py           # POST /generate-dataset
│   │   └── health.py                     # GET /health
│   │
│   ├── services/
│   │   ├── ragas_evaluator.py            # RAGAS metric runner — sync executor + async SSE streaming
│   │   ├── llm_judge.py                  # Claude hallucination detector + plain-English explanation
│   │   ├── trace_analyzer.py             # Maps scores → retrieval/generation stage issues + recommendations
│   │   └── dataset_generator.py          # RAGAS TestsetGenerator with Claude fallback
│   │
│   ├── utils/
│   │   └── formatters.py                 # clamp_score(), compute_overall_score(), score_to_verdict()
│   │
│   └── tests/
│       └── test_evaluate.py              # Unit tests for formatters, trace, recommendations (no API key)
│
└── frontend/
    ├── Dockerfile
    ├── vite.config.js                    # Dev proxy → localhost:8000
    ├── src/
    │   ├── App.jsx                       # Top-level layout + tab routing
    │   ├── components/
    │   │   ├── EvaluatorForm.jsx         # Single-sample input form
    │   │   ├── ResultsDashboard.jsx      # Score cards, verdict, explanation
    │   │   ├── TraceVisualizer.jsx       # Retrieval + generation stage breakdown
    │   │   ├── HallucinationBadge.jsx    # LOW / MEDIUM / HIGH risk badge
    │   │   ├── RecommendationsPanel.jsx  # Sorted fix recommendations
    │   │   ├── BatchEvaluator.jsx        # CSV/JSON upload + aggregate results
    │   │   ├── DatasetGenerator.jsx      # Doc input + dataset download
    │   │   ├── CompareMode.jsx           # Baseline vs candidate delta view
    │   │   ├── HistoryPanel.jsx          # In-memory session history + restore
    │   │   └── ScoreCard.jsx             # Reusable per-metric score component
    │   ├── hooks/
    │   │   ├── useEvaluate.js            # SSE streaming hook for /evaluate/stream
    │   │   └── useHistory.js             # In-memory session history state
    │   ├── api/
    │   │   └── client.js                 # Axios wrappers for all backend endpoints
    │   └── utils/
    │       └── scoreHelpers.js           # Color/label helpers for score display
    └── src/utils/
        └── scoreHelpers.test.js          # Frontend unit tests

Configuration

All settings are loaded from .env (copy from .env.example):

Variable	Default	Required	Description
`ANTHROPIC_API_KEY`	—	Yes	Anthropic API key — used for RAGAS judge LLM, hallucination detection, and explanations
`RAGAS_APP_TOKEN`	—	No	RAGAS Cloud token for dashboard and experiment tracking
`CORS_ORIGINS`	`http://localhost:3000`	No	Comma-separated list of allowed frontend origins

Contributing

See CONTRIBUTING.md for how to submit issues, PRs, and run tests.

License

MIT — see LICENSE
