This repo is part of a broader RAG toolkit:
| Repo | What it covers |
|---|---|
| rag-auditor ← you are here | Evaluate your RAG pipeline |
| multi-llm-rag-agent-chat | Build a production RAG chatbot with multi-LLM routing |
| rag-system-design-guide | Design reference — architecture patterns and trade-offs |
Start with the design guide, build with the chatbot, evaluate with this.
- RAG Ecosystem
- The Problem
- What It Does
- Demo
- Key Features
- Built On
- Quickstart
- Architecture
- Step-by-Step Testing Guide
- Understanding Verdicts
- RAGAS Metrics Explained
- Interpreting Recommendations
- How LLM-as-Judge Works
- Integration: Evaluating multi-llm-rag-agent-chat
- Key Design Decisions
- Extending the System
- API Reference
- Key Files
- Configuration
- Contributing
Most RAG systems ship broken.
Not broken in obvious ways — broken in the ways that matter: answers that sound confident but contradict the source documents, retrieved chunks that miss the point entirely, prompts that quietly hallucinate under edge cases.
The teams building these systems usually know something's off. They just have no way to measure it systematically — so they eyeball outputs, cross their fingers, and ship.
RAG Auditor fixes that.
RAG Auditor is an open source evaluation platform that automatically scores your RAG pipeline across the four dimensions that predict real-world quality — then tells you exactly what to fix and how.
Input: Your question + Retrieved context chunks + RAG-generated answer
Output: A production-readiness verdict with scored diagnostics and fix recommendations
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? | Catches hallucinations |
| Answer Relevancy | Does the answer actually address the question? | Catches non-answers |
| Context Precision | Are retrieved chunks signal or noise? | Catches retrieval bloat |
| Context Recall | Did retrieval surface the right information? | Catches retrieval gaps |
| Hallucination Risk | LOW / MEDIUM / HIGH classification | Human-readable safety check |
Each score comes with a plain-English explanation and a specific, actionable recommendation.
⚡ Paste a question, context, and answer. Get a verdict in ~10 seconds.
Question: "What is our refund policy for digital products?"
Context: [3 retrieved chunks from your knowledge base]
Answer: "You can get a refund within 30 days, no questions asked."
──────────────────────────────────────────────
Overall Score 0.84 ✓ READY
──────────────────────────────────────────────
Faithfulness 0.91 ✓ Strong
Answer Relevancy 0.88 ✓ Strong
Context Precision 0.67 ⚠ Warning
Context Recall 0.79 ● Review
Hallucination Risk LOW ✓ Safe
──────────────────────────────────────────────
Top Issue: Context precision is low — your retriever is pulling in
irrelevant chunks alongside the relevant ones. Try reducing top-k
from 5 to 3, or add a reranking step before generation.
──────────────────────────────────────────────
- Single-sample evaluator — paste one Q/context/answer, get instant scores
- Batch evaluation — upload CSV/JSON for bulk pipeline testing
- Golden dataset generator — paste your source docs, auto-generate synthetic Q&A test pairs (the #1 reason teams skip evals is they have no dataset — this removes that blocker entirely)
- Compare mode — run before/after evals when you change chunking, top-k, or prompts; see exact delta per metric
- Evaluation history — save results in the current browser session and restore them into the evaluator to compare runs or keep iterating
- RAG trace visualization — see scores annotated at each stage: Query → Retrieval → Prompt Construction → Generation → Answer
- LLM-as-judge — Claude evaluates hallucination risk with reasoning, not just a number
- Fix recommendations — every low score maps to a specific, actionable suggestion
RAG Auditor is a product layer built on top of battle-tested open source infrastructure:
- RAGAS — the leading RAG evaluation framework, providing the core faithfulness, relevancy, precision, and recall metrics
- Claude (`claude-sonnet-4-6`) — LLM-as-judge for hallucination detection and plain-English explanations
- FastAPI — async Python backend with SSE streaming
- React + Recharts — dashboard UI
RAGAS provides the scoring science. Claude provides the reasoning layer. RAG Auditor provides the product experience that makes both usable without a PhD.
Docker (recommended):
git clone https://github.com/yourusername/rag-auditor
cd rag-auditor
cp .env.example .env # Add your ANTHROPIC_API_KEY
docker-compose up

Open http://localhost:3000 — that's it.
Local development:
# Backend
cd backend
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
uvicorn main:app --reload --port 8000
# Frontend (new terminal)
cd frontend && npm install && npm run dev

┌─────────────────────────────────────────────┐
│ React UI │
│ Evaluator · Batch · Generator · Compare │
└──────────────────┬──────────────────────────┘
│ REST + SSE
┌──────────────────▼──────────────────────────┐
│ FastAPI Backend │
│ │
│ ┌─────────────┐ ┌────────────────────┐ │
│ │ RAGAS │ │ Claude (Judge) │ │
│ │ Evaluator │ │ Hallucination │ │
│ │ │ │ Detection + │ │
│ │ Faithfulness│ │ Recommendations │ │
│ │ Relevancy │ │ Plain-English │ │
│ │ Precision │ │ Explanations │ │
│ │ Recall │ └────────────────────┘ │
│ └─────────────┘ │
│ └──────────── asyncio.gather() ─────┘
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Recommendation Engine │ │
│ │ Score → Root Cause → Specific Fix │ │
│ └──────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
This section walks you through the full workflow: generating a dataset, evaluating it, and running automated tests.
- Python 3.11+
- Node.js 18+
- An Anthropic API key
cd backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env

Edit .env and set:
ANTHROPIC_API_KEY=sk-ant-...
CORS_ORIGINS=http://localhost:3000
cd backend
uvicorn main:app --reload --port 8000

Verify it's running:
curl http://localhost:8000/health
# → {"status": "ok"}

Interactive API docs are available at http://localhost:8000/docs
Before you can evaluate your RAG system, you need Q&A pairs with ground truth. Use the Dataset Generator to create them from your own source documents.
- Open http://localhost:3000
- Click Dataset Generator in the nav
- Paste 1–5 source documents (e.g., product docs, knowledge-base articles)
- Set the number of Q&A pairs (1–100)
- Click Generate
- Download as JSON or CSV
CSV exports store contexts as a JSON array string so they round-trip cleanly into the Batch Evaluator.
The Batch Evaluator also accepts the older pipe-delimited contexts format for compatibility.
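Both accepted contexts formats can be handled with a small parser. This is a sketch (`parse_contexts` is an illustrative helper, not part of the codebase): try the JSON array string first, fall back to pipe-delimited.

```python
import json

def parse_contexts(cell: str) -> list[str]:
    """Parse a CSV contexts cell: JSON array string preferred,
    legacy pipe-delimited accepted as a fallback."""
    try:
        parsed = json.loads(cell)
        if isinstance(parsed, list):
            return [str(c) for c in parsed]
    except json.JSONDecodeError:
        pass
    return [c.strip() for c in cell.split("|") if c.strip()]

print(parse_contexts('["chunk one", "chunk two"]'))  # → ['chunk one', 'chunk two']
print(parse_contexts("chunk one | chunk two"))       # → ['chunk one', 'chunk two']
```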
curl -X POST http://localhost:8000/generate-dataset \
-H "Content-Type: application/json" \
-d '{
"documents": [
"RAG stands for Retrieval-Augmented Generation. It is a technique that combines information retrieval with large language models. The retrieval step fetches relevant documents from a knowledge base. The generation step uses the retrieved documents as context to produce an answer.",
"Faithfulness measures whether every claim in the generated answer can be traced back to the retrieved context. A faithfulness score of 1.0 means the answer is fully grounded in the provided documents."
],
"num_questions": 5
}'

Response structure:
{
"pairs": [
{
"question": "What does RAG stand for?",
"answer": "RAG stands for Retrieval-Augmented Generation.",
"ground_truth": "RAG stands for Retrieval-Augmented Generation.",
"contexts": ["RAG stands for Retrieval-Augmented Generation..."],
"evolution_type": "simple"
}
],
"total": 5,
"source_documents": 2
}

The generator first attempts RAGAS TestsetGenerator (which creates diverse question types: simple, reasoning, multi-context). If RAGAS is unavailable it falls back to Claude directly. Either path produces the same output format.
curl -X POST http://localhost:8000/generate-dataset \
-H "Content-Type: application/json" \
-d '{"documents": ["...your text..."], "num_questions": 10}' \
-o my_dataset.json

| Type | Description | Distribution |
|---|---|---|
| `simple` | Direct factual questions from one document | 50% |
| `reasoning` | Questions requiring inference or multi-step thinking | 30% |
| `multi_context` | Questions that require combining multiple documents | 20% |
Each pair contains:
- `question` — the user query to pose to your RAG system
- `answer` — a reference answer (useful for comparison)
- `ground_truth` — the canonical correct answer (used for context recall)
- `contexts` — the source passages (use these as your "retrieved chunks" in evaluation)
- `evolution_type` — question category
Tip: To evaluate your own RAG system with the generated dataset, replace `contexts` with the chunks your retriever actually returns, and replace `answer` with what your LLM generates. Keep `ground_truth` as-is.
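That swap can be sketched in a few lines of Python. Note that `retrieve` and `generate` are placeholders for your own retriever and LLM call, not functions this repo provides:

```python
def build_eval_samples(pairs, retrieve, generate):
    """Convert generated Q&A pairs into evaluation inputs for your own
    pipeline. `retrieve(question)` should return your retriever's chunks;
    `generate(question, contexts)` should return your LLM's answer."""
    samples = []
    for pair in pairs:
        contexts = retrieve(pair["question"])  # your chunks, not the dataset's
        samples.append({
            "question": pair["question"],
            "answer": generate(pair["question"], contexts),  # your LLM's answer
            "contexts": contexts,
            "ground_truth": pair["ground_truth"],  # keep as-is
            "mode": "full",
        })
    return samples
```

Wrap the result as `{"samples": [...]}` and it is ready for the Batch Evaluator described below.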
Use a Q&A pair from your dataset (or write one manually) and evaluate it.
- Open http://localhost:3000
- Fill in Question, Answer, Retrieved Contexts (one per line), and optionally Ground Truth
- Choose mode: Full (all metrics) or Quick (faster; skips context recall even if Ground Truth is provided)
- Click Evaluate
- View per-metric scores, hallucination badge, trace visualization, and recommendations
curl -X POST http://localhost:8000/evaluate \
-H "Content-Type: application/json" \
-d '{
"question": "What is RAG?",
"answer": "RAG stands for Retrieval-Augmented Generation. It retrieves relevant documents and uses them to generate answers.",
"contexts": [
"RAG stands for Retrieval-Augmented Generation. It is a technique that combines information retrieval with large language models.",
"The retrieval step fetches relevant documents. The generation step uses them as context."
],
"ground_truth": "RAG retrieves relevant documents from a knowledge base and uses them as context to generate accurate answers.",
"mode": "full"
}'

Response structure:
{
"overall_score": 0.87,
"scores": {
"faithfulness": 0.95,
"answer_relevancy": 0.88,
"context_precision": 0.80,
"context_recall": 0.75,
"hallucination_risk": "low"
},
"trace": {
"retrieval_stage": {"score": 0.775, "issues": []},
"generation_stage": {"score": 0.915, "issues": []}
},
"recommendations": [...],
"verdict": "READY",
"explanation": "Your RAG pipeline is well-grounded and relevant..."
}

The streaming endpoint yields real-time progress events followed by the final result:
curl -N -X POST http://localhost:8000/evaluate/stream \
-H "Content-Type: application/json" \
-d '{
"question": "What is RAG?",
"answer": "RAG stands for Retrieval-Augmented Generation.",
"contexts": ["RAG stands for Retrieval-Augmented Generation..."],
"ground_truth": "RAG retrieves documents and generates answers.",
"mode": "full"
}'

Events emitted (one per line, `data: {...}`):
{"type": "progress", "message": "Initializing evaluation engine...", "step": 0, "total": 5}
{"type": "progress", "message": "Checking answer faithfulness...", "step": 1, "total": 5}
{"type": "progress", "message": "Running hallucination check..."}
{"type": "scores", "scores": {...}}
{"type": "result", "data": {...full EvaluationResponse...}}
[DONE]
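A minimal client for this stream might look like the sketch below. The event parsing is stdlib-only; the commented `httpx` call shows one way it could be wired to the endpoint (assumes a local server and `httpx` installed):

```python
import json

def parse_sse_event(line: str):
    """Parse one SSE line: returns a dict event, "DONE" for the
    terminator, or None for non-data lines (comments, blanks)."""
    if not line.startswith("data: "):
        return None
    body = line[len("data: "):]
    if body == "[DONE]":
        return "DONE"
    return json.loads(body)

# Wiring it up (sketch):
# with httpx.stream("POST", "http://localhost:8000/evaluate/stream",
#                   json=payload, timeout=60) as resp:
#     for raw in resp.iter_lines():
#         event = parse_sse_event(raw)
#         if event == "DONE":
#             break
#         if isinstance(event, dict) and event["type"] == "result":
#             final = event["data"]
```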
Use this when you have a full dataset and want aggregate statistics.
In the UI, batch upload accepts JSON files directly and CSV files with contexts stored either as a JSON array string or as a legacy pipe-delimited field.
curl -X POST http://localhost:8000/evaluate/batch \
-H "Content-Type: application/json" \
-d '{
"samples": [
{
"question": "What is RAG?",
"answer": "RAG is Retrieval-Augmented Generation.",
"contexts": ["RAG combines retrieval with generation."],
"ground_truth": "RAG retrieves documents to generate answers.",
"mode": "full"
},
{
"question": "What does faithfulness measure?",
"answer": "Faithfulness measures if claims trace back to context.",
"contexts": ["Faithfulness measures whether every claim in the answer can be traced back to context."],
"ground_truth": "Faithfulness measures grounding of the answer in retrieved context.",
"mode": "full"
}
]
}'

Response includes:
{
"aggregate": {
"faithfulness": 0.91,
"answer_relevancy": 0.85,
"context_precision": 0.78,
"context_recall": 0.72,
"overall_score": 0.85
},
"verdict_distribution": {"READY": 2},
"total_samples": 2,
"successful": 2,
"failed": 0,
"results": [...],
"errors": []
}

# Generate
curl -X POST http://localhost:8000/generate-dataset \
-H "Content-Type: application/json" \
-d '{"documents": ["..."], "num_questions": 5}' \
-o dataset.json
# Transform to batch format with jq
cat dataset.json | jq '{samples: [.pairs[] | {question, answer, contexts, ground_truth, mode: "full"}]}' \
> batch_input.json
# Evaluate
curl -X POST http://localhost:8000/evaluate/batch \
-H "Content-Type: application/json" \
-d @batch_input.json

After changing your RAG pipeline, compare before and after:
# Save your baseline result
curl -X POST http://localhost:8000/evaluate \
-H "Content-Type: application/json" \
-d '{"question": "...", "answer": "old answer", "contexts": [...], "mode": "full"}' \
-o baseline.json
# Save your candidate result
curl -X POST http://localhost:8000/evaluate \
-H "Content-Type: application/json" \
-d '{"question": "...", "answer": "improved answer", "contexts": [...], "mode": "full"}' \
-o candidate.json
# Compare
curl -X POST http://localhost:8000/evaluate/compare \
-H "Content-Type: application/json" \
-d "{\"baseline\": $(cat baseline.json), \"candidate\": $(cat candidate.json)}"

Response:
{
"deltas": [
{"metric": "faithfulness", "baseline": 0.70, "candidate": 0.90, "delta": 0.20, "direction": "improved"},
{"metric": "answer_relevancy", "baseline": 0.85, "candidate": 0.82, "delta": -0.03, "direction": "regressed"}
],
"summary": "Mixed results: faithfulness improved by 20.0% but answer_relevancy regressed by 3.0%.",
"overall_direction": "mixed"
}

Unit tests cover formatters, trace analysis, and recommendation logic — no API key required.
cd backend
source venv/bin/activate
pytest tests/ -v

Expected output:
tests/test_evaluate.py::TestFormatters::test_clamp_score_valid PASSED
tests/test_evaluate.py::TestFormatters::test_compute_overall_score_all_metrics PASSED
tests/test_evaluate.py::TestFormatters::test_verdict_ready PASSED
tests/test_evaluate.py::TestFormatters::test_verdict_not_ready_high_hallucination PASSED
tests/test_evaluate.py::TestRecommendations::test_critical_faithfulness PASSED
tests/test_evaluate.py::TestRecommendations::test_recommendations_sorted_by_severity PASSED
tests/test_evaluate.py::TestTraceAnalyzer::test_analyze_trace_with_all_scores PASSED
...
Run frontend tests:
cd frontend
npm test

| Verdict | Condition |
|---|---|
| READY | Overall score ≥ 0.80 AND hallucination risk is not high |
| NEEDS_WORK | Overall score ≥ 0.60 but < 0.80 (and hallucination risk not high) |
| NOT_READY | Overall score < 0.60, OR hallucination risk is high |
The overall score is a weighted average:
| Metric | Weight |
|---|---|
| Faithfulness | 35% |
| Answer Relevancy | 30% |
| Context Precision | 20% |
| Context Recall | 15% |
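Worked as a sketch (`compute_overall` here is illustrative, not the repo's `compute_overall_score()`, though it mirrors the documented behavior, including renormalizing the weights when a metric such as context recall is absent):

```python
WEIGHTS = {
    "faithfulness": 0.35,
    "answer_relevancy": 0.30,
    "context_precision": 0.20,
    "context_recall": 0.15,
}

def compute_overall(scores: dict) -> float:
    """Weighted average over the metrics present; weights renormalize
    when a metric (e.g. context_recall without ground truth) is missing."""
    present = {m: w for m, w in WEIGHTS.items() if scores.get(m) is not None}
    total = sum(present.values())
    return round(sum(scores[m] * w for m, w in present.items()) / total, 2)

print(compute_overall({"faithfulness": 0.95, "answer_relevancy": 0.88,
                       "context_precision": 0.80, "context_recall": 0.75}))
# → 0.87, matching the /evaluate example response
```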
| Metric | What it measures | Requires ground truth? |
|---|---|---|
| Faithfulness | Does every claim in the answer trace back to the retrieved context? A score of 1.0 means the answer is fully grounded. | No |
| Answer Relevancy | Does the answer actually address the question asked? Low scores mean the answer is off-topic. | No |
| Context Precision | What fraction of the retrieved chunks are actually relevant? Low scores mean your retriever is returning noise. | No |
| Context Recall | Was all the relevant information present in the retrieved chunks? Low scores mean your retriever missed important content. | Yes |
Hallucination Risk is an additional LLM-as-judge assessment (not from RAGAS) that classifies whether the answer introduces information not in the context: low, medium, or high.
Every evaluation returns prioritized recommendations sorted by severity:
| Severity | Score range | Example fix |
|---|---|---|
| critical | < 0.50 | Rewrite your generation prompt to forbid external knowledge |
| warning | 0.50–0.70 | Tighten top-k, add reranking |
| info | ≥ 0.80 | Monitor production metrics |
RAG Auditor uses Claude (claude-sonnet-4-6) for three purposes:
- RAGAS metrics — Claude is the judge LLM for all RAGAS computations (faithfulness, answer relevancy, context precision, context recall)
- Hallucination detection — A custom Claude prompt analyzes whether the answer introduces unsupported claims, returning `risk_level`, `confidence`, `unsupported_claims`, and `rationale`
- Plain-English explanation — Claude synthesizes scores into a 2–3 sentence summary of what to fix
multi-llm-rag-agent-chat is a production RAG chatbot with dual-LLM routing (GPT-4o / Gemini Flash), ChromaDB vector storage, HuggingFace embeddings (all-MiniLM-L6-v2), and an RLHF feedback loop. RAG Auditor is the ideal complement — it gives you objective metric scores for every dimension of that pipeline.
multi-llm-rag-agent-chat RAG Auditor
───────────────────────── ─────────────────────────────────
Upload documents ──► Generate golden dataset from same docs
Ask question ──► Capture question + answer + contexts
ChromaDB retrieval (top 6)──► Evaluate context_precision / context_recall
GPT-4o or Gemini answer ──► Evaluate faithfulness / answer_relevancy
RLHF re-ranking active ──► Re-run batch eval, compare delta scores
Use the same documents you uploaded to the chatbot to generate ground-truth Q&A pairs:
curl -X POST http://localhost:8000/generate-dataset \
-H "Content-Type: application/json" \
-d '{
"documents": [
"paste the text content of one of your uploaded PDFs or web pages here",
"paste a second document here"
],
"num_questions": 20
}' \
-o golden_dataset.json

These Q&A pairs become your evaluation harness. The `ground_truth` field is what you use to score context recall.
For each question in your golden dataset, query the chatbot and capture the full response including the retrieved source chunks. The chatbot's chat endpoint accepts { "query": "...", "session_id": "..." } and returns answer + sources.
Note on content truncation: The chatbot's `sources[].content` field is currently truncated to 300 characters (`doc.page_content[:300]` in `chat.py`). This is fine for UI display but too short for RAGAS to compute accurate faithfulness and recall scores. See Changes needed to multi-llm-rag-agent-chat below for the one-line fix.
import httpx, json, uuid
golden = json.load(open("golden_dataset.json"))
samples = []
session_id = str(uuid.uuid4())
for pair in golden["pairs"]:
# Query the chatbot backend — field is "query", not "message"
resp = httpx.post(
"http://localhost:8001/chat",
json={"query": pair["question"], "session_id": session_id},
timeout=60,
)
data = resp.json()
samples.append({
"question": pair["question"],
"answer": data["answer"],
# sources[].content is truncated to 300 chars by default — apply the
# full_content fix (see below) to get meaningful RAGAS scores
"contexts": [s["content"] for s in data["sources"]],
"ground_truth": pair["ground_truth"],
"mode": "full",
})
json.dump({"samples": samples}, open("batch_input.json", "w"))

Only one change is required in the chatbot to make it fully compatible with RAG Auditor evaluation.
Problem: backend/routers/chat.py truncates source content to 300 chars:
# current — too short for RAGAS
Source(content=doc.page_content[:300], ...)

Fix: Return the full chunk content (or add a `full_content` field alongside the truncated preview):
# option A — return full content (evaluation-friendly, slightly larger payload)
Source(content=doc.page_content, ...)
# option B — keep the 300-char preview for UI, add full content for eval
Source(
content=doc.page_content[:300], # UI display
full_content=doc.page_content, # evaluation use
...
)

If you go with option B, update the integration script to use s["full_content"] instead of s["content"].
No other changes are required — the chatbot's API shape (query, answer, sources, chunk_ids, llm_used, complexity_score) maps cleanly to RAG Auditor's evaluation input.
curl -X POST http://localhost:8000/evaluate/batch \
-H "Content-Type: application/json" \
-d @batch_input.json \
-o batch_results.json
# Quick summary
cat batch_results.json | jq '.aggregate'

This gives you aggregate scores across your whole document corpus — exactly what you need to decide if the pipeline is production-ready.
The chatbot routes queries above complexity threshold 0.4 to GPT-4o and below to Gemini Flash. Use RAG Auditor's compare mode to measure whether the routing decision actually improves quality:
import httpx, json
question = "What are the key architectural trade-offs in microservices?" # high complexity
contexts = ["...retrieved chunks..."]
ground_truth = "..."
# Force GPT-4o answer (or capture from a high-complexity query)
gpt4o_eval = httpx.post("http://localhost:8000/evaluate", json={
"question": question,
"answer": "GPT-4o generated answer here",
"contexts": contexts,
"ground_truth": ground_truth,
"mode": "full"
}).json()
# Capture Gemini answer (low-complexity routing)
gemini_eval = httpx.post("http://localhost:8000/evaluate", json={
"question": question,
"answer": "Gemini Flash generated answer here",
"contexts": contexts,
"ground_truth": ground_truth,
"mode": "full"
}).json()
# Compare
compare = httpx.post("http://localhost:8000/evaluate/compare", json={
"baseline": gemini_eval,
"candidate": gpt4o_eval
}).json()
print(compare["summary"])
# e.g. "GPT-4o improved faithfulness by 12.0% and answer_relevancy by 8.0%."

This tells you whether the routing threshold (0.4) is correctly placed — if GPT-4o isn't consistently outscoring Gemini on hard questions, you may need to adjust the threshold.
The chatbot's RLHF loop re-ranks ChromaDB results based on user thumbs up/down. To measure whether feedback is actually improving retrieval quality:
# Baseline: evaluate before users have given feedback
curl -X POST http://localhost:8000/evaluate/batch \
-H "Content-Type: application/json" \
-d @batch_input.json \
-o before_rlhf.json
# ... let users interact with the chatbot and submit feedback ...
# Re-run: same questions, same contexts, re-capture from chatbot
curl -X POST http://localhost:8000/evaluate/batch \
-H "Content-Type: application/json" \
-d @batch_input_after.json \
-o after_rlhf.json
# Compare aggregate context_precision scores
cat before_rlhf.json | jq '.aggregate.context_precision'
cat after_rlhf.json | jq '.aggregate.context_precision'

An increase in context_precision after RLHF feedback confirms that re-ranking is surfacing higher-signal chunks. An increase in context_recall confirms fewer relevant chunks are being missed.
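To script that comparison across every aggregate metric at once, a small helper could look like this (a sketch; `aggregate_deltas` is illustrative, and the file names follow the commands above):

```python
import json

def aggregate_deltas(before_path: str, after_path: str) -> dict:
    """Per-metric delta between two batch-eval result files
    (positive values mean the 'after' run scored higher)."""
    before = json.load(open(before_path))["aggregate"]
    after = json.load(open(after_path))["aggregate"]
    return {m: round(after[m] - before[m], 3) for m in before if m in after}

# e.g. aggregate_deltas("before_rlhf.json", "after_rlhf.json")
```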
| Metric | What it reveals about the chatbot |
|---|---|
| context_precision | Whether ChromaDB's cosine similarity retrieval (top 6 → top 3) is pulling in noise |
| context_recall | Whether all-MiniLM-L6-v2 embeddings capture the semantic meaning of your domain |
| faithfulness | Whether GPT-4o / Gemini is staying grounded or hallucinating beyond retrieved chunks |
| answer_relevancy | Whether the complexity router is selecting the right LLM for each query type |
| hallucination_risk | Claude's independent assessment — useful as a cross-check on the routing decision |
Expected baseline:
`all-MiniLM-L6-v2` is a lightweight embedding model optimized for speed, not domain accuracy. If `context_recall` scores consistently fall below 0.70, consider upgrading to a larger embedding model (e.g. `BAAI/bge-large-en-v1.5`) and re-running the batch eval to measure the improvement.
RAGAS provides rigorous, reproducible metrics backed by published evaluation research. Claude provides contextual reasoning that RAGAS cannot — specifically hallucination detection and plain-English explanations. Running both in parallel via asyncio.gather() means neither adds latency to the other.
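The concurrency pattern looks roughly like this (a sketch with stand-in coroutines; the real evaluators live in ragas_evaluator.py and llm_judge.py):

```python
import asyncio

async def run_ragas(sample):   # stand-in for the RAGAS evaluator
    await asyncio.sleep(0.1)
    return {"faithfulness": 0.95}

async def run_judge(sample):   # stand-in for the Claude judge
    await asyncio.sleep(0.1)
    return {"hallucination_risk": "low"}

async def evaluate(sample):
    # both run concurrently, so wall time is roughly the slower of the two
    scores, judge = await asyncio.gather(run_ragas(sample), run_judge(sample))
    return {**scores, **judge}

result = asyncio.run(evaluate({"question": "What is RAG?"}))
print(result)  # → {'faithfulness': 0.95, 'hallucination_risk': 'low'}
```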
Faithfulness (35%) is weighted highest because hallucinating content is the most damaging RAG failure mode. Answer relevancy (30%) is second because an off-topic answer is equally useless regardless of how well it's grounded. Context metrics are weighted lower (20%/15%) because they diagnose the retriever, which is fixable without touching the LLM.
The /evaluate/stream endpoint emits progress events per metric so the UI can update in real time as each RAGAS metric completes. This avoids a blank "loading" state during what can be a 10–30 second evaluation.
READY / NEEDS_WORK / NOT_READY gives developers and stakeholders a clear go/no-go signal without needing to interpret a float. Hallucination risk overrides the score: even a 0.95 overall score is NOT_READY if Claude classifies hallucination as high.
The dataset generator first attempts RAGAS TestsetGenerator (which produces richer, more diverse question types using multi-hop reasoning). If RAGAS is unavailable or fails, it falls back to a direct Claude prompt that produces the same JSON schema. The caller never needs to know which path ran.
Critical issues (score < 0.50) surface first regardless of which metric produced them. This matches how a developer would triage — fix the worst thing first, then warnings, then informational.
All Claude calls are isolated in backend/services/llm_judge.py. To use a different model, change the model parameter:
# llm_judge.py
response = await client.messages.create(
model="claude-sonnet-4-6", # change this
...
)

To use a different provider entirely, replace the anthropic.AsyncAnthropic client with any async client that accepts the same prompt structure.
In backend/services/ragas_evaluator.py, add your metric to metrics_config in stream_ragas_evaluation() and to _run_ragas_sync():
from ragas.metrics import answer_correctness # example new metric
metrics_config = [
...
("answer_correctness", "Checking answer correctness..."),
]

Then add it to the Scores model in backend/models/evaluation.py and the weighting dict in backend/utils/formatters.py.
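The weighting dict update might then look like the following. These rebalanced values, including the 0.10 for the new metric, are illustrative assumptions, not the shipped defaults:

```python
# backend/utils/formatters.py (sketch) — illustrative rebalanced weights;
# the 0.10 for answer_correctness is an assumption, adjust to taste
WEIGHTS = {
    "faithfulness": 0.30,
    "answer_relevancy": 0.25,
    "context_precision": 0.20,
    "context_recall": 0.15,
    "answer_correctness": 0.10,  # newly added metric
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights should sum to 1
```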
Edit score_to_verdict() in backend/utils/formatters.py:
def score_to_verdict(overall_score: float, hallucination_risk) -> str:
if hallucination_risk == "high":
return "NOT_READY"
if overall_score >= 0.85: # raise the bar
return "READY"
if overall_score >= 0.65:
return "NEEDS_WORK"
return "NOT_READY"

Edit the weights dict in compute_overall_score() in backend/utils/formatters.py. Weights are automatically re-normalized if a metric is absent, so you can adjust without breaking missing-metric cases.
Add a router file in backend/routers/ and register it in backend/main.py:
from routers.my_endpoint import router as my_router
app.include_router(my_router)

Interactive docs available at http://localhost:8000/docs
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `POST` | `/evaluate` | Single evaluation (blocking) |
| `POST` | `/evaluate/stream` | Single evaluation with SSE progress |
| `POST` | `/evaluate/batch` | Batch evaluation with aggregates |
| `POST` | `/evaluate/compare` | Compare baseline vs candidate |
| `POST` | `/generate-dataset` | Generate synthetic golden dataset |
rag-auditor/
│
├── .env.example # Template — copy to .env and add ANTHROPIC_API_KEY
├── docker-compose.yml # Orchestrates backend + frontend
│
├── backend/
│ ├── Dockerfile
│ ├── requirements.txt # ragas==0.1.21, anthropic, langchain-anthropic, fastapi
│ ├── main.py # FastAPI app, CORS middleware, router registration
│ │
│ ├── models/
│ │ ├── evaluation.py # EvaluationRequest/Response, Scores, Trace, Recommendations
│ │ └── dataset.py # GenerateDatasetRequest/Response, QAPair
│ │
│ ├── routers/
│ │ ├── evaluate.py # POST /evaluate, /evaluate/stream, /evaluate/batch, /evaluate/compare
│ │ ├── generate_dataset.py # POST /generate-dataset
│ │ └── health.py # GET /health
│ │
│ ├── services/
│ │ ├── ragas_evaluator.py # RAGAS metric runner — sync executor + async SSE streaming
│ │ ├── llm_judge.py # Claude hallucination detector + plain-English explanation
│ │ ├── trace_analyzer.py # Maps scores → retrieval/generation stage issues + recommendations
│ │ └── dataset_generator.py # RAGAS TestsetGenerator with Claude fallback
│ │
│ ├── utils/
│ │ └── formatters.py # clamp_score(), compute_overall_score(), score_to_verdict()
│ │
│ └── tests/
│ └── test_evaluate.py # Unit tests for formatters, trace, recommendations (no API key)
│
└── frontend/
├── Dockerfile
├── vite.config.js # Dev proxy → localhost:8000
├── src/
│ ├── App.jsx # Top-level layout + tab routing
│ ├── components/
│ │ ├── EvaluatorForm.jsx # Single-sample input form
│ │ ├── ResultsDashboard.jsx # Score cards, verdict, explanation
│ │ ├── TraceVisualizer.jsx # Retrieval + generation stage breakdown
│ │ ├── HallucinationBadge.jsx # LOW / MEDIUM / HIGH risk badge
│ │ ├── RecommendationsPanel.jsx # Sorted fix recommendations
│ │ ├── BatchEvaluator.jsx # CSV/JSON upload + aggregate results
│ │ ├── DatasetGenerator.jsx # Doc input + dataset download
│ │ ├── CompareMode.jsx # Baseline vs candidate delta view
│ │ ├── HistoryPanel.jsx # In-memory session history + restore
│ │ └── ScoreCard.jsx # Reusable per-metric score component
│ ├── hooks/
│ │ ├── useEvaluate.js # SSE streaming hook for /evaluate/stream
│ │ └── useHistory.js # In-memory session history state
│ ├── api/
│ │ └── client.js # Axios wrappers for all backend endpoints
│ └── utils/
│ └── scoreHelpers.js # Color/label helpers for score display
└── src/utils/
└── scoreHelpers.test.js # Frontend unit tests
All settings are loaded from .env (copy from .env.example):
| Variable | Default | Required | Description |
|---|---|---|---|
| `ANTHROPIC_API_KEY` | — | Yes | Anthropic API key — used for RAGAS judge LLM, hallucination detection, and explanations |
| `RAGAS_APP_TOKEN` | — | No | RAGAS Cloud token for dashboard and experiment tracking |
| `CORS_ORIGINS` | `http://localhost:3000` | No | Comma-separated list of allowed frontend origins |
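Reading these variables could be sketched as follows. `load_config` is an illustrative helper, not the repo's loader, and it assumes .env has already been loaded into the environment (e.g. via python-dotenv):

```python
import os

def load_config() -> dict:
    """Illustrative settings loader mirroring the table above;
    CORS_ORIGINS is split on commas into a list of origins."""
    return {
        "anthropic_api_key": os.environ["ANTHROPIC_API_KEY"],  # required
        "ragas_app_token": os.environ.get("RAGAS_APP_TOKEN"),  # optional
        "cors_origins": [
            o.strip()
            for o in os.environ.get("CORS_ORIGINS", "http://localhost:3000").split(",")
            if o.strip()
        ],
    }
```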
See CONTRIBUTING.md for how to submit issues, PRs, and run tests.
MIT — see LICENSE