Local RAG fact-checking for paleontology claims — retrieve evidence from a Chroma vector store, reason with a local Llama-2 GGUF model, and return structured verdicts (True / False / Insufficient information).
No cloud API required. Documents stay on your machine.
| Area | What you get |
|---|---|
| Retrieval | Sentence-transformer embeddings + ChromaDB over PDFs, DOCX, and Wikipedia sources |
| Inference | Lazy-loaded Llama-2 via llama-cpp-python (model + retrieval imports load on first check) |
| CLI | python main.py with --top-k and optional --build-dataset |
| HTTP API | FastAPI /ask returns verdict, sources, claim, and human-readable answer |
| Config | CHROMA_DIR, LLAMA_MODEL_DIR, LLAMA_MODEL_PATH via .env (loaded by python-dotenv) |
| Quality | Unit tests for claim normalization, verdict parsing, and the run_fact_check pipeline |
flowchart LR
subgraph ingest
PDF[PDF / DOCX / Wiki]
Split[Chunk & embed]
Chroma[(ChromaDB)]
PDF --> Split --> Chroma
end
subgraph runtime
CLI[CLI / FastAPI]
RAG[retrieve_claim_context]
LLM[Llama-2 GGUF]
CLI --> RAG
RAG --> Chroma
RAG --> LLM
LLM --> Verdict[FactCheckResult]
end
- Ingest: data_builder.py parses sources, splits text, and writes embeddings to Chroma.
- Retrieve: retrieval.py fetches the top-k chunks for a claim.
- Verify: fact_check.py prompts Llama with context and normalizes the model output.
- Respond: the CLI prints text; the API returns JSON with metadata.
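The retrieve, verify, and respond stages can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the project's code: the `Chunk` type, the in-memory corpus, the word-overlap ranking, the constant stand-in model, and `normalize_verdict` are all hypothetical. Only the names `retrieve_claim_context`, `run_fact_check`, and `FactCheckResult` come from this README, and their real signatures may differ.

```python
from dataclasses import dataclass, field
from functools import lru_cache
from typing import Callable, List

VERDICTS = ("True", "False", "Insufficient information")


@dataclass
class Chunk:
    # Hypothetical stand-in for a retrieved document chunk.
    source: str
    text: str


@dataclass
class FactCheckResult:
    # Field names mirror the JSON keys returned by /ask; the real class may differ.
    claim: str
    verdict: str
    sources: List[str] = field(default_factory=list)


# Toy corpus standing in for the Chroma collection (assumption).
CORPUS = [
    Chunk("wiki:Tyrannosaurus", "Tyrannosaurus was a large carnivorous theropod."),
    Chunk("wiki:Ankylosauria", "Ankylosaurs were armored herbivorous dinosaurs."),
]


def retrieve_claim_context(claim: str, top_k: int = 3) -> List[Chunk]:
    # Rank by naive word overlap; the real project uses embedding similarity.
    words = set(claim.lower().split())
    ranked = sorted(CORPUS, key=lambda c: -len(words & set(c.text.lower().split())))
    return ranked[:top_k]


@lru_cache(maxsize=1)
def get_llm() -> Callable[[str], str]:
    # Lazy loading: the expensive model object is created on first use only.
    # A constant stub stands in for the Llama-2 GGUF model; it ignores the prompt.
    return lambda prompt: " true.\n"


def normalize_verdict(raw: str) -> str:
    # Map messy model output onto one of the three canonical labels.
    cleaned = raw.strip().rstrip(".").lower()
    for label in VERDICTS:
        if cleaned == label.lower():
            return label
    return "Insufficient information"


def run_fact_check(claim: str, top_k: int = 3) -> FactCheckResult:
    claim = claim.strip()
    if not claim:
        # Assumed behavior: empty claims short-circuit before retrieval or model loading.
        return FactCheckResult(claim, "Insufficient information")
    chunks = retrieve_claim_context(claim, top_k)
    context = "\n".join(c.text for c in chunks)
    raw = get_llm()(f"Context:\n{context}\n\nClaim: {claim}\nVerdict:")
    return FactCheckResult(claim, normalize_verdict(raw), [c.source for c in chunks])
```

Wrapping the loader in `lru_cache` is one simple way to get the load-on-first-check behavior: an empty claim returns without ever touching the model.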
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
# Edit paths if needed:
# CHROMA_DIR=chroma_db
# LLAMA_MODEL_PATH=/path/to/your/model.gguf
# Download the GGUF model
python model_loader.py
# Build the Chroma dataset
python -c "from data_processing.data_builder import build_dataset; build_dataset()"
Or rebuild the dataset as part of a CLI run:
python main.py --build-dataset "Tyrannosaurus was a carnivore."
# Default demo claim
python main.py
# Custom claim
python main.py "Ankylosaurs were herbivores."
# Retrieve more context chunks (default: 3)
python main.py "Feathered dinosaurs existed." --top-k 5
Example output
Checking claim...
True
Sources: wiki:Feathered_dinosaur, doc:paleo_survey.pdf
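For readers who want to see how the flags used above fit together, here is one way they could be wired with `argparse`. It is a sketch only: the actual parsing in main.py is not shown in this README, and `DEFAULT_CLAIM`, `build_parser`, and `parse_args` are hypothetical names. The `--top-k` default of 3 is taken from the usage notes.

```python
import argparse
from typing import List, Optional

# Placeholder only; the actual default demo claim in main.py is not shown in this README.
DEFAULT_CLAIM = "Tyrannosaurus was a carnivore."


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py", description="Fact-check a paleontology claim."
    )
    # Optional positional: fall back to the demo claim when none is given.
    parser.add_argument("claim", nargs="?", default=DEFAULT_CLAIM, help="claim to check")
    parser.add_argument("--top-k", type=int, default=3, help="context chunks to retrieve")
    parser.add_argument(
        "--build-dataset", action="store_true", help="rebuild the dataset before checking"
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], as a CLI entrypoint would.
    return build_parser().parse_args(argv)
```

With this layout, `parse_args(["Feathered dinosaurs existed.", "--top-k", "5"])` yields a namespace with `claim`, `top_k`, and `build_dataset` attributes.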
uvicorn FastAPI:app --reload

| Endpoint | Description |
|---|---|
| GET /health | Liveness probe (does not load the LLM) |
| POST /ask | Fact-check a claim |
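The POST /ask endpoint can also be called from Python using only the standard library. The sketch below assumes the server is running on uvicorn's default host and port, and that the API accepts the `query` and `top_k` fields; `build_ask_request` and `ask` are hypothetical helper names, and the default `top_k` of 3 is borrowed from the CLI rather than confirmed for the API.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://127.0.0.1:8000"  # assumes the uvicorn default host and port


def build_ask_request(query: str, top_k: int = 3) -> urllib.request.Request:
    # Build (but do not send) a POST /ask request with a JSON body.
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, top_k: int = 3, timeout: float = 120.0) -> Dict[str, Any]:
    # Send the request and decode the JSON response.
    # The generous timeout allows for the model loading on the first check.
    request = build_ask_request(query, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("Spinosaurus was primarily aquatic.", top_k=4)` against a running server should return a dictionary with the `answer`, `verdict`, `sources`, and `claim` keys.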
Request
curl -s -X POST http://127.0.0.1:8000/ask \
-H "Content-Type: application/json" \
-d '{"query": "Spinosaurus was primarily aquatic.", "top_k": 4}'
Response
{
"answer": "True\nSources: wiki:Spinosaurus",
"verdict": "True",
"sources": ["wiki:Spinosaurus"],
"claim": "Spinosaurus was primarily aquatic."
}

| Variable | Default | Purpose |
|---|---|---|
| CHROMA_DIR | chroma_db | Chroma persistence directory |
| LLAMA_MODEL_DIR | models | Folder searched for .gguf files |
| LLAMA_MODEL_PATH | (auto) | Explicit path to a single GGUF model |
Values are read from .env via python-dotenv; environment variables take precedence.
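The defaults and precedence described here can be illustrated with a small resolver. This is a sketch, not the logic in model_loader.py: `resolve_settings` is a hypothetical name, and picking the first .gguf file by name for the "(auto)" case is an assumption. It uses only the standard library; in the project, calling python-dotenv's `load_dotenv()` before reading `os.environ` gives the stated precedence, because by default it does not override variables that are already set.

```python
import os
from pathlib import Path
from typing import Any, Dict, Mapping, Optional


def resolve_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    # Read from the given mapping, or from the process environment by default.
    env = os.environ if env is None else env
    chroma_dir = Path(env.get("CHROMA_DIR", "chroma_db"))
    model_dir = Path(env.get("LLAMA_MODEL_DIR", "models"))

    explicit = env.get("LLAMA_MODEL_PATH")
    if explicit:
        # An explicit path wins over directory search.
        model_path: Optional[Path] = Path(explicit)
    else:
        # "(auto)": assume the first .gguf file, sorted by name, in LLAMA_MODEL_DIR.
        candidates = sorted(model_dir.glob("*.gguf")) if model_dir.is_dir() else []
        model_path = candidates[0] if candidates else None

    return {"chroma_dir": chroma_dir, "model_dir": model_dir, "model_path": model_path}
```

Passing a plain dictionary instead of `os.environ` keeps the resolver easy to test without touching the real environment.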
python -m unittest discover -s tests -v
Covers:
- normalize_claim(): plain text, Python string literals, whitespace, non-string input
- parse_verdict(): canonical labels and empty responses
- run_fact_check(): empty-claim short-circuit, retrieval integration, lazy LLM invocation
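To make the covered cases concrete, here is a rough sketch of what such helpers and their tests might look like. The function bodies are hypothetical and written only to match the cases listed above; the real implementations in fact_check_parsing.py and the real tests in tests/test_fact_check_utils.py may behave differently, for example in what they return for non-string input.

```python
import ast
import unittest

CANONICAL = ("True", "False", "Insufficient information")


def normalize_claim(claim) -> str:
    # Hypothetical behavior: non-strings become "", quoted Python string
    # literals are unquoted, and runs of whitespace collapse to single spaces.
    if not isinstance(claim, str):
        return ""
    text = claim.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        try:
            value = ast.literal_eval(text)
            if isinstance(value, str):
                text = value
        except (ValueError, SyntaxError):
            pass  # not a valid literal; keep the text as typed
    return " ".join(text.split())


def parse_verdict(response) -> str:
    # Hypothetical behavior: match the first line against the canonical labels,
    # falling back to "Insufficient information" for empty or unknown output.
    lines = (response or "").strip().splitlines()
    first = lines[0].strip().rstrip(".").lower() if lines else ""
    for label in CANONICAL:
        if first == label.lower():
            return label
    return "Insufficient information"


class ParsingTests(unittest.TestCase):
    def test_normalize_claim(self):
        expected = "Ankylosaurs were herbivores."
        self.assertEqual(normalize_claim("  Ankylosaurs   were herbivores. "), expected)
        self.assertEqual(normalize_claim("'Ankylosaurs were herbivores.'"), expected)
        self.assertEqual(normalize_claim(None), "")

    def test_parse_verdict(self):
        self.assertEqual(parse_verdict("true\nSources: wiki:Spinosaurus"), "True")
        self.assertEqual(parse_verdict(""), "Insufficient information")
```

Saved under tests/ with a `test_` prefix, a file like this would be picked up by the discover command above.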
PaleoFactCheck/
├── main.py # CLI entrypoint (--top-k, --build-dataset)
├── FastAPI.py # REST API
├── fact_check.py # RAG + lazy Llama inference
├── fact_check_parsing.py # Claim/verdict helpers (tested)
├── model_loader.py # GGUF download & path resolution
├── requirements.txt # Pinned runtime dependencies (exact ==)
├── data_processing/
│ ├── data_builder.py # Ingest pipeline
│ ├── data_layer.py # Chroma + embeddings (honors CHROMA_DIR)
│ ├── retrieval.py # Top-k context for a claim
│ └── … # PDF, DOCX, Wikipedia parsers
└── tests/
└── test_fact_check_utils.py
See repository license. Model weights (Llama-2 GGUF) are subject to Meta's license — download separately via model_loader.py.