An agentic retrieval-augmented generation system implementing fourteen state-of-the-art techniques within a single-GPU deployment constraint.
Documentation · Architecture · Techniques · API Reference
Contemporary retrieval-augmented generation systems typically implement two to four retrieval techniques — hybrid search, reranking, and perhaps query expansion — operating within a fixed, single-pass pipeline. Forge challenges this paradigm by unifying fourteen distinct techniques into a coherent agentic architecture where the language model itself orchestrates retrieval decisions through iterative reasoning.
The system introduces three architectural departures from conventional RAG:
- Agentic retrieval — a LangGraph state machine replaces the fixed retrieve-then-generate pipeline, enabling the model to select tools, evaluate intermediate results, and re-retrieve when evidence is insufficient.
- Multi-granularity indexing — documents are decomposed into a four-level hierarchy (document summaries, section abstracts, semantic chunks, and atomic propositions), each level serving different query characteristics.
- Pre-generation verification — a corrective retrieval gate evaluates document relevance before generation, while post-generation self-verification audits each claim against its cited sources.
All fourteen techniques operate within a 16GB VRAM budget through careful architectural partitioning: the language model occupies the GPU exclusively, while embedding, reranking, and vector operations execute on CPU with negligible latency impact.
Forge implements each technique to address a specific, empirically documented failure mode in retrieval-augmented generation:
| # | Technique | Failure Mode Addressed | Implementation |
|---|---|---|---|
| 1 | Agentic RAG | Single-shot retrieval misses multi-hop evidence | LangGraph ReAct loop with 6 tools, max 3 iterations |
| 2 | CRAG Quality Gate | Irrelevant documents degrade generation quality | Cross-encoder scoring with CORRECT/AMBIGUOUS/INCORRECT classification |
| 3 | Multi-Hop Reasoning | Complex questions requiring cross-document synthesis | Agent decomposes queries and follows cross-references iteratively |
| 4 | Contextual Retrieval | Chunks lose meaning without surrounding context | LLM-generated context prefix prepended before embedding (Anthropic, 2024) |
| 5 | Proposition Indexing | Chunk-level granularity too coarse for factual queries | Dense-X atomic claim extraction indexed as L3 points |
| 6 | Hierarchical 4-Level | Single granularity cannot serve diverse query types | L0 summaries → L1 sections → L2 chunks → L3 propositions |
| 7 | BGE-M3 Tri-Modal Vectors | Separate dense and sparse pipelines add complexity | Single model producing dense, sparse, and ColBERT representations |
| 8 | ColBERT Late Interaction | Dense vectors compress away token-level distinctions | Qdrant multi-vector MaxSim scoring for precision reranking |
| 9 | Knowledge Graph | Vector similarity cannot capture structural relationships | Entity-relationship extraction with Redis adjacency traversal |
| 10 | Self-Verification | Generated claims may lack source support | Post-generation claim extraction and source matching |
| 11 | 6-Signal Confidence | Binary confidence insufficient for nuanced reliability | Weighted composite of retrieval, CRAG, verification, and alignment signals |
| 12 | Query Decomposition | Multi-part questions retrieve poorly as single queries | Agent-driven decomposition into targeted sub-queries |
| 13 | HyDE | Queries and documents occupy different semantic spaces | Hypothetical document generation for improved embedding alignment |
| 14 | Parent Expansion | Retrieved chunks lack surrounding context | Automatic expansion to parent section upon retrieval |
┌─────────────────────────────────────────────────────────────────┐
│ FORGE V5 │
│ │
│ Desktop Shell (Tauri + React) │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Agent Reasoning Panel · Chat · Sources · Verification │ │
│ └────────────────────────────┬─────────────────────────────┘ │
│ │ SSE (11 event types) │
│ Backend (FastAPI + LangGraph)│ │
│ ┌────────────────────────────┴─────────────────────────────┐ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────┐ │ │
│ │ │ LangGraph Agent (7 nodes) │ │ │
│ │ │ │ │ │
│ │ │ analyze → retrieve ⟲ crag_gate → rerank │ │ │
│ │ │ ↓ │ │ │
│ │ │ generate → verify → finalize │ │ │
│ │ │ │ │ │
│ │ │ Tools: semantic_search · proposition_search │ │ │
│ │ │ keyword_search · graph_traverse │ │ │
│ │ │ chunk_read · section_read │ │ │
│ │ └─────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────┐ ┌──────────┐ ┌────────────────────┐ │ │
│ │ │ BGE-M3 │ │ CRAG │ │ Self-Verification │ │ │
│ │ │ (CPU) │ │ Gate │ │ Claim Auditor │ │ │
│ │ └──────────┘ └──────────┘ └────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │ │ │
│ Infrastructure │ │ │
│ ┌───────────────────────┴──────────────┴───────────────────┐ │
│ │ Qdrant (dense + sparse + ColBERT) │ Redis (4-layer cache)│ │
│ │ llama.cpp (GPU, 10-14GB VRAM) │ Knowledge Graph │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Component | Location | Memory |
|---|---|---|
| LLM (14B Q4_K_M) | GPU | 10–14 GB |
| KV Cache | GPU | 2–4 GB |
| BGE-M3 Embeddings | CPU | 2.3 GB RAM |
| ColBERT Reranker | CPU | 1 GB RAM |
| Cross-Encoder (CRAG) | CPU | 400 MB RAM |
| Qdrant + Redis | CPU/Disk | Variable |
The architectural insight: by restricting GPU allocation exclusively to LLM inference and executing all other operations on CPU, the system achieves competitive latency while remaining deployable on consumer hardware.
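As an illustration of this partitioning, the sketch below loads a quantized model fully onto the GPU via llama-cpp-python while pinning the CRAG cross-encoder to CPU. The model paths and names are placeholders, not Forge's actual configuration:

```python
# Sketch of the GPU/CPU split described above.
# Assumptions: llama-cpp-python (built with CUDA) and sentence-transformers are
# installed; model paths/names below are placeholders, not Forge's defaults.
from llama_cpp import Llama
from sentence_transformers import CrossEncoder

# LLM: every layer offloaded to the GPU (n_gpu_layers=-1), owning the VRAM budget.
llm = Llama(
    model_path="models/qwen2.5-14b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to CUDA
    n_ctx=8192,       # KV cache size drives the 2-4 GB GPU overhead
)

# CRAG cross-encoder: explicitly pinned to CPU so it never competes for VRAM.
# (BGE-M3 embeddings and the ColBERT reranker are kept on CPU the same way.)
crag_scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device="cpu")

score = crag_scorer.predict(
    [("what is the leave policy?", "Members accrue 2.5 days of leave per month.")]
)
```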
# Clone and launch infrastructure
git clone https://github.com/zhadyz/tactical-rag-system.git && cd tactical-rag-system
docker compose -f backend/docker-compose.yml up -d qdrant redis
# Start the backend
cd backend && pip install -r requirements.txt
uvicorn forge.main:app --host 0.0.0.0 --port 8000
# Upload a document
curl -X POST http://localhost:8000/api/documents/upload -F "file=@document.pdf"
# Trigger the ingestion pipeline (hierarchical chunking + contextual enrichment + propositions + graph)
curl -X POST http://localhost:8000/api/ingest
# Query in agentic mode with streaming
curl -N http://localhost:8000/api/query/stream \
-H "Content-Type: application/json" \
  -d '{"question": "What are the key findings?", "mode": "agentic", "use_context": true}'

The streaming response emits eleven SSE event types: agent_thinking, tool_call, tool_result, crag_evaluation, retrieval_complete, token, sources, verification, metadata, error, and done.
Documents undergo a six-stage transformation before becoming queryable:
Document → Parse → Hierarchical Chunk → Contextual Enrich → Extract Propositions → Build Graph → Index
- Structure-Aware Parsing — Preserves heading hierarchy, tables, and cross-references across PDF, DOCX, Markdown, and plaintext formats.
- Hierarchical Chunking — Produces four levels: L0 document summaries (LLM-generated, 200–300 words), L1 section abstracts, L2 semantic chunks (similarity-based splitting, 200–2000 characters), and L3 atomic propositions.
- Contextual Enrichment — For each L2 chunk, the LLM generates a 50–100 token context prefix situating the chunk within its document and section. This prefix is prepended before embedding, following Anthropic's contextual retrieval methodology, which reported 49% fewer retrieval failures (a minimal sketch of this step, together with tri-modal encoding, follows this list).
- Proposition Extraction — Each L2 chunk yields up to ten atomic, self-contained factual claims indexed as L3 points with parent references, enabling precision retrieval for factual queries.
- Knowledge Graph Construction — Entities (regulations, roles, procedures) and relationships (authorizes, requires, references, supersedes) are extracted and stored in Qdrant payloads with a Redis adjacency list for traversal.
- Multi-Representation Indexing — BGE-M3 produces dense, sparse, and ColBERT vectors in a single forward pass. All three representations are stored as Qdrant named vectors, eliminating the need for separate embedding and BM25 pipelines.
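A minimal sketch of the enrichment and encoding stages, assuming only the FlagEmbedding package. The prompt wording and the `generate_prefix` stub are illustrative stand-ins for Forge's actual LLM call, not its ingestion code:

```python
# Sketch of contextual enrichment + tri-modal encoding (stages 3 and 6 above).
# Assumption: FlagEmbedding is installed; generate_prefix stands in for Forge's
# LLM completion step and prompt, which are not reproduced here.
from FlagEmbedding import BGEM3FlagModel

CONTEXT_PROMPT = (
    "Document: {doc_summary}\nSection: {section_title}\nChunk: {chunk}\n\n"
    "Write a 50-100 token prefix situating this chunk within the document."
)

def generate_prefix(doc_summary: str, section_title: str, chunk: str) -> str:
    # Placeholder: in the real pipeline this is an LLM completion over CONTEXT_PROMPT.
    return f"[Context] From section '{section_title}' of a document about {doc_summary}."

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=False)  # pinned to CPU in Forge's deployment

chunk = "Members accrue 2.5 days of leave per month of active service."
enriched = generate_prefix("leave policy", "Accrual Rates", chunk) + "\n" + chunk

# One forward pass yields all three representations stored as Qdrant named vectors.
out = model.encode(
    [enriched],
    return_dense=True,         # 1024-d dense vector
    return_sparse=True,        # lexical weights (sparse vector)
    return_colbert_vecs=True,  # per-token vectors for MaxSim reranking
)
dense = out["dense_vecs"][0]
sparse = out["lexical_weights"][0]
colbert = out["colbert_vecs"][0]
```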
The LangGraph state machine executes a ReAct loop with seven nodes:
- Analyze — Classify query type (factual, procedural, comparative, temporal, multi-hop, complex). Decompose multi-part questions into sub-queries.
- Retrieve — The agent selects from six tools based on query characteristics:
  - semantic_search — Dense + sparse hybrid via BGE-M3
  - proposition_search — L3 atomic claim index for precision
  - keyword_search — Sparse-only for exact terms and identifiers
  - graph_traverse — Follow entity relationships
  - chunk_read — Expand a chunk with parent context
  - section_read — Read full L1 section
- CRAG Gate — Cross-encoder scores each retrieved document. Documents classified as CORRECT (>0.7) proceed; AMBIGUOUS (0.3–0.7) are expanded to parent sections; INCORRECT (<0.3) are discarded. If fewer than two documents survive, the agent re-retrieves with a refined query (maximum one retry); the wiring sketch after this list illustrates this routing.
- Rerank — ColBERT late interaction scoring via Qdrant multi-vector MaxSim, with cross-encoder fallback.
- Generate — LLM produces a citation-aware response with explicit [Source N] references.
- Verify — Claims are extracted from the generated answer and individually verified against source documents.
- Finalize — Six-signal confidence score computed from retrieval quality, answer completeness, semantic alignment, source consistency, CRAG quality, and verification coverage.
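A minimal wiring sketch of this loop, assuming only the `langgraph` package. The state fields, node stubs, and threshold values are illustrative stand-ins for Forge's `agent.py` and `crag.py`, not their actual implementations:

```python
# Minimal LangGraph skeleton of the 7-node ReAct loop described above.
# Node bodies are stubs; only the graph topology and the CRAG routing are shown.
from typing import List, TypedDict
from langgraph.graph import END, StateGraph

class AgentState(TypedDict):
    question: str
    documents: List[dict]  # retrieved docs, each assumed to carry a CRAG score
    answer: str
    retries: int

def analyze(state: AgentState) -> dict:
    return {}

def retrieve(state: AgentState) -> dict:
    # Each retrieval pass counts toward the single allowed retry.
    return {"documents": state["documents"], "retries": state["retries"] + 1}

def crag_gate(state: AgentState) -> dict:
    # Drop INCORRECT documents (illustrative threshold, matching the <0.3 cutoff above).
    kept = [d for d in state["documents"] if d.get("score", 0.0) >= 0.3]
    return {"documents": kept}

def rerank(state: AgentState) -> dict:
    return {}

def generate(state: AgentState) -> dict:
    return {"answer": "..."}

def verify(state: AgentState) -> dict:
    return {}

def finalize(state: AgentState) -> dict:
    return {}  # six-signal confidence would be computed here

def route_after_crag(state: AgentState) -> str:
    # Fewer than two surviving documents triggers at most one refined re-retrieval.
    if len(state["documents"]) < 2 and state["retries"] < 2:
        return "retrieve"
    return "rerank"

graph = StateGraph(AgentState)
for name, fn in [("analyze", analyze), ("retrieve", retrieve), ("crag_gate", crag_gate),
                 ("rerank", rerank), ("generate", generate), ("verify", verify),
                 ("finalize", finalize)]:
    graph.add_node(name, fn)

graph.set_entry_point("analyze")
graph.add_edge("analyze", "retrieve")
graph.add_edge("retrieve", "crag_gate")
graph.add_conditional_edges("crag_gate", route_after_crag,
                            {"retrieve": "retrieve", "rerank": "rerank"})
graph.add_edge("rerank", "generate")
graph.add_edge("generate", "verify")
graph.add_edge("verify", "finalize")
graph.add_edge("finalize", END)

app = graph.compile()
result = app.invoke({"question": "What are the key findings?",
                     "documents": [], "answer": "", "retries": 0})
```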
For latency-sensitive queries, direct mode bypasses the agent loop: single-shot hybrid retrieval → reranking → generation. Typical response time is under two seconds with a warm cache.
Forge streams eleven event types via Server-Sent Events, providing full transparency into the agent's reasoning process:
agent_thinking → "Classifying query as PROCEDURAL..."
tool_call → semantic_search({ query: "...", k: 20 })
tool_result → 15 documents, top score: 0.94, 45ms
crag_evaluation → 8 correct, 3 ambiguous, 1 rejected
retrieval_complete → 8 documents approved
token → "Active" "duty" "members" ...
sources → [{ content, score, metadata }]
verification → 3 claims, 3/3 supported
metadata → { confidence: 0.92, timing: {...} }
done
The frontend renders these events in real-time through an Agent Reasoning Panel, providing users with full observability into the retrieval and verification process.
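For programmatic consumers outside the desktop shell, the stream can be read with any SSE-capable HTTP client. A minimal sketch using `httpx`, assuming standard `event:`/`data:` SSE framing (the exact payload shapes are defined by the event list above, not by this sketch):

```python
# Minimal SSE consumer for /api/query/stream, using httpx.
# Assumes "event:" / "data:" framing; payload fields per event type are assumptions.
import httpx

payload = {"question": "What are the key findings?", "mode": "agentic", "use_context": True}

with httpx.stream("POST", "http://localhost:8000/api/query/stream",
                  json=payload, timeout=None) as resp:
    event = None
    for line in resp.iter_lines():
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = line.split(":", 1)[1].strip()
            if event == "token":
                print(data, end="", flush=True)  # answer text as it is generated
            elif event in {"agent_thinking", "crag_evaluation", "verification", "metadata"}:
                print(f"\n[{event}] {data}")
            elif event == "done":
                break
```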
| Layer | Technology | Purpose |
|---|---|---|
| Desktop Shell | Tauri 2.9 + Rust | Cross-platform native wrapper |
| Frontend | React 19, TypeScript 5.6, Tailwind CSS | Agent reasoning UI, streaming display |
| State | Zustand 5 | Agent step lifecycle management |
| Backend | FastAPI, Python 3.11+ | API layer with rate limiting and injection detection |
| Agent | LangGraph 1.1 | ReAct state machine with tool orchestration |
| LLM | llama.cpp (CUDA) | GPU-accelerated inference, 80–100 tok/s |
| Embeddings | BGE-M3 (FlagEmbedding) | Tri-modal: dense + sparse + ColBERT |
| Vector DB | Qdrant | Named vectors, multi-vector, HNSW |
| Cache | Redis | 4-layer: exact, semantic, embedding, proposition |
| Reranking | ColBERT + Cross-Encoder | Token-level precision + CRAG quality gate |
forge/
├── backend/forge/ # Python backend (38 modules)
│ ├── ingestion/ # 6-stage document processing pipeline
│ │ ├── parser.py # Structure-aware document parsing
│ │ ├── chunker.py # Hierarchical 4-level chunking
│ │ ├── contextual.py # Anthropic contextual retrieval
│ │ ├── propositions.py # Dense-X proposition extraction
│ │ ├── graph.py # Knowledge graph construction
│ │ └── indexer.py # BGE-M3 multi-representation indexing
│ ├── retrieval/ # Agentic query pipeline
│ │ ├── agent.py # LangGraph 7-node state machine
│ │ ├── tools.py # 6 retrieval tools
│ │ ├── crag.py # Corrective retrieval quality gate
│ │ ├── reranker.py # ColBERT + cross-encoder
│ │ └── verifier.py # Post-generation claim verification
│ ├── generation/ # Answer synthesis
│ │ ├── generator.py # Citation-aware LLM generation
│ │ ├── confidence.py # 6-signal confidence scoring
│ │ └── streaming.py # 11-type SSE event protocol
│ ├── infrastructure/ # Shared services
│ │ ├── vectorstore.py # Qdrant named vectors + ColBERT
│ │ ├── embeddings.py # BGE-M3 tri-modal service
│ │ ├── cache.py # 4-layer Redis cache
│ │ ├── llm.py # llama.cpp + Ollama factory
│ │ └── models.py # Runtime model hotswapping
│ ├── container.py # Dependency injection
│ └── config.py # YAML-driven configuration
├── src/ # React frontend
│ ├── components/Agent/ # Agent reasoning visualization
│ ├── hooks/useStreamingChat.ts # 11-event SSE handler
│ └── store/useStore.ts # Agent state management
├── src-tauri/ # Tauri Rust shell
└── docs-site/ # Documentation (forge.onyxlab.ai)
Integration testing verifies the complete system without external services, using Qdrant's in-memory mode:
FORGE V5 END-TO-END TEST RESULTS
═══════════════════════════════════
[PASS] Configuration
[PASS] Schema Validation (11 event types)
[PASS] VectorStore (4-level hierarchy, named vectors)
[PASS] Document Parsing
[PASS] Hierarchical Chunking (L0/L1/L2)
[PASS] LangGraph Agent (CompiledStateGraph)
[PASS] CRAG Quality Gate
[PASS] Confidence Scoring (65.9%, 6 signals)
[PASS] SSE Streaming (7 event types validated)
[PASS] FastAPI Application (22 routes)
10/10 passed, 0 failed
Additional validation:
- Python: 38 files, zero syntax errors, zero cross-module import mismatches
- TypeScript: Zero type errors across frontend
- Vite Build: 2,371 modules compiled in 3.3 seconds
- Documentation: 24 pages build cleanly via Next.js static export
- Docker Compose: Configuration validates without error
Comprehensive documentation is available at forge.onyxlab.ai, including:
- Getting Started — Installation, configuration, and first query
- Techniques — Deep dives into each of the fourteen techniques
- Architecture — System design, pipeline flows, and streaming protocol
- API Reference — All twelve endpoints with request/response schemas
MIT
Built by hollowed_eyes. To our knowledge, no other open-source system combines all fourteen techniques into a single, GPU-efficient pipeline.