PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning
When should an LLM answer? When should it ask? When should it refuse?
PassiveQA trains a language model to make that decision — instead of always generating an answer.
Standard RAG systems and LLMs share one implicit assumption: every query is answerable. In practice, queries arrive incomplete, ambiguous, or about topics entirely absent from the knowledge base. The default LLM response to all three is the same — retrieve and generate — producing hallucinated, overconfident, or misleading answers.
PassiveQA replaces that default with a three-action epistemic decision gate:
| Action | Trigger | Mechanism |
|---|---|---|
| ANSWER | Graph contains a complete reasoning path | All required variables present |
| ASK | Graph is partial; gap is recoverable via dialogue | Targeted clarification question |
| ABSTAIN | Topic absent from KB; gap is irrecoverable | Honest, specific refusal |
This is implemented through:
- A decision-weighted knowledge graph (G₂) whose edge weights encode three-action behavioural supervision
- A 34K-sample finetuning dataset of KG-grounded structured reasoning chains
- A LoRA-finetuned Mistral-7B planner that explicitly models missing variables and produces structured decisions
- A three-agent execution architecture routing to specialised Answer, Ask, and Abstain agents
Key result: The finetuned planner achieves 55.6% macro F1 — a +20.3 pp gain over the best inference-time RAG baseline — with Abstain recall rising from 13.3% to 58.1% and hallucination rate falling from 42.7% to 33.8%. This demonstrates empirically that epistemic calibration cannot be achieved at inference time and must be trained into the model.
- Hallucination reduction: the planner refuses to answer when evidence is insufficient, cutting hallucination at the source rather than detecting it post-hoc
- Clarification seeking: the Ask agent generates targeted, grounded clarification questions for multi-turn QA — not generic "please clarify" prompts
- Honest abstention: the Abstain agent distinguishes between "topic absent from KB" and "topic present but information irrecoverable", giving the user actionable feedback
- Training-time alignment: all three RAG baselines (including one with hybrid retrieval, cross-encoder reranking, query decomposition, and self-reflection) plateau at 34–38% decision accuracy; only finetuning breaks this ceiling
The KG is the central novel artefact of PassiveQA. Unlike standard factual KGs, G₂ encodes epistemic utility — edge weights reflect not just semantic grounding but behavioural supervision from 273,809 training triples.
```mermaid
flowchart TD
SRC["105,420 KB chunks
spaCy en_core_web_sm"]
subgraph G0["Phase 1 — G₀: Raw extraction"]
NER["NER-only nodes
(no noun chunks)"]
DEP["Entity-constrained triples
SVO + prep-object chains"]
STATS0["27,189 nodes · 24,491 edges
avg. w = 0.726"]
end
subgraph G1["Phase 2 — G₁: Semantic validation"]
SBERT["SBERT cosine filter
triple sentence ↔ source chunk
τ = 0.50"]
FREQ["Frequency bonus
log(1+freq) × 0.03"]
STATS1["15,468 nodes · 11,965 edges
12,526 noisy edges removed
avg. w = 0.728"]
end
subgraph G2["Phase 3 — G₂: Decision reinforcement"]
ANS["ANSWER → +0.20"]
ASK_R["ASK → +0.05"]
ABS_P["ABSTAIN → −0.10"]
VAR["?var injection
4,295 placeholder nodes
requires edges w = 0.9"]
STATS2["19,763 nodes · 17,715 edges
avg. w = 0.819"]
end
subgraph POST["Post-processing"]
NOISE["Noise node removal
(generic hubs, pronouns)"]
REANCHOR["?var re-anchoring
SBERT similarity ≥ 0.30
331 re-anchored · 278 removed"]
REWEIGHT["Edge recompute
w = min(0.95, 0.5·sem + 0.5·act)"]
end
SRC --> G0 --> G1 --> G2 --> POST
```
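The three-phase weighting above can be condensed into a short sketch. This is illustrative, not the repository's API: `keep_edge` and `edge_weight` are hypothetical helpers, and clamping the accumulated action term to [0, 1] is an assumption.

```python
import math

# Phase 3 (G₂) decision reinforcement deltas
ACTION_DELTA = {"ANSWER": 0.20, "ASK": 0.05, "ABSTAIN": -0.10}

def keep_edge(sem_score: float, tau: float = 0.50) -> bool:
    """Phase 2 semantic filter: drop triples whose triple-sentence ↔
    source-chunk cosine similarity falls below τ."""
    return sem_score >= tau

def edge_weight(sem_score: float, freq: int, actions: list[str]) -> float:
    """Post-processing recompute: w = min(0.95, 0.5·sem + 0.5·act)."""
    # Phase 2: frequency bonus on top of the SBERT cosine score
    sem = sem_score + math.log(1 + freq) * 0.03
    # Phase 3: accumulate per-action reinforcement (clamp is an assumption)
    act = min(1.0, max(0.0, sum(ACTION_DELTA[a] for a in actions)))
    return min(0.95, 0.5 * sem + 0.5 * act)
```

The 0.95 cap keeps no edge fully saturated, so later reinforcement can still move relative rankings.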
```mermaid
flowchart TD
Q["Query q"]
STATE["Information state
S(q) = ⟨V_known, V_missing, C⟩"]
INCOMP{"Incompleteness
I(q) = |V_missing| / (|V_known| + |V_missing|)"}
A1["I(q) ≈ 0
Complete path in G₂"]
A2["0 < I(q) < 1
Partial path + ?var nodes
Missing is recoverable"]
A3["I(q) ≈ 1
No graph nodes
Missing is irrecoverable"]
ANSWER["ANSWER
Context-grounded
factual response"]
ASK["ASK
Targeted clarification
question"]
ABSTAIN["ABSTAIN
Honest, specific
refusal"]
LOOP["Multi-turn state update
Sₜ₊₁ = Sₜ ∪ resolve(qₜ, aₜ)
resolved var → V_known"]
Q --> STATE --> INCOMP
INCOMP -->|low| A1 --> ANSWER
INCOMP -->|medium| A2 --> ASK
INCOMP -->|high| A3 --> ABSTAIN
ASK -->|user responds| LOOP --> STATE
```
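The gate reduces to a ratio over the query's variable state. A minimal sketch, with the caveat that the thresholds and the `decide`/`resolve` helpers are illustrative: the finetuned planner learns this decision rather than applying hard rules.

```python
def incompleteness(known: list[str], missing: list[str]) -> float:
    """I(q) = |V_missing| / (|V_known| + |V_missing|)."""
    total = len(known) + len(missing)
    return len(missing) / total if total else 1.0

def decide(known, missing, recoverable: bool,
           lo: float = 0.05, hi: float = 0.95) -> str:
    """Three-action gate over the information state S(q).
    Thresholds lo/hi are illustrative stand-ins for I(q) ≈ 0 and I(q) ≈ 1."""
    i = incompleteness(known, missing)
    if i <= lo:                   # complete reasoning path in G₂
        return "ANSWER"
    if i < hi and recoverable:    # partial path, gap fillable via dialogue
        return "ASK"
    return "ABSTAIN"              # topic absent or gap irrecoverable

def resolve(state: dict, var: str) -> dict:
    """Multi-turn update: a resolved variable moves to V_known."""
    return {
        "known": state["known"] + [var],
        "missing": [v for v in state["missing"] if v != var],
    }
```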
| System | Accuracy | Macro F1 | Ask Recall | Abstain Recall | Hallucination Rate |
|---|---|---|---|---|---|
| Baseline RAG | 34.0% | 26.7% | 2.0% | 26.0% | 42.7% |
| Enhanced RAG | 34.0% | 26.7% | 12.0% | 9.0% | 51.7% |
| Decision-aware RAG v3 | 38.0% | 35.3% | 40.0% | 13.3% | 33.8% |
| PassiveQA (finetuned) | 55.6% | 55.6% | 32.6% | 58.1% | 33.8% |
Key finding: All three RAG architectures — including one with hybrid retrieval, cross-encoder reranking, query decomposition, and self-reflection — plateau at 34–38% accuracy. Two epochs of LoRA finetuning on KG-grounded reasoning chains surpasses this ceiling by +20.3 pp macro F1.
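For reading the table: macro F1 averages per-class F1 over the three actions with equal weight, so the jump in Abstain recall moves it directly. A quick sketch of the metric (a generic implementation, not the repository's evaluation code):

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the action labels."""
    f1s = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```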
Paper: PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning
Author: Madhav S Baidya, IIT (BHU) Varanasi
Model: Moodlerz/mistral-planner-aaqa
```
PassiveQA/
├── src/
│   ├── data/
│   │   ├── dataset_builders.py        # Unified dataset from 4 QA benchmarks
│   │   └── variable_population.py     # known_variables / missing_variables via GPT-4o-mini
│   ├── rag/
│   │   ├── baseline_rag.py            # Baseline RAG (chunking + FAISS + Mistral)
│   │   ├── enhanced_rag.py            # Enhanced RAG (semantic chunking, BM25, rerank, reflect)
│   │   └── decision_aware_rag.py      # Decision-aware RAG v3 (hard gating — inference ceiling)
│   ├── kg/
│   │   └── knowledge_graph.py         # 3-phase KG construction G₀ → G₁ → G₂ + post-processing
│   ├── finetune/
│   │   ├── dataset_creation.py        # KG-grounded finetuning sample builder
│   │   ├── train.py                   # LoRA training via SFTTrainer
│   │   └── evaluate.py                # Per-class F1, confusion matrix, per-source breakdown
│   └── agents/
│       ├── agents.py                  # ANSWER / ASK / ABSTAIN response generators
│       └── pipeline.py                # Planner → routing → agent dispatch
│
├── notebooks/
│   ├── 01_dataset_eda.ipynb           # Dataset loading and EDA (4 sources)
│   ├── 02_dataset_construction.ipynb  # Unified dataset construction + balancing
│   ├── 03_rag_experiments.ipynb       # Baseline + Enhanced + Decision-aware RAG
│   ├── 04_knowledge_graph.ipynb       # KG build, checkpointing, EDA
│   ├── 05_finetune_dataset.ipynb      # Finetuning dataset creation + interactive viewer
│   ├── 06_training.ipynb              # LoRA training run
│   └── 07_evaluation_agents.ipynb     # Evaluation + planner + three-agent pipeline
│
├── requirements.txt
└── README.md
```
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
ADAPTER_ID = "Moodlerz/mistral-planner-aaqa"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, ADAPTER_ID)
model.eval()
```

```python
from src.agents.pipeline import run_pipeline

result = run_pipeline(
    query="Am I eligible for the AcmeCorp pension plan?",
    known_variables=["AcmeCorp", "pension plan"],
    graph_triples=[
        "pension plan | require | employment type",
        "pension plan | require | years of service",
        "pension plan | requires | ?unknown_1",
    ],
    missing_variables=["employment type", "years of service"],
    model=model,
    tokenizer=tokenizer,
)

print(result["action"])    # → ASK
print(result["response"])  # → "Regarding pension plan: could you specify employment type?"
```

Follow the notebooks in order:
01 → EDA on four source datasets
02 → Build the 61K unified dataset with action labels and variable states
03 → Reproduce the three RAG baselines (establishes the inference ceiling)
04 → Build G₀ → G₁ → G₂ with checkpointing
05 → Generate the 34K KG-grounded finetuning corpus
06 → LoRA finetune on L4/A100 GPU (~2 hours)
07 → Evaluate planner + run three-agent pipeline
| Dataset | Domain | QA Type | Key signal for PassiveQA |
|---|---|---|---|
| QuAC | Wikipedia | Multi-turn conversational | CANNOTANSWER → Abstain; followup=y → Ask |
| ShARC | Government policy | Rule-based conditional | Evidence chain → explicit Ask supervision |
| HotpotQA | Wikipedia | Multi-hop reasoning | All Answer; bridge/comparison reasoning types |
| ContractNLI | Legal contracts | NLI / entailment | NotMentioned → Abstain (39.2% of annotations) |
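The "key signal" column amounts to a small label-mapping routine per source. A sketch with illustrative function names and simplified field handling (the real builders in `dataset_builders.py` handle more cases):

```python
def map_quac(answer: str, followup: str) -> str:
    """QuAC: unanswerable turns become ABSTAIN; follow-up turns become ASK."""
    if answer == "CANNOTANSWER":
        return "ABSTAIN"
    return "ASK" if followup == "y" else "ANSWER"

def map_contractnli(label: str) -> str:
    """ContractNLI: NotMentioned hypotheses become ABSTAIN."""
    return "ABSTAIN" if label == "NotMentioned" else "ANSWER"
```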
The finetuned planner always produces structured XML output. Zero unparseable responses across 5,218 test samples.
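Because the tags are fixed, the output can be extracted with a few regexes. A minimal illustrative parser (not the repository's API):

```python
import re

TAGS = ("reasoning", "decision", "justification", "clarification_question")

def parse_planner_output(text: str) -> dict:
    """Extract each XML-tagged field from the planner's response;
    missing tags map to None."""
    out = {}
    for tag in TAGS:
        m = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, re.DOTALL)
        out[tag] = m.group(1) if m else None
    return out
```

An example of the planner's tagged output: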
```xml
<reasoning>
Step 1 | Query subject: pension plan, AcmeCorp
Step 2 | Graph search: matched nodes 'pension plan'.
        Relations: require (employment type), require (years of service),
        requires (?unknown_1). Path incomplete.
Step 3 | Variable check: Known: AcmeCorp, pension plan.
        Required but absent: employment type, years of service.
        Failure mode: INSUFFICIENT_VARIABLES.
Step 4 | Decision rationale: graph has partial connections but cannot
        complete the reasoning path without: employment type.
</reasoning>
<decision>
ASK
</decision>
<justification>
Regarding pension plan: could you specify employment type?
</justification>
<clarification_question>
Regarding pension plan: could you specify employment type?
</clarification_question>
```

Every training sample shares this JSON structure across all four sources:
```json
{
  "id": "sharc_000042",
  "query": "Am I eligible for the pension plan?",
  "context": {
    "documents": [{"doc_id": "...", "text": "...", "url": "..."}]
  },
  "state": {
    "known_variables": ["pension plan"],
    "missing_variables": ["employment type", "years of service"],
    "failure_mode": "INSUFFICIENT_VARIABLES",
    "difficulty": "medium",
    "completeness": "partial"
  },
  "action": "ASK",
  "response": "Could you provide your employment type?",
  "metadata": {
    "source": "sharc",
    "multi_turn": false,
    "turn_id": null,
    "dialogue_id": null,
    "requires_reasoning": true,
    "source_specific": {
      "sharc_answer": "Follow-on",
      "evidence_depth": 2,
      "history_depth": 0
    }
  }
}
```

| Hyperparameter | Value |
|---|---|
| Base model | mistralai/Mistral-7B-Instruct-v0.3 |
| LoRA rank (r) | 32 |
| LoRA alpha (α) | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable parameters | ~83M (1.15% of 7.24B) |
| Sequence length | 512 tokens |
| Effective batch size | 32 (4 × 8 grad accum) |
| Learning rate | 2e-4 (cosine decay) |
| Training samples | 9,000 (26% of 34K available) |
| Epochs | 2 |
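In `peft` terms, the table corresponds to roughly the following configuration. This is a sketch, not the repository's training script; `lora_dropout` and `bias` are not stated in the table and are assumptions:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank, per the table above
    lora_alpha=64,        # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,    # not stated in the table; illustrative value
    bias="none",          # assumption
    task_type="CAUSAL_LM",
)
```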
Results are a lower bound. A full run over 34K samples with 1,024-token sequences is expected to yield ~67% macro F1, based on the single-turn vs. multi-turn accuracy gap (78.4% vs. 25.6%).
Contributions are very welcome. PassiveQA is an early-stage research codebase and there are several open directions where community help would be genuinely valuable.
- Full training run: the current model is trained on 26% of the available data for 2 epochs under compute constraints. Running the full 34K dataset with 1,024-token sequences and reporting results would be a high-impact contribution.
- KB expansion: plug in a new domain corpus (medical, legal, financial) using the existing KG construction pipeline and report how the planner generalises out-of-distribution.
- Ablation studies: compare G₀ vs. G₁ vs. G₂ context in the finetuning dataset to quantify the contribution of each KG phase to planner accuracy.
- Variable population alternatives: the current pipeline uses GPT-4o-mini for `known_variables` / `missing_variables` population. Open-source alternatives (Llama-3, Qwen, etc.) would improve reproducibility.
- Multi-turn evaluation: a proper multi-turn test harness where the user actually responds to the Ask agent's clarification question and the pipeline continues.
- Bug reports and documentation improvements: the codebase reflects a research prototype; cleaner APIs, type hints, and docstrings are all useful.
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-contribution`
- Make your changes with clear commit messages
- Open a pull request describing what you changed and why
- If you are running experiments, please include a short results table in the PR description
If you are unsure whether something is worth contributing or want to discuss an idea before implementing it, open an issue — all research directions, questions, and suggestions are welcome.
- Sequence truncation: the 512-token budget truncates multi-turn history, which is the primary cause of the single-turn/multi-turn accuracy gap (78.4% vs. 25.6%)
- KG coverage: 41.3% of finetuning samples have no KG triples — these rely on entity-presence/absence signal only, producing thinner reasoning chains
- Variable population: `missing_variables` were populated via GPT-4o-mini; extraction errors propagate into finetuning supervision labels
- Compute constraints: 2 epochs on 26% of the available finetuning data — results are a lower bound, not a ceiling
- Full training run: 34K samples, 1,024-token sequences, 3 epochs with hyperparameter search
- Contrastive ASK/ABSTAIN training pairs to sharpen the recoverability decision boundary
- KB expansion to domain-specific corpora (medical, financial, legal) where epistemic passivity has the highest stakes
- RLHF alignment of Ask/Abstain behaviour with explicit user preference signals
- Dynamic KG updates for live, continuously changing knowledge bases
- Multi-turn convergence analysis: how many Ask turns does it take to reach Answer?
- Node-type labelling in G₂ to prevent domain-mismatch hallucination (entity present, topic absent)
If you use PassiveQA in your research, please cite:
```bibtex
@article{baidya2025passiveqa,
  title       = {PassiveQA: A Three-Action Framework for Epistemically Calibrated
                 Question Answering via Supervised Finetuning},
  author      = {Baidya, Madhav S},
  year        = {2025},
  institution = {Indian Institute of Technology (BHU) Varanasi}
}
```