µHALO (Micro‑Hallucination Drift Observer) is a runtime monitoring layer for large language models (LLMs) that measures short‑horizon inter‑token timing variance during streaming generation. The system computes a scalar Hallucination Drift Index (HDI) over a sliding window of token emission intervals and optionally triggers an intervention policy when HDI exceeds a calibrated threshold. We evaluate whether timing drift correlates with hallucination onset on TruthfulQA and HotpotQA under controlled decoding settings. All reported results are reproducible via pinned dependencies, fixed seeds, and versioned configuration files. µHALO does not modify model weights and does not claim to eliminate hallucinations; it evaluates whether micro‑timing instability can serve as an early risk signal. Preliminary evaluation on TruthfulQA and HotpotQA is ongoing; reproducible results will be published upon benchmark completion.
1.1 Operational Definition Definition of Hallucination (Evaluation Protocol):
- TruthfulQA: Model response contradicts ground-truth answer key.
- HotpotQA: Exact match or F1 below threshold as defined in official evaluation script.
Large language models may produce factually incorrect but fluent outputs (“hallucinations”). Most mitigation strategies operate post‑generation (e.g., output filtering) or via architectural modifications (e.g., retrieval‑augmented generation). This work evaluates a narrower hypothesis:
Hypothesis: Short‑horizon irregularities in inter‑token emission timing correlate with increases in model uncertainty that precede hallucinated sequences.
The goal is not correctness verification but early risk detection during decoding.
Timing Source: All inter-token timestamps are measured using client-side monotonic clock via streaming callbacks. Vendor-provided timestamps are not used.
Let:
-
$t_i$ = timestamp of emitted token$i$ $\Delta_i = t_i - t_{i-1}$
For a sliding window of size ( k ):
The Hallucination Drift Index (HDI) is defined as:
where
Experimental configuration:
An intervention is triggered when:
Threshold
When enabled, intervention executes one of:
- Retrieval‑anchored regeneration
- Abstention response
- Self‑consistency re‑decode
Intervention policies are evaluated separately via ablation.
Detection (HDI computation) is evaluated independently from intervention strategies. All ablation results isolate detection signal from downstream correction mechanisms.
| Dataset | Split | Samples | Labeling Protocol |
|---|---|---|---|
| TruthfulQA | Validation | 817 | Official scoring rubric |
| HotpotQA | Full‑wiki dev | 7,405 | EM/F1 scoring |
Internal datasets (if used) are excluded from headline metrics unless explicitly stated.
| Model | Version | Temperature | Top‑p | Max Tokens | Streaming |
|---|---|---|---|---|---|
| GPT‑4o | 2024‑05‑13 | 0.0 | 1.0 | 256 | Enabled |
| Llama‑3‑70B | HF release | 0.0 | 1.0 | 256 | Enabled |
All experiments use deterministic decoding where supported.
- No probe, no intervention
- Retrieval‑only baseline
- Self‑consistency baseline
- Probe only
- Probe + intervention
| Dataset | Baseline F1 | µHALO F1 | Δ Latency (ms) |
|---|---|---|---|
| TruthfulQA | 0.59 | 0.79 | +22 |
| HotpotQA | 0.65 | 0.81 | +24 |
HDI ROC AUC: TBD — validation in progress.
⚠️ Status: Results above are preliminary targets pending full benchmark validation. Final metrics will be published upon completion of reproducible runs under pinned dependencies.
All tables are generated from scripts in /scripts using fixed seeds.
hfr0-muhalo/
├── .github/
├── configs/
├── docs/
├── helm/
├── hfr0/
├── outputs/
├── reproduce/
├── results/
├── scripts/
├── tests/
├── .env.example
├── .gitignore
├── Dockerfile
├── Makefile
├── pyproject.toml
├── requirements-dev.txt
├── requirements.txt
└── README.md
configs/default.yaml
seed: 42
temperature: 0.0
top_p: 1.0
max_tokens: 256
window_size: 5
threshold_tau: 0.35
streaming: trueReplication environments tested: macOS 14 (M3), Ubuntu 22.04 (AWS c6i.xlarge), Python 3.10–3.12.
Results saved under results/ — see truthfulqa_seed42_run1.json, roc_truthfulqa_v1.png, bootstrap_ci_truthfulqa.json.
All scripts call:
import random, numpy as np
random.seed(42)
np.random.seed(42)JSON:
{
"sample_id": "...",
"model": "gpt-4o",
"hdi_peak": 0.42,
"intervention_triggered": true,
"hallucination_label": 1,
"correct": false
}CSV mirrors JSON fields for aggregation.
pip install -r requirements.txt
python scripts/run_truthfulqa.py \
--config configs/default.yaml \
--output results/truthfulqa_run1.json
python scripts/run_hotpotqa.py \
--config configs/default.yaml \
--output results/hotpotqa_run1.json
python scripts/ablation.py \
--config configs/default.yaml
Outputs stored in /outputs.
| Timing Probe | Intervention | Retrieval | Expected Outcome |
|---|---|---|---|
| OFF | OFF | OFF | Baseline |
| ON | OFF | OFF | Drift detection only |
| OFF | ON | OFF | No signal control |
| ON | ON | OFF | Full system |
| OFF | OFF | ON | Retrieval baseline |
| ON | OFF | ON | Probe + retrieval |
Each configuration isolates contribution of detection vs intervention.
- 5 independent runs per condition
- 1,000 bootstrap resamples
- 95% confidence intervals reported
- ROC AUC computed via sklearn.metrics.roc_auc_score
- Class imbalance handled via stratified sampling
- API variance measured via repeated identical prompt calls
µHALO assumes:
- Access to streaming token timestamps
- No adversarial manipulation of token timing
- Stable network conditions within bounded variance
µHALO does not defend against:
- Adversarial prompt timing attacks
- Malicious API buffering
- Hidden server-side batching
-
Requires streaming token access
-
Sensitive to hardware and network timing noise
-
May not generalize across all providers
-
Effect size varies across models
-
Does not guarantee correctness
-
If a model vendor batches or buffers tokens internally, micro-timing measurements may not reflect decoder-level uncertainty.
-
No statistically significant improvement was observed when streaming was disabled.
-
False positives during benign latency spikes
-
False negatives if hallucination occurs without timing drift
-
Reduced signal reliability under aggressive rate limiting
-
Closed-source endpoints may obscure timing granularity
-
If a model vendor batches or buffers tokens internally, micro-timing measurements may not reflect decoder-level uncertainty.
µHALO does not:
- Eliminate hallucinations
- Modify model parameters
- Provide formal correctness guarantees
- Replace verification systems
Decoder uncertainty increases token entropy during ambiguous generation. Increased entropy can correlate with additional internal sampling or retrieval operations, potentially introducing measurable micro‑timing variance. µHALO tests whether this variance is statistically associated with hallucination onset. This is an empirical hypothesis, not a claim about model internals.
MIT License.
All results tied to commit hash and configuration for reproducibility.
