Self-Explaining Large Language Models: Mechanistic Interpretability Frameworks for Eliminating Hallucinations in Critical Domains
This repository hosts the official research and implementation of the Self-Explaining LLM Framework. The work aims to open the "black box" of Large Language Models by adding Mechanistic Interpretability layers. The framework extracts, in real time, the mathematical and logical provenance (the reasoning path) behind every generated token, so that hallucinations can be identified and suppressed before they reach the output. It targets critical domains such as medicine, law, and finance.
As LLMs are deployed in high-stakes environments, their tendency to "hallucinate" (generate confident but false information) poses significant risks. This project addresses that risk through two core technical innovations:
- Mechanistic Interpretability: Tracking specific internal neuron activations, attention heads, and residual streams to map how an LLM synthesizes an answer from its trained weights.
- Real-time Mathematical Provenance: Generating a human-readable, visual, and logical explanation graph alongside the model's textual output, showcasing the source and confidence score of the generated facts.
The framework provides the following components:
- Attention & Activation Tracking: Hooks into transformer layers to extract real-time attention weights and neuron firing patterns during inference.
- Hallucination Detection Metric: Computes an internal "uncertainty score" based on activation patterns to flag potential hallucinations.
- Provenance Graph Generator: Automatically outputs a structured JSON/visual graph mapping the logical derivation of facts.
- Critical Domain Evaluation: Fine-tuned and benchmarked on specialized datasets from domains with little tolerance for error (e.g., MedQA, LegalQA, and Financial QA datasets).
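As a rough illustration of how an uncertainty score and a provenance graph could fit together, the sketch below scores each generated token and emits a JSON graph with flagged nodes. It rests on several assumptions that are not taken from this repository: the score is the normalized entropy of the next-token distribution (a simple stand-in for the activation-based metric described above), and the function names, the `threshold` parameter, and the `nodes`/`edges` schema are hypothetical.

```python
import json
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uncertainty_score(logits):
    """Normalized Shannon entropy in [0, 1]; 1.0 means a uniform distribution.

    Assumption: entropy over output logits stands in for the
    activation-based metric, which is not specified here.
    """
    if len(logits) < 2:
        return 0.0
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def build_provenance_graph(tokens, logits_per_token, threshold=0.5):
    """Return a JSON-serializable graph with one node per generated token.

    The schema (id, token, uncertainty, flagged) is hypothetical.
    """
    nodes = []
    for index, (token, logits) in enumerate(zip(tokens, logits_per_token)):
        score = uncertainty_score(logits)
        nodes.append({
            "id": index,
            "token": token,
            "uncertainty": round(score, 3),
            "flagged": score > threshold,  # candidate hallucination
        })
    # Placeholder edges: each token is linked to its predecessor.
    edges = [{"source": i - 1, "target": i} for i in range(1, len(nodes))]
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    tokens = ["Aspirin", "inhibits", "COX"]
    logits = [[8.0, 0.5, 0.1, 0.0], [2.0, 1.9, 1.8, 1.7], [6.0, 1.0, 0.5, 0.2]]
    print(json.dumps(build_provenance_graph(tokens, logits), indent=2))
```

In this toy input, the second token has nearly uniform logits, so it is the only node marked `flagged`. A real pipeline would replace the predecessor edges with links derived from attention or attribution data.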
├── src/
│   ├── interpretability/   # Custom PyTorch hooks for neuron/attention tracking
│   ├── explanation/        # Logic generation and provenance graph pipelines
│   ├── models/             # Wrapped LLM architectures (e.g., Llama-3, Mistral)
│   └── utils/              # Evaluators for hallucination detection metrics
├── data/                   # Specialized benchmark loaders (MedQA, CoQA, e-SNLI)
├── visualization/          # Dashboard and UI tools for real-time activation maps
├── notebooks/              # Prototyping hooks and mechanistic analysis
├── Literature_Review/      # Centralized research matrices and BibTeX references
└── README.md
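The kind of hook that `src/interpretability/` describes can be sketched with PyTorch's standard `register_forward_hook`. This is a minimal illustration, not the repository's implementation: `TinyBlock` is a toy stand-in for a single transformer layer rather than a wrapped Llama-3 or Mistral block, and `ActivationTracker` and its method names are hypothetical.

```python
import torch
import torch.nn as nn


class TinyBlock(nn.Module):
    """Toy transformer layer used only for illustration."""

    def __init__(self, dim=16, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        # need_weights=True makes the attention module return its weights.
        attn_out, _ = self.attn(x, x, x, need_weights=True)
        x = x + attn_out        # residual stream after attention
        return x + self.mlp(x)  # residual stream after the MLP


class ActivationTracker:
    """Hypothetical helper that records module outputs via forward hooks."""

    def __init__(self):
        self.records = {}
        self.handles = []

    def attach(self, name, module):
        def hook(_module, _inputs, output):
            # Detach so stored tensors do not keep the autograd graph alive.
            if isinstance(output, tuple):
                self.records[name] = tuple(
                    o.detach() if o is not None else None for o in output
                )
            else:
                self.records[name] = output.detach()

        self.handles.append(module.register_forward_hook(hook))

    def detach_all(self):
        for handle in self.handles:
            handle.remove()
        self.handles.clear()


if __name__ == "__main__":
    torch.manual_seed(0)
    block = TinyBlock()
    tracker = ActivationTracker()
    tracker.attach("attn", block.attn)          # (output, attention weights)
    tracker.attach("mlp_hidden", block.mlp[1])  # post-GELU neuron activations
    with torch.no_grad():
        block(torch.randn(2, 5, 16))            # batch=2, sequence=5, dim=16
    _, weights = tracker.records["attn"]
    print(weights.shape, tracker.records["mlp_hidden"].shape)
    tracker.detach_all()
```

The attention weights captured here are averaged over heads, which is the default for `nn.MultiheadAttention`. Per-head analysis would need `average_attn_weights=False` in the attention call, and hooking a real pretrained model depends on how that model exposes its attention modules.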