Self-Explaining Large Language Models: Mechanistic Interpretability Frameworks for Eliminating Hallucinations in Critical Domains

This repository hosts the official research code and implementation of the Self-Explaining LLM Framework. The work aims to open the "black box" of Large Language Models by adding mechanistic interpretability layers. The framework extracts, in real time, the mathematical and logical provenance (the reasoning path) behind each generated token, so that likely hallucinations can be identified and suppressed before they reach the output. It targets critical domains such as medicine, law, and finance.

📌 Research Vision & Core Concept

As LLMs are deployed in high-stakes environments, their tendency to "hallucinate" (generate confident but false information) poses significant risks. This project addresses the problem through two core technical innovations:

Mechanistic Interpretability: Tracking specific internal neuron activations, attention heads, and residual streams to map how an LLM synthesizes an answer from its trained weights.
Real-time Mathematical Provenance: Generating a human-readable, visual, and logical explanation graph alongside the model's textual output, showcasing the source and confidence score of the generated facts.
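
The activation tracking described above can be illustrated with PyTorch forward hooks. What follows is a minimal sketch, not the repository's implementation: the `ActivationRecorder` class, the toy two-layer model, and the choice to record only `nn.Linear` outputs are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ActivationRecorder:
    """Hypothetical helper: stores module outputs seen during a forward pass."""

    def __init__(self, model: nn.Module, layer_types=(nn.Linear,)):
        self.activations = {}  # module name -> detached output tensor
        self._handles = []
        for name, module in model.named_modules():
            if isinstance(module, layer_types):
                handle = module.register_forward_hook(self._make_hook(name))
                self._handles.append(handle)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Detach so the stored tensor does not keep the autograd graph alive.
            self.activations[name] = output.detach()
        return hook

    def remove(self):
        """Unregister all hooks so later forward passes are unaffected."""
        for handle in self._handles:
            handle.remove()
        self._handles.clear()


# Toy stand-in for a transformer block (assumption: a real model would be
# one of the wrapped LLMs, with hooks on attention and MLP submodules).
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

recorder = ActivationRecorder(toy_model)
with torch.no_grad():
    logits = toy_model(torch.randn(2, 8))
recorder.remove()

captured_shapes = {name: tuple(act.shape) for name, act in recorder.activations.items()}
print(captured_shapes)
```

Modules that return tuples, such as `nn.MultiheadAttention` (which returns the attention output together with the attention weights), would need a hook that unpacks the tuple before detaching.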

🛠️ Key Features & Methodology

Attention & Activation Tracking: Hooks into transformer layers to extract real-time attention weights and neuron firing patterns during inference.
Hallucination Detection Metric: Computes an internal "uncertainty score" based on activation patterns to flag potential hallucinations.
Provenance Graph Generator: Automatically outputs a structured JSON/visual graph mapping the logical derivation of facts.
Critical Domain Evaluation: Fine-tuned and benchmarked on specialized datasets from domains where factual errors are unacceptable (e.g., MedQA, LegalQA, and Financial QA datasets).
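
To make the uncertainty score and the provenance graph output more concrete, here is a small sketch under stated assumptions. The framework's metric is described as activation-based; this example substitutes the normalized entropy of the next-token distribution as a simple stand-in. The `ProvenanceNode` fields, the `0.5` threshold, the linear token-to-token edges, and the sample tokens and probabilities are all hypothetical.

```python
import json
import math
from dataclasses import dataclass, asdict, field


def normalized_entropy(probs):
    """Shannon entropy of a probability distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


@dataclass
class ProvenanceNode:
    token: str
    uncertainty: float
    flagged: bool
    # Hypothetical field: identifiers of layers or heads that contributed.
    sources: list = field(default_factory=list)


def build_provenance(tokens, distributions, threshold=0.5):
    """Score each token and return a JSON-serializable graph."""
    nodes = []
    for token, probs in zip(tokens, distributions):
        score = normalized_entropy(probs)
        nodes.append(ProvenanceNode(token, round(score, 3), score > threshold))
    # Assumption: a simple chain linking each token to the next; a real
    # derivation graph would encode logical dependencies instead.
    edges = [{"from": i, "to": i + 1} for i in range(len(nodes) - 1)]
    return {"nodes": [asdict(n) for n in nodes], "edges": edges}


graph = build_provenance(
    tokens=["aspirin", "inhibits", "COX"],
    distributions=[
        [0.97, 0.01, 0.01, 0.01],  # peaked distribution: low uncertainty
        [0.25, 0.25, 0.25, 0.25],  # uniform distribution: maximal uncertainty
        [0.85, 0.05, 0.05, 0.05],
    ],
)
print(json.dumps(graph, indent=2))
```

In this sketch only the token with the uniform distribution exceeds the threshold and is flagged. A production metric would presumably combine such output-level signals with the activation patterns collected during inference.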

📂 Repository Structure

├── src/
│   ├── interpretability/   # Custom PyTorch hooks for neuron/attention tracking
│   ├── explanation/        # Logic generation and Provenance graph pipelines
│   ├── models/             # Wrapped LLM architectures (e.g., Llama-3, Mistral)
│   └── utils/              # Evaluators for hallucination detection metrics
├── data/                   # Specialized benchmark loaders (MedQA, CoQA, e-SNLI)
├── visualization/          # Dashboard and UI tools for real-time activation maps
├── notebooks/              # Prototyping hooks and mechanistic analysis
├── Literature_Review/      # Centralized research matrices and BibTeX references
└── README.md
