Code-Studio-AI-Research-Lab/Self-Explaining-LLM

Self-Explaining Large Language Models: Mechanistic Interpretability Frameworks for Eliminating Hallucinations in Critical Domains

License: MIT | Framework: PyTorch | Domain: Explainable AI

This repository hosts the official research and implementation of the Self-Explaining LLM Framework. The research aims to open the "black box" of Large Language Models by adding Mechanistic Interpretability layers. The framework extracts, in real time, the mathematical and logical provenance (the reasoning path) behind each generated token, so that likely hallucinations can be identified and suppressed before they are emitted. It targets critical domains such as medicine, law, and finance.


📌 Research Vision & Core Concept

As LLMs are deployed in high-stakes environments, their tendency to "hallucinate" (generate confident but false information) poses significant risks. This project tackles this challenge through two core technical innovations:

  • Mechanistic Interpretability: Tracking specific internal neuron activations, attention heads, and residual streams to map how an LLM synthesizes an answer from its trained weights.
  • Real-time Mathematical Provenance: Generating a human-readable, visual, and logical explanation graph alongside the model's textual output, showcasing the source and confidence score of the generated facts.
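
As a rough illustration of the activation-tracking idea, the sketch below records the outputs of selected layers with PyTorch forward hooks. It is a minimal example under stated assumptions, not the repository's implementation: `ActivationRecorder` and `toy_model` are hypothetical names, and a small feed-forward stack stands in for a real transformer block.

```python
import torch
import torch.nn as nn


class ActivationRecorder:
    """Stores the output of every hooked module, keyed by module name.

    Hypothetical helper for illustration only.
    """

    def __init__(self, model: nn.Module, layer_types=(nn.Linear,)):
        self.activations = {}
        self._handles = []
        for name, module in model.named_modules():
            if isinstance(module, layer_types):
                handle = module.register_forward_hook(self._make_hook(name))
                self._handles.append(handle)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Detach so the stored tensor does not keep the autograd graph alive.
            self.activations[name] = output.detach().cpu()
        return hook

    def remove(self):
        """Unregister all hooks so later forward passes run untouched."""
        for handle in self._handles:
            handle.remove()
        self._handles.clear()


# Toy stand-in for a transformer MLP block (an assumption, not a real LLM).
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 8))

recorder = ActivationRecorder(toy_model)
with torch.no_grad():
    toy_model(torch.randn(2, 5, 8))  # shape: (batch, tokens, hidden)
recorder.remove()

for name, act in recorder.activations.items():
    print(name, tuple(act.shape))
```

The same pattern could in principle be pointed at attention or residual-stream modules by changing `layer_types`, though which modules expose those signals depends on the wrapped architecture.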

🛠️ Key Features & Methodology

  1. Attention & Activation Tracking: Hooks into transformer layers to extract real-time attention weights and neuron firing patterns during inference.
  2. Hallucination Detection Metric: Computes an internal "uncertainty score" based on activation patterns to flag potential hallucinations.
  3. Provenance Graph Generator: Automatically outputs a structured JSON/visual graph mapping the logical derivation of facts.
  4. Critical Domain Evaluation: Fine-tuned and benchmarked on specialized datasets from domains where errors are unacceptable (e.g., MedQA, LegalQA, and Financial QA datasets).
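
To make the uncertainty score and the structured output more concrete, here is a small sketch that flags tokens and emits JSON records. The README does not specify how the score is computed from activation patterns, so this example swaps in a simple proxy: the normalized entropy of the next-token distribution. The function names, the 0.5 threshold, the example logits, and the record fields are all assumptions made for illustration.

```python
import json
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n): 0 means certain, 1 means uniform."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def provenance_record(token, logits, threshold=0.5):
    """Build one JSON-serializable record; the field names are hypothetical."""
    score = normalized_entropy(softmax(logits))
    return {
        "token": token,
        "uncertainty": round(score, 4),
        "flagged": score > threshold,  # True marks a possible hallucination
    }


# Made-up logits: one peaked distribution, one nearly flat distribution.
records = [
    provenance_record("aspirin", [9.0, 1.0, 0.5, 0.2]),
    provenance_record("ibuprofen", [1.1, 1.0, 0.9, 1.0]),
]
print(json.dumps(records, indent=2))
```

Output entropy is only a stand-in here. A score derived from internal activations, as the feature list describes, would replace `normalized_entropy` while the record structure could stay the same.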

📂 Repository Structure

├── src/
│   ├── interpretability/   # Custom PyTorch hooks for neuron/attention tracking
│   ├── explanation/        # Logic generation and Provenance graph pipelines
│   ├── models/             # Wrapped LLM architectures (e.g., Llama-3, Mistral)
│   └── utils/              # Evaluators for hallucination detection metrics
├── data/                   # Specialized benchmark loaders (MedQA, CoQA, e-SNLI)
├── visualization/          # Dashboard and UI tools for real-time activation maps
├── notebooks/              # Prototyping hooks and mechanistic analysis
├── Literature_Review/      # Centralized research matrices and BibTeX references
└── README.md
