Infrastructure for capturing LLM activations and SAE (Sparse Autoencoder) features, training probes for prompt-maliciousness detection, and evaluating out-of-distribution generalization with Leave-One-Dataset-Out (LODO).
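The LODO protocol above amounts to group-wise cross-validation: the probe is trained on all source datasets except one and evaluated on the held-out dataset. A minimal sketch with scikit-learn's `LeaveOneGroupOut` and a logistic-regression probe on toy stand-in activations (the data, probe choice, and dataset count here are illustrative assumptions, not the project's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

# Toy stand-ins: rows are activation vectors, labels mark malicious prompts,
# and `dataset_ids` records which source dataset each example came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = rng.integers(0, 2, size=300)
dataset_ids = np.repeat([0, 1, 2], 100)  # three hypothetical source datasets

# Leave-One-Dataset-Out: train the probe on every dataset but one,
# test on the held-out dataset, and repeat once per dataset.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=dataset_ids):
    probe = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], probe.predict_proba(X[test_idx])[:, 1])
    print(f"held-out dataset {dataset_ids[test_idx][0]}: AUROC = {auc:.3f}")
```

Reporting per-held-out-dataset AUROC rather than a pooled score is what makes LODO an out-of-distribution test: each fold's test distribution was never seen in training.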
Architecture and training decisions determine how observable an LLM is. Transformer activations carry decision-quality signals that output confidence misses; training can preserve or erase them during convergence, even as predictive performance improves.
Evaluation framework for different methods of probing and steering LLM activations to mitigate Chain-of-Thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate and Nandi Schoots (University of Oxford).