Infrastructure for capturing LLM activations and SAE (Sparse Autoencoder) features, training probes for prompt-maliciousness detection, and evaluating out-of-distribution generalization with Leave-One-Dataset-Out (LODO).
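The LODO protocol above amounts to group-wise cross-validation: the probe is trained on all source datasets except one and evaluated on the held-out dataset. A minimal sketch with scikit-learn's `LeaveOneGroupOut` and a logistic-regression probe on toy stand-in activations (the data, probe choice, and dataset count here are illustrative assumptions, not the project's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

# Toy stand-ins: rows are activation vectors, labels mark malicious prompts,
# and `dataset_ids` records which source dataset each example came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))
y = rng.integers(0, 2, size=300)
dataset_ids = np.repeat([0, 1, 2], 100)  # three hypothetical source datasets

# Leave-One-Dataset-Out: train the probe on every dataset but one,
# test on the held-out dataset, and repeat once per dataset.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=dataset_ids):
    probe = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], probe.predict_proba(X[test_idx])[:, 1])
    print(f"held-out dataset {dataset_ids[test_idx][0]}: AUROC = {auc:.3f}")
```

Reporting per-held-out-dataset AUROC rather than a pooled score is what makes LODO an out-of-distribution test: each fold's test distribution was never seen in training.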
Architecture and training decisions determine how observable an LLM is. Transformer activations carry decision-quality signals that output confidence misses; training can preserve or erase them during convergence, even as predictive performance improves.
Evaluation framework for different methods of probing and steering LLM activations to mitigate Chain-of-Thought unfaithfulness. Research project by Giovanni M. Occhipinti (University of Bologna), Alessandro Abate and Nandi Schoots (University of Oxford).