K-Steering is a framework for steering language model behavior by intervening on hidden activations at inference time. It supports non-linear multi-attribute control and includes Contrastive Activation Addition (CAA) as a linear baseline for comparison.
The framework is based on the paper "Beyond Linear Steering: Unified Multi-Attribute Control for Language Models", which introduces Non-Linear K-Steering as a principled alternative to linear combinations of steering vectors for multi-attribute control.
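For context, the linear baseline can be sketched in a few lines. This is a toy illustration of the general CAA idea (a steering vector is the difference between mean activations on contrasting examples), not this package's API. The helper names (`caa_vector`, `combine`, `apply_steering`) and all numbers are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def caa_vector(positive, negative):
    """Steering vector: mean(positive activations) - mean(negative activations)."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def combine(vectors, weights):
    """Linear combination of per-attribute steering vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) for i in range(dim)]


def apply_steering(activation, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy 3-dimensional activations for two attributes (made-up numbers).
formal_pos = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
formal_neg = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
polite_pos = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
polite_neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

formal = caa_vector(formal_pos, formal_neg)
polite = caa_vector(polite_pos, polite_neg)

# Multi-attribute control in the linear setting: a weighted sum of vectors.
combined = combine([formal, polite], weights=[1.0, 0.5])
steered = apply_steering([1.0, 1.0, 1.0], combined, alpha=0.5)
print(formal, polite, combined, steered)
```

Because the combined vector is the same for every input, the linear approach cannot adapt the steering direction to the current activation. That limitation is what the non-linear method is meant to address.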
Figure 1. For an activation vector A, a steering loss penalizes higher logits from a classifier on A for undesired labels and rewards higher logits for desired labels. By backpropagating this loss through the classifier, we obtain the steered activations A' = A − α∇L.
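The update in the caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the framework's implementation: the classifier is a tiny hand-written two-layer MLP with made-up weights, the loss is assumed to be the sum of undesired logits minus the sum of desired logits, and only a single gradient step is taken. All function names are hypothetical.

```python
import math


def forward(a, W1, b1, W2, b2):
    """Two-layer MLP classifier: returns hidden activations and class logits."""
    h = [math.tanh(sum(w * x for w, x in zip(row, a)) + b) for row, b in zip(W1, b1)]
    z = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, z


def steering_loss(z, desired, undesired):
    """Assumed loss: high undesired logits are penalized, high desired logits rewarded."""
    return sum(z[k] for k in undesired) - sum(z[k] for k in desired)


def loss_gradient(a, params, desired, undesired):
    """Backpropagate the steering loss through the classifier to the activation."""
    W1, b1, W2, b2 = params
    h, z = forward(a, W1, b1, W2, b2)
    # dL/dz: +1 for undesired labels, -1 for desired labels, 0 otherwise.
    c = [0.0] * len(z)
    for k in undesired:
        c[k] += 1.0
    for k in desired:
        c[k] -= 1.0
    # Back through the output layer and the tanh non-linearity.
    d_pre = [
        (1.0 - h[j] ** 2) * sum(c[k] * W2[k][j] for k in range(len(z)))
        for j in range(len(h))
    ]
    # Back through the first layer to the input activation.
    return [sum(d_pre[j] * W1[j][i] for j in range(len(h))) for i in range(len(a))]


def steer(a, params, desired, undesired, alpha):
    """One steering step: A' = A - alpha * grad L."""
    g = loss_gradient(a, params, desired, undesired)
    return [x - alpha * gi for x, gi in zip(a, g)]


# Made-up classifier weights: 2-dim activation, 2 hidden units, 2 labels.
params = (
    [[1.0, 0.0], [0.0, 1.0]],    # W1
    [0.0, 0.0],                  # b1
    [[1.0, -1.0], [-1.0, 1.0]],  # W2
    [0.0, 0.0],                  # b2
)
activation = [0.5, -0.5]
steered = steer(activation, params, desired=[0], undesired=[1], alpha=0.1)

loss_before = steering_loss(forward(activation, *params)[1], [0], [1])
loss_after = steering_loss(forward(steered, *params)[1], [0], [1])
print(loss_before, loss_after)  # the loss should drop after the step
```

Unlike a fixed steering vector, the gradient here depends on the activation itself, so the direction of the update changes from input to input. The step size α controls how strongly the activation is moved.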
Full documentation is available at docs.withmartian.com/k-steering.
- Python 3.11 or later
- uv package manager
git clone https://github.com/withmartian/k-steering.git
cd k-steering
uv sync

Run the included example to verify your setup:

uv run python examples/01_k_steer.py

Additional examples covering CAA steering, external datasets, and alpha sweeps are in the examples/ directory.
Try K-Steering without any local setup using this Colab notebook.
MIT — see LICENSE.