withmartian/k-steering

K-Steering

K-Steering is a framework for steering language model behavior by intervening on hidden activations at inference time. It supports non-linear multi-attribute control and includes Contrastive Activation Addition (CAA) as a linear baseline for comparison.

The framework is based on the paper Beyond Linear Steering: Unified Multi-Attribute Control for Language Models, which introduces Non-Linear K-Steering as a principled alternative to linear combinations of steering vectors for multi-attribute control.
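For orientation, the linear baseline can be sketched in a few lines of plain Python. This is an illustrative toy, not the K-Steering library API: the function names (`mean_vector`, `caa_vector`, `apply_linear_steering`) and the tiny 2-dimensional activations are assumptions made for the example. It computes a CAA steering vector as the difference of mean activations between contrastive examples, then steers toward two attributes at once by adding a weighted sum of such vectors.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def caa_vector(positive_acts, negative_acts):
    """CAA steering vector: mean(positive activations) - mean(negative activations)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]


def apply_linear_steering(activation, steering_vectors, alphas):
    """Add each steering vector to the activation, scaled by its own alpha."""
    steered = list(activation)
    for vector, alpha in zip(steering_vectors, alphas):
        steered = [s + alpha * v for s, v in zip(steered, vector)]
    return steered


# Hypothetical toy activations for two attributes (illustration only).
vec_a = caa_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [2.0, 1.0]])
vec_b = caa_vector([[0.0, 0.0], [2.0, 0.0]], [[1.0, 2.0], [1.0, 0.0]])

# Multi-attribute linear steering: a weighted sum of the two vectors.
steered = apply_linear_steering([0.5, 0.5], [vec_a, vec_b], alphas=[2.0, 1.0])
print(steered)
```

Because the update is a fixed weighted sum, it is the same for every input activation. The non-linear approach instead derives the update from a classifier's gradient at the current activation.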

Figure 1. For an activation vector A, a steering loss penalizes higher logits from a classifier on A for undesired labels and rewards higher logits for desired labels. By backpropagating this loss through the classifier, we obtain the steered activations A' = A − α∇L.
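The update in Figure 1 can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the library's implementation: the classifier is a hypothetical one-hidden-layer tanh MLP with hand-picked weights, the steering loss is assumed to be the mean undesired logit minus the mean desired logit, and every name (`mlp_logits`, `steering_loss`, `steering_grad`) is invented for this example. The gradient with respect to A is computed by manual backpropagation.

```python
import math


def mlp_logits(a, W1, b1, W2, b2):
    """Toy classifier: logits = W2 @ tanh(W1 @ a + b1) + b2."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, a)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return hidden, logits


def steering_loss(logits, desired, undesired):
    """Assumed loss: mean undesired logit minus mean desired logit."""
    return (sum(logits[i] for i in undesired) / len(undesired)
            - sum(logits[i] for i in desired) / len(desired))


def steering_grad(a, W1, b1, W2, b2, desired, undesired):
    """Gradient of the steering loss with respect to the activation a."""
    hidden, _ = mlp_logits(a, W1, b1, W2, b2)
    # dL/dlogit: +1/|undesired| for undesired labels, -1/|desired| for desired.
    dlogits = [0.0] * len(W2)
    for i in undesired:
        dlogits[i] += 1.0 / len(undesired)
    for i in desired:
        dlogits[i] -= 1.0 / len(desired)
    # Backpropagate through the output layer, then through tanh.
    dhidden = [sum(dlogits[k] * W2[k][j] for k in range(len(W2)))
               for j in range(len(hidden))]
    dpre = [dh * (1.0 - h * h) for dh, h in zip(dhidden, hidden)]
    # Backpropagate through the input layer to reach the activation.
    return [sum(dpre[j] * W1[j][i] for j in range(len(W1)))
            for i in range(len(a))]


# Hypothetical weights: 3-dim activation, 2 hidden units, 3 labels.
W1, b1 = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]], [0.0, 0.1]
W2, b2 = [[1.0, -0.5], [-0.7, 0.8], [0.3, 0.2]], [0.0, 0.0, 0.0]
activation = [1.0, -1.0, 0.5]
desired, undesired, alpha = [0], [1], 0.1

# One steering step: A' = A - alpha * grad(L).
grad = steering_grad(activation, W1, b1, W2, b2, desired, undesired)
steered = [x - alpha * g for x, g in zip(activation, grad)]

loss_before = steering_loss(mlp_logits(activation, W1, b1, W2, b2)[1], desired, undesired)
loss_after = steering_loss(mlp_logits(steered, W1, b1, W2, b2)[1], desired, undesired)
print(loss_before, loss_after)
```

Since the classifier is non-linear, the gradient depends on where A sits in activation space, so different inputs receive different updates. In practice an autograd framework would replace the manual backward pass.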

Documentation

Full documentation is available at docs.withmartian.com/k-steering.

Quick Start

Requirements

  • Python 3.11 or later
  • uv package manager

Installation

git clone https://github.com/withmartian/k-steering.git
cd k-steering
uv sync

Usage

Run the included example to verify your setup:

uv run python examples/01_k_steer.py

Additional examples covering CAA steering, external datasets, and alpha sweeps are in the examples/ directory.

Sample Python Notebook

Try K-Steering without any local setup using this Colab notebook.

License

MIT — see LICENSE.
