Patchscope-For-Neuron

Patchscope Residual Vector Reactivation Pipeline

This repository implements a Patchscope-style mechanistic interpretability experiment on the gemma-2-2b model.

The goal is to test whether a residual stream vector that strongly activates a neuron in its original context can reactivate the same neuron when injected into a new prompt.

Below is a step-by-step breakdown of how the Version 2 pipeline works and how it maps to the Patchscopes framework.

Step 1 Setup and Initialization

Model Loading

The script loads the gemma-2-2b model using TransformerLens (HookedTransformer), which allows precise surgical interventions inside model layers using hooks.

SAE Loading

A Sparse Autoencoder (SAE) trained on Layer 12 residual stream is loaded. This is used later to translate dense vectors into interpretable features.

Step 2 Building the Activation Atlas (Profiling)

Scanning the Dataset

The pipeline iterates through a subset of the NeelNanda/pile-10k dataset.

Monitoring Neurons

For each text sample, it records activations at:

Layer 12 → MLP output (hook_mlp_post)

Finding Maximum Activations

For every neuron, the script stores:

Highest activation value
Text index
Token position
Trigger token

This creates an Activation Atlas, mapping neurons to their strongest triggering contexts.

Step 3 Feature Extraction (Isolating Vector X)

Target Selection

From the atlas, the neuron with the highest recorded activation is selected (e.g., Neuron Y = 8526).

Source Prompt Execution

The model is re-run on the exact text snippet that triggered this neuron.

Residual Vector Extraction

Right before Layer 12 processes the token, the script copies the residual stream vector:

hook_resid_pre

This copied representation is called Vector X.

Patchscopes Mapping

In Patchscopes terminology, this step extracts a hidden representation from the source sequence.

Step 4 SAE Verification (Interpretability Step)

Vector X is passed through the Sparse Autoencoder.

The SAE decomposes it into human-interpretable features such as:

Mathematical concepts
Set-related reasoning
Structural language features

This helps interpret what information is encoded inside Vector X.

Step 5 Patchscope Intervention (Inject and Observe)

Target Prompt Setup

A new prompt is defined:

"A detailed description of x is that"

The token position of x is identified as the injection point.

Hook 1 — Injection

A hook is placed at:

hook_resid_pre

When the model processes token x, the hook:

Replaces the current residual vector
Injects Vector X

This detaches the representation from its original context.

Hook 2 — Observation

A second hook is placed at:

hook_mlp_post

This records how Neuron Y responds to the injected vector in the new context.

Generation Phase

After injection:

The model continues autoregressive generation
The patched hidden state is translated into natural language

Patchscopes Framework Mapping

This pipeline directly mirrors Patchscopes methodology:

Patchscopes Concept	Implementation
Source representation	Extracted residual vector X
Representation transfer	Injection via resid_pre hook
Translation context	Target prompt with "x"
Interpretation	Generated text + neuron activation

Results

--- PART 1: Profiling Dataset (Activation Atlas) ---

Profiling layer 12 neurons...

100%|██████████| 150/150 [01:05<00:00, 2.31it/s]

Exporting Activation Atlas to gemma2_neuron_mapping_layer_12.json...

--- PART 2: Feature Extraction ---

Targeting Neuron: 8526

Max Activation: 9.3750 from token '3'

Extracted Vector X from Layer 12, Position 94

--- PART 3: SAE Verification ---

Top 5 Activated SAE Features in Vector X:

Feature_315: Activation = 41.0987
Feature_6810: Activation = 31.1544
Feature_2291: Activation = 24.8089
Feature_164: Activation = 18.4591
Feature_11195: Activation = 17.2775

--- PART 4: Patchscope Autoregressive Generation ---

Target Prompt: "A detailed description of x is that"

Injecting Vector X into token ' x' at position 5

=== Experiment Results ===

Original Source Prompt:

"Udzungwa red colobus ... Category:Primates of Africa"
Original Trigger Token: '3'
Trigger Context (±5 tokens): "...wa red colobus Category:Endemic fauna of..."
Patched Target Prompt: "A detailed description of x is that"
Generated Attributes (Post-Patch): "it is a set of all the possible outcomes of an experiment. The set of all possible outcomes"
Targeting Neuron: 8526
Original Max Activation: 9.3750
Observed Activation Post-Patch: 3.1406
Neuron 8526 activation during patchscope: 3.1406

--- PART 3: SAE Verification --- Top 5 Activated SAE Features in Vector X:

Feature_315: Activation = 41.0987 Neuronpedia: https://neuronpedia.org/gemma-2-2b/12-gemmascope-res-16k/315
- explanation : references to categories or classifications of subjects
Feature_6810: Activation = 31.1544 Neuronpedia: https://neuronpedia.org/gemma-2-2b/12-gemmascope-res-16k/6810
- explanation : terms related to software licensing and legal disclaimers
Feature_2291: Activation = 24.8089 Neuronpedia: https://neuronpedia.org/gemma-2-2b/12-gemmascope-res-16k/2291
- explanation : echnical or mathematical symbols and terms
Feature_164: Activation = 18.4591 Neuronpedia: https://neuronpedia.org/gemma-2-2b/12-gemmascope-res-16k/164
- explantion : references to Wikipedia and its articles
Feature_11195: Activation = 17.2775 Neuronpedia: https://neuronpedia.org/gemma-2-2b/12-gemmascope-res-16k/11195
- explanation : references to cats

Why This Design Works

By placing:

Injection before the layer
Observation after the layer

The pipeline enables:

Causal testing of neuron behavior
Direct measurement of reactivation
Natural-language decoding of hidden features

Core Insight

This experiment reveals whether:

Neurons respond to intrinsic features inside residual vectors rather than only to their original textual context.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
Demo_patchscope.ipynb		Demo_patchscope.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Patchscope-For-Neuron

Patchscope Residual Vector Reactivation Pipeline

Step 1 Setup and Initialization

Model Loading

SAE Loading

Step 2 Building the Activation Atlas (Profiling)

Scanning the Dataset

Monitoring Neurons

Finding Maximum Activations

Step 3 Feature Extraction (Isolating Vector X)

Target Selection

Source Prompt Execution

Residual Vector Extraction

Patchscopes Mapping

Step 4 SAE Verification (Interpretability Step)

Step 5 Patchscope Intervention (Inject and Observe)

Target Prompt Setup

Hook 1 — Injection

Hook 2 — Observation

Generation Phase

Patchscopes Framework Mapping

Results

Why This Design Works

Core Insight

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Patchscope-For-Neuron

Patchscope Residual Vector Reactivation Pipeline

Step 1 Setup and Initialization

Model Loading

SAE Loading

Step 2 Building the Activation Atlas (Profiling)

Scanning the Dataset

Monitoring Neurons

Finding Maximum Activations

Step 3 Feature Extraction (Isolating Vector X)

Target Selection

Source Prompt Execution

Residual Vector Extraction

Patchscopes Mapping

Step 4 SAE Verification (Interpretability Step)

Step 5 Patchscope Intervention (Inject and Observe)

Target Prompt Setup

Hook 1 — Injection

Hook 2 — Observation

Generation Phase

Patchscopes Framework Mapping

Results

Why This Design Works

Core Insight

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages