Subliminal Learning: Replication & Mechanistic Experiments

Independent replication and extension of "Subliminal Learning: Language Models Transmit Behavioral Traits Via Hidden Signals in Data" and the companion paper "Token Entanglement in Subliminal Learning".

The core claim: a student model can acquire a teacher's capabilities through distillation on ghost (auxiliary) logits alone — with no access to labels, and without the student ever seeing the teacher's task data. This repo investigates how much subliminal learning occurs under different conditions, why, and what drives it.

Results at a Glance

Topic A — MNIST Knowledge Distillation

Experiment	Key Finding
Distillation epochs (1→40)	Monotonically increases: 19.7% → 77.2%
Teacher training epochs (0→20)	Inverted-U: peaks at ep=2 (62.6%), then drops to 35.1% at ep=20 despite teacher improving to 97.4%
Data distribution	Counterintuitive: random noise (54.9%) >> real MNIST images (15.1%)
Maximize experiment	EPOCHS_TEACHER=2 + EPOCHS_DISTILL=100 + uniform noise → 70.3% (+15.4 points, i.e. +28% relative, over the 54.9% default baseline)

Baseline (default settings): teacher 94.3%, student aux-only 54.9%, cross-model control 12.4% (near chance).

Topic B — Llama-3.2-1B Token Entanglement

Experiment	Key Finding
Animal → number entanglement	23/25 animals show >2x effect — widespread, not cherry-picked
Base vs. instruct model	Nearly identical unembedding geometry; entanglement persists through RLHF
Cosine similarity predicts entanglement?	Partially (0.085 vs 0.066 random) — but top cosine-sim numbers fail to increase P(owl); max ratio only 3.3x vs 200x+ for actual entangled pairs

Repo Structure

topic_a.py                        # Baseline: N=25 parallel MLPs, teacher/student distillation
topic_a_exp1.py                   # Experiment 1: distillation epoch sweep [1,2,5,10,20,40]
topic_a_exp2.py                   # Experiment 2: teacher epoch sweep [0,1,2,5,10,20]
topic_a_exp3.py                   # Experiment 3: data distribution comparison (5 distributions)
topic_a_maximize.py               # Maximize: best config based on experiments 1-3

topic_b_part1.py                  # Baseline: owl↔087 entanglement demo (Llama-3.2-1B-Instruct)
topic_b_part2.py                  # Bidirectional entanglement: number→animal
topic_b_part3.py                  # Cosine similarity & dot product geometry analysis
topic_b_step2_extended.py         # 25-animal study: cherry-picking test
topic_b_step3_base_vs_instruct.py # Base model vs instruct: does RLHF change entanglement?
topic_b_utils.py                  # Shared utilities (model loading, prompting, token search)

plots_a/                          # All Topic A experiment figures + CSVs
plots_b/                          # All Topic B experiment figures + CSVs

Topic A — Mechanistic Findings

Setup

MLP with two hidden layers (784→256→256→13), 25 parallel models trained simultaneously
Teacher trains on MNIST (digit labels); student distills via KL on ghost logits only (last 3 of 13 outputs)
Student digit accuracy measured with random, never-updated digit readout weights
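
The ghost-only objective can be illustrated with a minimal, dependency-free sketch. It assumes the 13 outputs are ordered as 10 digit logits followed by 3 ghost logits and that the loss is KL(teacher || student); the KL direction, the helper names, and the example logits are illustrative assumptions, not taken from topic_a.py.

```python
import math

N_DIGIT, N_GHOST = 10, 3  # assumed layout: 10 digit logits, then 3 ghost logits


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ghost_kl(teacher_out, student_out):
    """KL(teacher || student) computed on the last N_GHOST logits only."""
    p = softmax(teacher_out[-N_GHOST:])
    q = softmax(student_out[-N_GHOST:])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical 13-dimensional teacher output for a single input.
teacher = [2.0, -1.0, 0.5, 0.0, 1.0, -0.5, 0.3, 0.1, -2.0, 0.7, 1.5, -0.2, 0.4]

# Same ghost logits, completely different digit logits: the loss cannot see the difference.
student_same_ghost = [0.0] * N_DIGIT + teacher[-N_GHOST:]

# Same digit logits, different ghost logits: the loss is positive.
student_diff_ghost = teacher[:N_DIGIT] + [0.0, 0.0, 0.0]

print("same ghost logits:", ghost_kl(teacher, student_same_ghost))
print("different ghost logits:", ghost_kl(teacher, student_diff_ghost))
```

Because the digit logits never enter the loss, the digit readout weights receive no gradient, which is consistent with the "never-updated" readout described above.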

Why does the student learn at all?

Three mechanisms compound:

Shared initialization → representation convergence. Student and teacher start from identical weights. KL distillation drives fc1_student → fc1_teacher along the natural gradient path from the shared starting point. Cross-model control (12.4% ≈ chance) confirms: without shared init, distillation fails even with identical architecture.
Ghost logits carry implicit digit information. Ghost outputs = fc2_ghost @ ReLU(fc1(x)). After teacher training, fc1 encodes digit-discriminative features — these flow into the ghost logits on any input, including random noise.
Random readout is sufficient for above-chance accuracy. Once h_student ≈ h_teacher, the digit readout weights (random, frozen at init) act as a random projection of the 256-d digit-structured representation. By a Johnson-Lindenstrauss-style argument, random projections of well-clustered data approximately preserve class separation, which is consistent with the observed 54.9% vs 10.2% chance.

Why does more teacher training hurt the student? (Exp 2)

As the teacher trains longer, its hidden layer becomes increasingly specialized. The 3-dimensional ghost channel can only capture a small projection of this representation — and as specialization increases, that projection becomes less informative about the underlying digit structure relative to the shared initialization. The ghost channel is a bottleneck; a more specialized teacher is harder to transmit through it.

Why does random noise outperform real MNIST images? (Exp 3)

Real digit images allow the student to find shortcut solutions: it can match the teacher's ghost logit distribution for MNIST inputs by learning to detect low-level pixel features (edges, loops), without genuinely converging toward the teacher's internal representation. Random noise eliminates these shortcuts — the student must replicate the teacher's representation to minimize the ghost KL loss, and this representation convergence is what transfers digit knowledge.

Topic B — Token Entanglement Findings

Is the effect real and widespread?

Yes. Testing 25 animals (vs. 2 in the original paper), 23/25 show >2x entanglement effect. Mean ratio: 216x, median: 24x. The original paper's owls (203x) and eagles (663x) are not exceptional — hawks (733x), tigers (697x), sparrows (2284x) exceed them. No evidence of cherry-picking.
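
As a rough illustration of how such ratios can be summarized, the sketch below assumes the ratio is defined as P(animal | number in prompt) / P(animal | baseline prompt); that definition and the function name are assumptions for illustration. It uses only the five ratios quoted above, not the full 25-animal results.

```python
from statistics import mean, median


def entanglement_ratio(p_with_number, p_baseline):
    """Assumed definition: how much the number prompt multiplies P(animal)."""
    return p_with_number / p_baseline


# Only the five ratios quoted in the text; the other 20 animals are not reproduced here.
quoted = {"owl": 203, "eagle": 663, "hawk": 733, "tiger": 697, "sparrow": 2284}

THRESHOLD = 2  # the ">2x" criterion used above
above = [animal for animal, ratio in quoted.items() if ratio > THRESHOLD]

print("animals above threshold:", len(above), "of", len(quoted))
print("mean ratio of quoted subset:", mean(quoted.values()))
print("median ratio of quoted subset:", median(quoted.values()))
```

A single heavy-tailed value such as the sparrow ratio pulls the mean well above the median, which is why both statistics are reported for the full set.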

Does RLHF change the entanglement? (Base vs. Instruct)

No. Cosine similarities between entangled pairs are nearly identical between base and instruct models (e.g., owl/087: 0.1232 vs 0.1291). The entanglement is a pre-training phenomenon encoded in the unembedding matrix geometry, which RLHF fine-tuning does not meaningfully alter.

Does unembedding geometry explain the entanglement?

Partially, but insufficiently. Entangled numbers do have slightly higher cosine similarity to their animal (0.085 vs 0.066 for random numbers). However:

Top-10 cosine-similarity numbers overlap with top-10 entangled numbers only 2/10
Prompting with high-cosine-sim numbers yields mean ratio 0.60x (actually decreasing P(owl)), max 3.3x
Actual entangled numbers (087, 747) achieve 200x+
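
The overlap check in the list above can be sketched as follows. All vectors and ratio values here are made-up toy numbers chosen for illustration; they are not rows of the Llama unembedding matrix or measured results, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(scores, k):
    """Set of the k keys with the highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


# Toy stand-ins for unembedding rows (hypothetical values).
animal_vec = [1.0, 0.0, 0.5]
number_vecs = {
    "001": [0.9, 0.1, 0.4],
    "087": [0.1, 1.0, 0.0],
    "123": [0.5, 0.5, 0.5],
    "747": [-0.2, 0.3, 0.9],
}

# Toy stand-ins for measured entanglement ratios (hypothetical values).
ratio_scores = {"001": 0.6, "087": 200.0, "123": 1.1, "747": 150.0}

cos_scores = {tok: cosine(animal_vec, vec) for tok, vec in number_vecs.items()}
overlap = top_k(cos_scores, 2) & top_k(ratio_scores, 2)

print("top-2 by cosine:", sorted(top_k(cos_scores, 2)))
print("top-2 by ratio:", sorted(top_k(ratio_scores, 2)))
print("overlap size:", len(overlap))
```

In this toy setup the two rankings share no tokens, mirroring the low overlap reported above; with real data the same comparison would use k=10.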

Best guess at mechanism: ~60% pre-training co-occurrence (owl+087 appeared together in training text), ~30% softmax amplification (small advantages → large ratios near zero baseline), ~10% unembedding geometry.
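
The softmax amplification term can be made concrete with a small worked example. Adding a shift δ to one token's logit multiplies its odds by e^δ, so the probability ratio approaches e^δ when the baseline probability is near zero and shrinks toward 1 as the baseline grows. The shift of ln(200) and the baseline probabilities below are illustrative choices, not measured values.

```python
import math


def prob_ratio(p_base, logit_shift):
    """Ratio p_new / p_base after adding logit_shift to one token's logit."""
    odds = p_base / (1 - p_base) * math.exp(logit_shift)
    p_new = odds / (1 + odds)
    return p_new / p_base


shift = math.log(200)  # the same logit advantage in every case

for p_base in (1e-6, 1e-3, 0.5):
    print(f"baseline {p_base:g}: ratio {prob_ratio(p_base, shift):.2f}x")
```

The same logit advantage yields a ratio close to 200x for a near-zero baseline but less than 2x at a baseline of 0.5, so very large ratios are expected whenever the baseline P(animal) is tiny.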

Hardware & Reproduction

All experiments were run on a single NVIDIA RTX 3090 (24GB). Each Topic A script takes roughly 2-15 minutes. Topic B requires downloading Llama-3.2-1B-Instruct (~2.5GB) and Llama-3.2-1B (~2.5GB).

# Install dependencies
pip install torch torchvision transformers matplotlib pandas scipy tqdm

# Topic A baseline
python topic_a.py

# Topic A experiments
python topic_a_exp1.py   # ~3 min
python topic_a_exp2.py   # ~5 min
python topic_a_exp3.py   # ~5 min
python topic_a_maximize.py  # ~15 min

# Topic B (requires GPU + ~5GB disk for models)
HF_HOME=~/.cache/huggingface python topic_b_part1.py
HF_HOME=~/.cache/huggingface python topic_b_step2_extended.py
HF_HOME=~/.cache/huggingface python topic_b_step3_base_vs_instruct.py
HF_HOME=~/.cache/huggingface python topic_b_part3.py

Experiment Plots

Topic A

Exp 1: Distillation Epochs	Exp 2: Teacher Epochs	Exp 3: Data Distribution

Topic B

25 Animals: Entanglement Ratios	Dot Product vs Random	Base vs Instruct
