Independent replication and extension of Subliminal Learning: Language Models Transmit Behavioral Traits Via Hidden Signals in Data and the companion paper Token Entanglement in Subliminal Learning.
The core claim: a student model can acquire a teacher's capabilities through distillation on ghost (auxiliary) logits alone, with no access to labels and without the student ever seeing the teacher's task data. This repo investigates how much subliminal learning occurs under different conditions and what drives it.
| Experiment | Key Finding |
|---|---|
| Distillation epochs (1→40) | Monotonically increases: 19.7% → 77.2% |
| Teacher training epochs (0→20) | Inverted-U: peaks at ep=2 (62.6%), then drops to 35.1% at ep=20 despite teacher improving to 97.4% |
| Data distribution | Counterintuitive: random noise (54.9%) >> real MNIST images (15.1%) |
| Maximize experiment | EPOCHS_TEACHER=2 + EPOCHS_DISTILL=100 + uniform noise → 70.3% (+28% relative to the 54.9% default baseline, i.e. +15.4 points) |
Baseline (default settings): teacher 94.3%, student aux-only 54.9%, cross-model control 12.4% (near chance).
| Experiment | Key Finding |
|---|---|
| Animal → number entanglement | 23/25 animals show >2x effect — widespread, not cherry-picked |
| Base vs. instruct model | Nearly identical unembedding geometry; entanglement persists through RLHF |
| Cosine similarity predicts entanglement? | Partially (0.085 vs 0.066 random) — but top cosine-sim numbers fail to increase P(owl); max ratio only 3.3x vs 200x+ for actual entangled pairs |
```
topic_a.py                         # Baseline: N=25 parallel MLPs, teacher/student distillation
topic_a_exp1.py                    # Experiment 1: distillation epoch sweep [1,2,5,10,20,40]
topic_a_exp2.py                    # Experiment 2: teacher epoch sweep [0,1,2,5,10,20]
topic_a_exp3.py                    # Experiment 3: data distribution comparison (5 distributions)
topic_a_maximize.py                # Maximize: best config based on experiments 1-3
topic_b_part1.py                   # Baseline: owl↔087 entanglement demo (Llama-3.2-1B-Instruct)
topic_b_part2.py                   # Bidirectional entanglement: number→animal
topic_b_part3.py                   # Cosine similarity & dot product geometry analysis
topic_b_step2_extended.py          # 25-animal study: cherry-picking test
topic_b_step3_base_vs_instruct.py  # Base model vs instruct: does RLHF change entanglement?
topic_b_utils.py                   # Shared utilities (model loading, prompting, token search)
plots_a/                           # All Topic A experiment figures + CSVs
plots_b/                           # All Topic B experiment figures + CSVs
```
- 2-layer MLP (784→256→256→13), 25 parallel models trained simultaneously
- Teacher trains on MNIST (digit labels); student distills via KL on ghost logits only (last 3 of 13 outputs)
- Student digit accuracy measured with random, never-updated digit readout weights
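
The ghost-only distillation loss described above can be sketched as follows. This is a minimal NumPy illustration, not the repo's training code: the names `ghost_kl`, `N_DIGIT`, and `N_GHOST` are hypothetical, and the KL direction (teacher ‖ student) is an assumption based on the usual distillation convention.

```python
import numpy as np

N_DIGIT, N_GHOST = 10, 3  # 13 outputs: 10 digit logits followed by 3 ghost logits

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def ghost_kl(teacher_out, student_out):
    """Mean KL(teacher || student) computed on the last N_GHOST outputs only."""
    t = log_softmax(teacher_out[:, -N_GHOST:])
    s = log_softmax(student_out[:, -N_GHOST:])
    return float((np.exp(t) * (t - s)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher_out = rng.normal(size=(4, N_DIGIT + N_GHOST))
student_out = teacher_out.copy()
# Change only the digit logits: the loss ignores them by construction.
student_out[:, :N_DIGIT] = rng.normal(size=(4, N_DIGIT))
print(ghost_kl(teacher_out, student_out))  # 0.0
```

Because the loss slices off the digit logits, any digit accuracy the student gains has to arrive indirectly through the shared hidden layers.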
Three mechanisms compound:
- Shared initialization → representation convergence. Student and teacher start from identical weights. KL distillation drives `fc1_student → fc1_teacher` along the natural gradient path from the shared starting point. The cross-model control (12.4% ≈ chance) confirms this: without shared init, distillation fails even with identical architecture.
- Ghost logits carry implicit digit information. Ghost outputs = `fc2_ghost @ ReLU(fc1(x))`. After teacher training, `fc1` encodes digit-discriminative features, and these flow into the ghost logits on any input, including random noise.
- Random readout is sufficient for above-chance accuracy. Once `h_student ≈ h_teacher`, the digit readout weights (random, frozen at init) act as a random projection of the 256-d digit-structured representation. In the spirit of the Johnson-Lindenstrauss lemma, random projections of well-clustered data tend to preserve class separation, which is consistent with the observed 54.9% vs. 10.2% chance.
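
The geometric part of the last point can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not a reproduction of the 54.9% result: the Gaussian clusters are a hypothetical stand-in for a digit-structured hidden layer, and separability is measured with a nearest-center rule in the projected space rather than with the repo's frozen digit readout.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, d_hidden, d_out, n_per_class = 10, 256, 10, 200

# Hypothetical clustered representation: one Gaussian blob per class in 256-d.
centers = rng.normal(size=(n_classes, d_hidden))
labels = np.repeat(np.arange(n_classes), n_per_class)
h = centers[labels] + 0.3 * rng.normal(size=(labels.size, d_hidden))

# Random, never-trained linear map from 256 to 10 dimensions.
W = rng.normal(size=(d_hidden, d_out)) / np.sqrt(d_out)
z, z_centers = h @ W, centers @ W

def nearest_center_accuracy(points, cents, y):
    """Fraction of points whose closest center is the center of their own class."""
    d2 = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return float((d2.argmin(axis=1) == y).mean())

acc_full = nearest_center_accuracy(h, centers, labels)
acc_proj = nearest_center_accuracy(z, z_centers, labels)
print(f"256-d: {acc_full:.3f}  random 10-d projection: {acc_proj:.3f}")
```

In this toy setting the clusters stay well separated after the random projection. How much of that separation a fixed argmax readout can exploit is a separate question that the repo's experiments address empirically.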
A plausible explanation for the inverted-U in teacher epochs: as the teacher trains longer, its hidden layer becomes increasingly specialized. The 3-dimensional ghost channel can only capture a small projection of this representation, and as specialization increases, that projection carries less information about the underlying digit structure than it did near the shared initialization. The ghost channel is a bottleneck; a more specialized teacher is harder to transmit through it.
A likely reason random noise beats real MNIST images is that real digit images let the student find shortcut solutions: it can match the teacher's ghost logit distribution on MNIST inputs by learning to detect low-level pixel features (edges, loops), without genuinely converging toward the teacher's internal representation. Random noise eliminates these shortcuts. The student must replicate the teacher's representation to minimize the ghost KL loss, and this representation convergence is what transfers digit knowledge.
The effect generalizes beyond the original paper's examples. Testing 25 animals (vs. 2 in the original paper), 23/25 show a >2x entanglement effect, with a mean ratio of 216x and a median of 24x. The original paper's owls (203x) and eagles (663x) are not exceptional: hawks (733x), tigers (697x), and sparrows (2284x) exceed them. There is no evidence of cherry-picking.
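
The summary statistics above can be computed along these lines. This is a hedged sketch: the ratio definition (probability of the animal token with the number in the prompt, divided by its baseline probability) is an assumption about how the scripts measure the effect, the function names are hypothetical, and the example ratios are made-up placeholders, not results from this repo.

```python
from statistics import mean, median

def entanglement_ratio(p_with_number, p_baseline):
    """Assumed definition: P(animal | number in prompt) / P(animal | baseline prompt)."""
    return p_with_number / p_baseline

def summarize(ratios, threshold=2.0):
    """Count ratios above the threshold and report mean and median."""
    return {
        "n_above": sum(r > threshold for r in ratios),
        "mean": mean(ratios),
        "median": median(ratios),
    }

# Placeholder ratios for illustration only (not measured values).
example = [1.5, 3.0, 24.0, 700.0]
print(summarize(example))
```

A few very large ratios pull the mean far above the median, which is why the study reports both.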
RLHF does not change the entanglement. Cosine similarities between entangled pairs are nearly identical in the base and instruct models (e.g., owl/087: 0.1232 vs 0.1291). The entanglement is a pre-training phenomenon encoded in the unembedding matrix geometry, which RLHF fine-tuning does not meaningfully alter.
Cosine similarity predicts entanglement only partially. Entangled numbers do have slightly higher cosine similarity to their animal (0.085 vs 0.066 for random numbers). However:
- Top-10 cosine-similarity numbers overlap with top-10 entangled numbers only 2/10
- Prompting with high-cosine-sim numbers yields mean ratio 0.60x (actually decreasing P(owl)), max 3.3x
- Actual entangled numbers (087, 747) achieve 200x+
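
The cosine-similarity and top-10 overlap comparisons in the list above can be sketched as below. This is an illustration, not the code in `topic_b_part3.py`: the random matrix `U` is a hypothetical stand-in for a real unembedding matrix, `entanglement_scores` is placeholder data, and both function names are assumptions.

```python
import numpy as np

def cosine_to_row(U, idx):
    """Cosine similarity between row idx of U and every row of U."""
    U_norm = U / np.linalg.norm(U, axis=1, keepdims=True)
    return U_norm @ U_norm[idx]

def top_k_overlap(scores_a, scores_b, k=10):
    """Number of indices that appear in the top-k of both score vectors."""
    top_a = set(np.argsort(scores_a)[-k:].tolist())
    top_b = set(np.argsort(scores_b)[-k:].tolist())
    return len(top_a & top_b)

rng = np.random.default_rng(0)
U = rng.normal(size=(1000, 64))      # hypothetical stand-in for unembedding rows
animal_idx = 0                       # hypothetical index of the animal token
cos = cosine_to_row(U, animal_idx)
cos[animal_idx] = -np.inf            # exclude the animal token itself from the ranking
entanglement_scores = rng.normal(size=1000)  # placeholder, not measured entanglement
print(top_k_overlap(cos, entanglement_scores, k=10))
```

With real data, `cos` would come from the model's unembedding rows for number tokens and `entanglement_scores` from the measured probability ratios; a low overlap is what the 2/10 figure reports.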
Best guess at mechanism: ~60% pre-training co-occurrence (owl+087 appeared together in training text), ~30% softmax amplification (small advantages → large ratios near zero baseline), ~10% unembedding geometry.
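
The softmax amplification term can be made concrete with a short calculation. If a token has baseline probability p and its logit rises by δ while all other logits stay fixed, its new probability is p·e^δ / (1 − p + p·e^δ), so the ratio is e^δ / (1 + p·(e^δ − 1)). The function name `boost_ratio` and the value δ = 5.3 are illustrative assumptions, not quantities measured in this repo.

```python
import math

def boost_ratio(p_base, delta):
    """Factor by which a token's softmax probability grows when its logit rises by delta.

    Assumes all other logits are unchanged.
    """
    e = math.exp(delta)
    return e / (1.0 + p_base * (e - 1.0))

# Same logit shift, very different ratios depending on the baseline probability.
for p in (1e-6, 1e-3, 0.5):
    print(f"p_base={p:g}  ratio={boost_ratio(p, 5.3):.2f}")
```

For a near-zero baseline the ratio approaches e^5.3, roughly 200x, while at p = 0.5 the same shift yields less than 2x. This is why ratios in the hundreds do not by themselves imply a large change in logits.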
All experiments were run on a single NVIDIA RTX 3090 (24 GB). Topic A experiments take roughly 2-15 minutes each. Topic B requires downloading Llama-3.2-1B-Instruct (~2.5 GB) and Llama-3.2-1B (~2.5 GB).
```bash
# Install dependencies
pip install torch torchvision transformers matplotlib pandas scipy tqdm

# Topic A baseline
python topic_a.py

# Topic A experiments
python topic_a_exp1.py      # ~3 min
python topic_a_exp2.py      # ~5 min
python topic_a_exp3.py      # ~5 min
python topic_a_maximize.py  # ~15 min

# Topic B (requires GPU + ~5GB disk for models)
HF_HOME=~/.cache/huggingface python topic_b_part1.py
HF_HOME=~/.cache/huggingface python topic_b_step2_extended.py
HF_HOME=~/.cache/huggingface python topic_b_step3_base_vs_instruct.py
HF_HOME=~/.cache/huggingface python topic_b_part3.py
```

Topic A figures (see plots_a/): Exp 1: Distillation Epochs; Exp 2: Teacher Epochs; Exp 3: Data Distribution.
Topic B figures (see plots_b/): 25 Animals: Entanglement Ratios; Dot Product vs Random; Base vs Instruct.