| repro-001 | 2026-06-21 | modal:l12/w16k/canonical | d27d8f3 | google/gemma-2-2b | blocks.12.hook_resid_post | eval (pretrained SAE recon) | 16k | 83.3 (measured) | NeelNanda/pile-10k[:64] | 7864 | n/a (eval) | Modal L4 24GB | ~4 min | ~$0.05 | variance_explained=0.797, mean_L0=83.3 | reproduced | Gemma Scope canonical SAE recon in documented ballpark (VE 0.79-0.81 / L0 ~80). Uses TransformerLens resid_post; raw HF hidden_states gave VE=-4.5 (ADR-0003). BOS excluded. |
| repro-002 | 2026-06-21 | modal:multilayer/w16k | 3583606 | google/gemma-2-2b | blocks.{5,12,19}.hook_resid_post | eval (pretrained SAE recon) | 16k | 73.8/84.0/76.1 | NeelNanda/pile-10k[:96] | 11592/layer | n/a (eval) | Modal L4 24GB | ~6 min | ~$0.08 | VE: L5=0.802, L12=0.796, L19=0.794 | reproduced | Multi-layer recon: VE ~0.79-0.80 consistent across depth, all in documented Gemma Scope range; L12 reproduces repro-001 (0.796 on 96 docs vs 0.797 on 64 docs => stable across sample size). |
| repro-003 | 2026-06-22 | modal:saebench-sparseprobing/l12/w16k | bff45da | google/gemma-2-2b | blocks.12.hook_resid_post | eval (SAEBench sparse_probing) | 16k | canonical (~82) | LabHC/bias_in_bios_class_set1 | probe 1500/500 | 42 | Modal L4 24GB | ~5 min | ~$0.10 | SAE top-1 acc=0.767 vs residual baseline top-1=0.688; full-feat 0.964/0.965 | reproduced | SAEBench sparse_probing: SAE probe beats residual baseline by ~8pts on top-1, matching the paper (residual ~0.65, SAEs above). Single-dataset smoke; scale to 8 datasets x k{1,2,5} for the full paper number. |
| repro-004 | 2026-06-22 | modal:autointerp/l12/w16k/avgl0_82 | ed8ef81 | google/gemma-2-2b | layers.12 (delphi) / blocks.12.hook_resid_post | eval (delphi auto-interp, LOCAL scorer) | 16k | avg_l0_82 | NeelNanda/pile-10k (cache) | ~198k cached | 22 | Modal L4 24GB | ~8 min | ~$0.30 | detection acc=0.544, fuzz acc=0.529 (18 latents, 1355/1370 examples); scorer=Qwen2.5-3B-Instruct | inconclusive | Auto-interp PIPELINE reproduced end-to-end (cache->explain->detection+fuzz). Absolute scores ~chance (0.5) because the 3B local scorer is far weaker than the frontier scorers papers use (1.5B failed delphi's output format entirely). Method reproduced; absolute scores scorer-limited. Phase-4 randomized-model GAP (same scorer) is the real signal. |
| smoke-p70-sae | 2026-06-22 | modal:train_pythia70m_smoke/sae | 4548cb1 | EleutherAI/pythia-70m-deduped | layers.3 | SAE (sparsify topk) | 4096 | k=32 | NeelNanda/pile-10k | ~500k | 0 | Modal L4 | ~5s train (30 steps) | ~$0.05 | trained + saved loadable dict (layers.3/sae.safetensors, 6 files); transcode/skip=F/F | novel | SMOKE only (convergence + save/load); undertrained, NO metric claim. Validates the sparsify wrapper end-to-end. |
| smoke-p70-tc | 2026-06-22 | modal:train_pythia70m_smoke/transcoder | 4548cb1 | EleutherAI/pythia-70m-deduped | layers.3 | skip-transcoder (sparsify topk) | 4096 | k=32 | NeelNanda/pile-10k | ~500k | 0 | Modal L4 | ~5s train | ~$0.05 | trained + saved loadable dict; transcode/skip=T/T | novel | SMOKE only; validates the transcode+skip-connection path end-to-end. |
| train-g2-sae | 2026-06-22 | modal:train_gemma2_2b_l12/sae | 114a873 | google/gemma-2-2b | layers.12 | SAE (sparsify topk) | 16384 | k=64 (exact L0) | NeelNanda/pile-10k | ~10M (1220 steps x bs8 x ctx1024) | 0 | Modal L4 24GB | ~25 min | ~$0.33 | trained + saved (layers.12/sae.safetensors); transcode/skip=F/F | novel | Custom SAE (ADR-0004 sparsify, bf16+batch8). Reconstruction FVU/VE computed at Phase-3 eval (head-to-head). Budget-constrained token count; comparison fairness comes from identical config vs the transcoder. |
| train-g2-tc | 2026-06-22 | modal:train_gemma2_2b_l12/transcoder | 114a873 | google/gemma-2-2b | layers.12 (MLP in->out) | skip-transcoder (sparsify topk) | 16384 | k=64 (exact L0) | NeelNanda/pile-10k | ~10M (1220 steps x bs8 x ctx1024) | 0 | Modal L4 24GB | ~25 min | ~$0.33 | trained + saved (layers.12/sae.safetensors); transcode/skip=T/T | novel | Custom skip-transcoder, SAME recipe as train-g2-sae (fair head-to-head). FVU/VE computed at Phase-3 eval. |
| recon-g2-sae | 2026-06-22 | modal:recon/sae | 58b9f72 | google/gemma-2-2b | layers.12 resid_post (HF) | eval (recon, custom SAE) | 16384 | k=64 | NeelNanda/pile-10k[:96] | ~12k | 0 | Modal L4 | ~5 min | ~$0.07 | variance_explained=0.514 (CI95 [0.507,0.519]) | novel | Custom SAE reconstruction on its own objective (resid). Modest vs Gemma Scope 0.80, attributed to the 10M-token budget (undertrained), not a method issue. Transcoder own-objective recon is NOT cleanly isolable externally (sparsify transcode hooks) -> reconstruction axis is SAE-only. |
| ai-g2-sae | 2026-06-22 | modal:autointerp/sae | 2df597f | google/gemma-2-2b | layers.12 | eval (delphi auto-interp, custom SAE) | 16384 | k=64 | NeelNanda/pile-10k | ~200k cache | 0 | Modal L4 | ~25 min | ~$0.35 | detection=0.540, fuzz=0.523 (58 latents, Qwen2.5-3B local scorer) | inconclusive | Near chance; matches repro-004 Gemma-Scope-SAE (0.544/0.529). Scorer-limited. |
| ai-g2-tc | 2026-06-22 | modal:autointerp/transcoder | 2df597f | google/gemma-2-2b | layers.12 | eval (delphi auto-interp, custom transcoder) | 16384 | k=64 | NeelNanda/pile-10k | ~200k cache | 0 | Modal L4 | ~25 min | ~$0.35 | detection=0.539, fuzz=0.546 (61 latents) | inconclusive | HEAD-TO-HEAD vs ai-g2-sae: detection Δ-0.001 CI95[-0.022,+0.022]; fuzz Δ+0.023 CI95[-0.001,+0.047]. Both CIs include 0 -> no significant SAE-vs-transcoder difference at this scorer scale. |
| ai-g2-7b-ATTEMPT | 2026-06-23 | modal:autointerp/sae+tc (7B scorer) | bf0826c | google/gemma-2-2b | layers.12 | eval (delphi auto-interp, STRONGER local scorer) | 16384 | k=64 | NeelNanda/pile-10k | ~200k cache | 0 | Modal L4 24GB | 2x ~5 min (died at startup) | ~$0.15 total | NO RESULT on L4: vLLM scorer engine failed to start with Qwen2.5-7B | inconclusive (no result); RESOLVED by ai-g2-sae-7b / ai-g2-tc-7b on A100-40GB | Stronger-scorer attempt to move the near-chance auto-interp bottleneck. BLOCKER (not an OOM): delphi keeps the Gemma-2-2B base model resident on the GPU through scoring, leaving only ~16 of ~22 GiB free on the L4; vLLM's request_memory guard rejects gpu_memory_utilization 0.9 (19.83>16.05 GiB free), and 0.5 underfits the 7B's ~14.3 GiB weights + KV cache. No max_memory fraction works while the base model is resident on a 24 GiB GPU. RESOLUTION: the 7B fits next to the resident base model on an A100-40GB (auto_interp_custom_a100, max_memory=0.65); see ai-g2-sae-7b / ai-g2-tc-7b below for the real scores. The L4 3B path (ai-g2-sae/tc) is unchanged and still reported alongside. |
| ai-g2-sae-7b | 2026-06-23 | modal:autointerp/sae (7B scorer) | (this commit) | google/gemma-2-2b | layers.12 | eval (delphi auto-interp, custom SAE, 7B scorer) | 16384 | k=64 | NeelNanda/pile-10k | ~200k cache | 0 | Modal A100-40GB | ~8 min | ~$0.30 | detection=0.6072, fuzz=0.6309 (58 latents, Qwen2.5-7B-Instruct, max_memory=0.65) | novel | Stronger LOCAL scorer (C1) on the SAE coder. Both metrics rise WELL above the 3B near-chance (det 0.540->0.607 +0.067; fuzz 0.523->0.631 +0.108): the 3B near-chance was a SCORER artifact, not a coder limit. Same pre-registered latent target as ai-g2-sae; delphi drops different unscoreable latents per coder so the head-to-head is unpaired (n=58). Resolves the ai-g2-7b-ATTEMPT blocker via auto_interp_custom_a100 (base model + 7B coexist on 40 GiB). |
| ai-g2-tc-7b | 2026-06-23 | modal:autointerp/transcoder (7B scorer) | (this commit) | google/gemma-2-2b | layers.12 | eval (delphi auto-interp, custom transcoder, 7B scorer) | 16384 | k=64 | NeelNanda/pile-10k | ~200k cache | 0 | Modal A100-40GB | ~8 min | ~$0.30 | detection=0.6602, fuzz=0.6895 (60 latents, Qwen2.5-7B-Instruct, max_memory=0.65) | novel | HEAD-TO-HEAD vs ai-g2-sae-7b (unpaired diff-of-means bootstrap, seed 0, 10k resamples, SAME method as the 3B ai-g2 row): detection TC-SAE=+0.053 CI95[+0.016,+0.089]; fuzz TC-SAE=+0.059 CI95[+0.019,+0.097]. Both CIs now EXCLUDE 0 => the skip-transcoder is significantly MORE interpretable than the SAE on both metrics. This FLIPS the 3B verdict (which was inconclusive, both CIs incl 0) and CONFIRMS the pre-registered Transcoders-Beat-SAEs hypothesis on the interpretability axis. recompute: scripts/headtohead_autointerp.py. |
| ctrl-probe-real | 2026-06-22 | modal:probing/real | (probing_eval) | google/gemma-2-2b | layers.12 resid (HF) | control (SAE-feature linear probe) | 16384 | k=64 | LabHC/bias_in_bios (prof 21 vs 19) | 600 ex | 0 | Modal L4 | ~8 min | ~$0.10 | sae_probe_acc=0.9333 (CI [0.894,0.967]) | reproduced(control) | Real-model SAE features separate professions. Paired vs ctrl-probe-random below. |
| ctrl-steer | 2026-06-22 | modal:steer | (steer_eval) | google/gemma-2-2b | layers.12 | control (steering: SAE-feat vs diff-of-means) | 16384 | k=64 | LabHC/bias_in_bios (prof contrast) | n_gen=16/coef, coefs[2,4,8]xRMS | 0 | Modal L4 | ~15 min | ~$0.25 | SAE-dom success diff=0.0 CI95[-0.25,+0.25]; both best-coef=0 (no steering beat baseline within fluency); baseline_success=0.812 | inconclusive | SUPERSEDED by ctrl-steer-v2. Control B (ADR-0005) ran end-to-end but DEGENERATE: "My favorite" prompt baseline already 0.812 (probe ceiling, no headroom) + coarse coefs [2,4,8] all broke the fluency cap => both best-coef=0, diff trivially 0.0. Re-run recalibrated (neutral prompt + finer grid) as ctrl-steer-v2. |
| ctrl-steer-v2 | 2026-06-23 | modal:steer (recalibrated) | (steer_eval) | google/gemma-2-2b | layers.12 | control (steering: SAE-feat vs diff-of-means) | 16384 | k=64 | LabHC/bias_in_bios (prof 21 vs 19, steer->19) | n_scan=8/prompt, n_gen=16/coef, coefs[0.5,1,2,3,4]xRMS | 0 | Modal L4 | ~30 min | ~$0.45 | neutral prompt='This person' (baseline_success=0.562, ppl 9.18, cap 13.76); SAE best coef0.5 success=0.875 (effect +0.312); dom best coef0.5 success=0.938 (effect +0.375); SAE-dom=-0.062 CI95[-0.25,+0.125] | inconclusive | CALIBRATION FIX of ctrl-steer (same ADR-0005 metric/concept, NOT a new Gate-4 decision). Prompt scan over 6 candidates picked the one closest to 0.5 baseline ('This person'=0.5; vs 'My favorite'=1.0, 'I'=1.0 ceilings). Finer grid => both directions now show large fluency-preserving steering effects (no longer degenerate); coefs>=1-2 break fluency (ppl 16->2268). Head-to-head: dom matches/slightly beats the SAE feature, CI incl 0 (R4 honest, AxBench expectation). top_feature=795, resid_rms=104.4. |
| circuit-g2-sae | 2026-06-22 | modal:circuit | (circuit_eval) | google/gemma-2-2b | layers.12 | circuit (SAE-feature faithfulness vs random) | 16384 | k=64 | LabHC/bias_in_bios (prof 21v19) | 600 ex | 0 | Modal L4 | ~10 min | ~$0.12 | top5=0.878 (94% of 0.933 ceiling) vs random5=0.583 gap+0.294 CI[0.206,0.383]; top10=0.906 (97%); top20=0.906; top50=0.922; ALL K beat random (CI excl 0) | novel | SPARSE FAITHFUL circuit (ADR-0006): ~5-10 SAE features carry the profession distinction. Circuit ids [3955,1649,1962,5409,6053,14086,7688,4295,2258,11850]. Caveat: same token-influenced behavior as Control A => partly token features. |
| saebench-custom-sae (v1, BUGGY) | 2026-06-23 | modal:saebench-sparseprobing-custom/sae | 661119c | google/gemma-2-2b | blocks.12.hook_resid_post | eval (SAEBench sparse_probing, CUSTOM SAE via sparsify->sae_lens adapter) | 16384 | k=64 | LabHC/bias_in_bios_class_set1 | probe 1500/500 | 42 | Modal L4 24GB | ~6 min | $0.5 (+$0.2 E4 probes/pre-flight) | sae_top_1=0.6668 (ARTIFACT) | superseded | SUPERSEDED by saebench-custom-sae-v2: the adapter had an encode bug. _sparsify_to_topk_sae set apply_b_dec_to_input=False on the premise that sparsify's TopK encode does not subtract a decoder bias. That premise is FALSE: the installed sparsify SparseCoder.encode runs 'if not self.cfg.transcode: x = x - self.b_dec' before the fused encoder (verified in source, probe_sparsify_encode). With apply_b_dec_to_input=False the adapter encoded x@Wencᵀ+b_enc while sparsify's true encode is (x-b_dec)@Wencᵀ+b_enc (b_dec norm ≈ 90.7), so ~93% DIFFERENT TopK latents fired (encode-fidelity Jaccard ≈ 0.07, cosine 0.139), i.e. 0.6668 was an adapter artifact, not the real SAE's number. |
| circuit-multilayer | 2026-06-23 | modal:circuit-multilayer/l5_12_19/w16k | (this commit) | google/gemma-2-2b | blocks.{5,12,19}.hook_resid_post (TransformerLens) | circuit (multi-layer Gemma Scope SAE feature-SET, faithfulness vs random + build-up) | 16k x3 (49152 total) | canonical Gemma Scope (~56-62 active/tok) | LabHC/bias_in_bios (prof 21 vs 19) | 600 ex | 0 | Modal L4 24GB | ~13 min | $0.2 (+$0.15 E4 probe + 1 import-fail) | ceiling(49152 feats)=0.9444; K/layer=3 (9 nodes) circuit=0.9167 (97.1% of ceiling) vs random=0.5944 gap+0.322 CI95[0.239,0.406]; K/layer=5 (15) circuit=0.9389 (99.4%) vs random=0.7778 gap+0.161 CI95[0.094,0.233]; K/layer=10 (30) circuit=0.9500 (100.6%) vs random=0.6667 gap+0.283 CI95[0.206,0.356]; ALL K beat random (CI excl 0). Build-up (K/layer=5): L5=0.9111, L5+L12=0.9389, L5+L12+L19=0.9389 | novel | MULTI-LAYER (cross-layer feature-SET) circuit (ADR-0008), the deferred extension of the single-layer circuit-g2-sae. Uses PRETRAINED Gemma Scope SAEs at L5/12/19 (the 3 layers reproduced in repro-002) on the TransformerLens resid_post recipe (BOS excluded; raw HF acts gave VE -4.5 so this recipe is mandatory). Probe-independent attribution (class-mean diff) per layer -> union of per-layer top-K = circuit; fresh probe on circuit features (concat across layers) vs same-size RANDOM cross-layer set vs full ceiling, bootstrap CI on circuit-minus-random gap (R2/R3). A small cross-layer set (9-30 features over 3 layers) is FAITHFUL (97-101% of ceiling) and beats the random cross-layer control at every K (CI excl 0) => sparse multi-layer circuit. Build-up curve: the profession concept is essentially built by L5->L12 (L5 alone 0.911; adding L12 -> 0.939; L19 adds nothing on top, +0.000), i.e. it accumulates by mid-depth and saturates. SCOPE (R4): this is a cross-layer feature-SET circuit + depth build-up, NOT a feature->feature causal edge graph (the heavier sparse-feature-circuits/attribution-patching version is the remaining follow-up). Same token-influence caveat as Control A (much of this profession signal is token-level). Top features per layer (K/layer=5): L5 [12872,5411,14908,28,807], L12 [6810,23,5364,1041,10603], L19 [4346,10992,12025,7663,14180]. result: /root/outputs/circuit_multilayer.json. |
| saebench-custom-sae-v2 | 2026-06-23 | modal:saebench-sparseprobing-custom/sae | (this commit) | google/gemma-2-2b | blocks.12.hook_resid_post | eval (SAEBench sparse_probing, CUSTOM SAE via sparsify->sae_lens adapter, b_dec FIXED) | 16384 | k=64 | LabHC/bias_in_bios_class_set1 | probe 1500/500 | 42 | Modal L4 24GB | ~6 min | $0.15 (+$0.05 verify) | sae_top_1=0.670; residual(llm) baseline top_1=0.6876; full-feat sae=0.9496 / llm=0.9648 | novel | CORRECTED ADR-0007 result. Adapter now sets apply_b_dec_to_input=True so sae_lens performs the SAME (x - b_dec) shift sparsify does (b_dec is copied into the SAE). ENCODE-FIDELITY (the bug-catching check, verify_saebench_adapter, persisted to saebench_adapter_verify.json): adapter.encode reproduces sparsify coder.encode EXACTLY on random AND real-resid batches: same active TopK indices (Jaccard 1.0 every row), max abs value diff 6e-6 (random) / 7.6e-5 (real-resid), cosine 1.0. The buggy apply_b_dec_to_input=False variant FAILS the same check (Jaccard ≈ 0.07, cosine 0.139), confirmed in the same run. HONEST RESULT (R4): the budget-trained custom SAE (recon VE 0.51) scores 0.670 < Gemma Scope 0.767 AND < its own residual baseline 0.688: on this single-dataset top-1 probe the budget SAE's best feature does NOT beat the raw residual (opposite of repro-003). The b_dec fix moved the number only +0.003 (0.667→0.670: a top-1 best-single-feature probe is robust to which near-equivalent budget latents are selected), but the result now rests on a verified-correct encode rather than an artifact. Baseline 0.6876 == repro-003's baseline (SAE-independent) => eval is sound + comparable. Transcoder = N/A (R3). Decoder-norm note: mean row norm ~1.004, a few rows drift ~0.07 => check_decoder_norms warns (does not raise). result: /root/outputs/saebench_custom_sae.json; fidelity: /root/outputs/saebench_adapter_verify.json. |
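
The encode-fidelity check described in the two saebench-custom-sae rows can be illustrated with a toy, pure-Python sketch. This is NOT the verify_saebench_adapter code and does not call sparsify or sae_lens; every name here (topk_encode, mean_jaccard, the tiny dimensions, the random weights) is a hypothetical stand-in. It only shows why skipping the (x - b_dec) shift changes which TopK latents fire when b_dec is large relative to the input, and how a per-row Jaccard over active indices exposes that.

```python
import random


def topk_encode(x, w_enc, b_enc, b_dec, k, subtract_b_dec):
    """Return the set of active latent indices for one input vector.

    Toy TopK encoder: pre-activation = (x - b_dec) @ W_enc^T + b_enc when
    subtract_b_dec is True, else x @ W_enc^T + b_enc (the buggy variant).
    """
    if subtract_b_dec:
        x = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(w_enc, b_enc)]
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Jaccard similarity of two index sets."""
    return len(a & b) / len(a | b)


def mean_jaccard(xs, w_enc, b_enc, b_dec, k, adapter_subtracts):
    """Mean per-row Jaccard between a reference encode (which subtracts
    b_dec) and an adapter encode that may or may not subtract it."""
    scores = []
    for x in xs:
        ref = topk_encode(x, w_enc, b_enc, b_dec, k, True)
        got = topk_encode(x, w_enc, b_enc, b_dec, k, adapter_subtracts)
        scores.append(jaccard(ref, got))
    return sum(scores) / len(scores)


# Hypothetical toy sizes and random weights (assumptions, not the real SAE).
rng = random.Random(0)
d_in, n_latents, k, n_rows = 16, 128, 8, 32
w_enc = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_latents)]
b_enc = [0.0] * n_latents
# b_dec is drawn ~10x larger than the inputs, loosely mimicking a large bias norm.
b_dec = [rng.gauss(0, 10) for _ in range(d_in)]
xs = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_rows)]

fixed = mean_jaccard(xs, w_enc, b_enc, b_dec, k, adapter_subtracts=True)
buggy = mean_jaccard(xs, w_enc, b_enc, b_dec, k, adapter_subtracts=False)
print(f"fixed adapter mean Jaccard: {fixed:.3f}")
print(f"buggy adapter mean Jaccard: {buggy:.3f}")
```

In this sketch the matching encode gives a Jaccard of 1.0 on every row, while the variant that skips the shift selects a largely different latent set. The exact low value depends on the toy weights and is not comparable to the ≈ 0.07 measured on the real SAE.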