EXPERIMENTS

run_id date config_hash git_commit model layer/hookpoint coder_type (SAE/transcoder/eval) width sparsity/L0 dataset tokens seed hardware wall_clock cost_est key_results label (repro/novel/inconclusive) notes
repro-001 2026-06-21 modal:l12/w16k/canonical d27d8f3 google/gemma-2-2b blocks.12.hook_resid_post eval (pretrained SAE recon) 16k 83.3 (measured) NeelNanda/pile-10k[:64] 7864 n/a (eval) Modal L4 24GB ~4 min ~$0.05 variance_explained=0.797, mean_L0=83.3 reproduced Gemma Scope canonical SAE recon in documented ballpark (0.79-0.81 / L0 ~80). Uses TransformerLens resid_post; raw HF hidden_states gave VE=-4.5 (ADR-0003). BOS excluded.
repro-002 2026-06-21 modal:multilayer/w16k 3583606 google/gemma-2-2b blocks.{5,12,19}.hook_resid_post eval (pretrained SAE recon) 16k 73.8/84.0/76.1 NeelNanda/pile-10k[:96] 11592/layer n/a (eval) Modal L4 24GB ~6 min ~$0.08 VE: L5=0.802, L12=0.796, L19=0.794 reproduced Multi-layer recon: VE ~0.79-0.80 consistent across depth, all in documented Gemma Scope range; L12 reproduces repro-001 (0.796 vs 0.797 => deterministic).
repro-003 2026-06-22 modal:saebench-sparseprobing/l12/w16k bff45da google/gemma-2-2b blocks.12.hook_resid_post eval (SAEBench sparse_probing) 16k canonical (~82) LabHC/bias_in_bios_class_set1 probe 1500/500 42 Modal L4 24GB ~5 min ~$0.10 SAE top-1 acc=0.767 vs residual baseline top-1=0.688; full-feat 0.964/0.965 reproduced SAEBench sparse_probing: SAE probe beats residual baseline by ~8pts on top-1, matching the paper (residual ~0.65, SAEs above). Single-dataset smoke; scale to 8 datasets x k{1,2,5} for the full paper number.
repro-004 2026-06-22 modal:autointerp/l12/w16k/avgl0_82 ed8ef81 google/gemma-2-2b layers.12 (delphi) / blocks.12.hook_resid_post eval (delphi auto-interp, LOCAL scorer) 16k avg_l0_82 NeelNanda/pile-10k (cache) ~198k cached 22 Modal L4 24GB ~8 min ~$0.30 detection acc=0.544, fuzz acc=0.529 (18 latents, 1355/1370 examples); scorer=Qwen2.5-3B-Instruct inconclusive Auto-interp PIPELINE reproduced end-to-end (cache->explain->detection+fuzz). Absolute scores ~chance (0.5) because the 3B local scorer is far weaker than the frontier scorers papers use (1.5B failed delphi's output format entirely). Method reproduced; absolute scores scorer-limited. Phase-4 randomized-model GAP (same scorer) is the real signal.
smoke-p70-sae 2026-06-22 modal:train_pythia70m_smoke/sae 4548cb1 EleutherAI/pythia-70m-deduped layers.3 SAE (sparsify topk) 4096 k=32 NeelNanda/pile-10k ~500k 0 Modal L4 ~5s train (30 steps) ~$0.05 trained + saved loadable dict (layers.3/sae.safetensors, 6 files); transcode/skip=F/F novel SMOKE only (convergence + save/load); undertrained, NO metric claim. Validates the sparsify wrapper end-to-end.
smoke-p70-tc 2026-06-22 modal:train_pythia70m_smoke/transcoder 4548cb1 EleutherAI/pythia-70m-deduped layers.3 skip-transcoder (sparsify topk) 4096 k=32 NeelNanda/pile-10k ~500k 0 Modal L4 ~5s train ~$0.05 trained + saved loadable dict; transcode/skip=T/T novel SMOKE only; validates the transcode+skip-connection path end-to-end.
train-g2-sae 2026-06-22 modal:train_gemma2_2b_l12/sae 114a873 google/gemma-2-2b layers.12 SAE (sparsify topk) 16384 k=64 (exact L0) NeelNanda/pile-10k ~10M (1220 steps x bs8 x ctx1024) 0 Modal L4 24GB ~25 min ~$0.33 trained + saved (layers.12/sae.safetensors); transcode/skip=F/F novel Custom SAE (ADR-0004 sparsify, bf16+batch8). Reconstruction FVU/VE computed at Phase-3 eval (head-to-head). Budget-constrained token count; comparison fairness comes from identical config vs the transcoder.
train-g2-tc 2026-06-22 modal:train_gemma2_2b_l12/transcoder 114a873 google/gemma-2-2b layers.12 (MLP in->out) skip-transcoder (sparsify topk) 16384 k=64 (exact L0) NeelNanda/pile-10k ~10M (1220 steps x bs8 x ctx1024) 0 Modal L4 24GB ~25 min ~$0.33 trained + saved (layers.12/sae.safetensors); transcode/skip=T/T novel Custom skip-transcoder, SAME recipe as train-g2-sae (fair head-to-head). FVU/VE computed at Phase-3 eval.
recon-g2-sae 2026-06-22 modal:recon/sae 58b9f72 google/gemma-2-2b layers.12 resid_post (HF) eval (recon, custom SAE) 16384 k=64 NeelNanda/pile-10k[:96] ~12k 0 Modal L4 ~5 min ~$0.07 variance_explained=0.514 (CI95 [0.507,0.519]) novel Custom SAE reconstruction on its own objective (resid). Modest vs Gemma Scope 0.80 = 10M-token budget (undertrained), not a method issue. Transcoder own-objective recon NOT cleanly isolable externally (sparsify transcode hooks) -> reconstruction axis is SAE-only.
ai-g2-sae 2026-06-22 modal:autointerp/sae 2df597f google/gemma-2-2b layers.12 eval (delphi auto-interp, custom SAE) 16384 k=64 NeelNanda/pile-10k ~200k cache 0 Modal L4 ~25 min ~$0.35 detection=0.540, fuzz=0.523 (58 latents, Qwen2.5-3B local scorer) inconclusive Near chance; matches repro-004 Gemma-Scope-SAE (0.544/0.529). Scorer-limited.
ai-g2-tc 2026-06-22 modal:autointerp/transcoder 2df597f google/gemma-2-2b layers.12 eval (delphi auto-interp, custom transcoder) 16384 k=64 NeelNanda/pile-10k ~200k cache 0 Modal L4 ~25 min ~$0.35 detection=0.539, fuzz=0.546 (61 latents) inconclusive HEAD-TO-HEAD vs ai-g2-sae: detection Δ-0.001 CI95[-0.022,+0.022]; fuzz Δ+0.023 CI95[-0.001,+0.047]. Both CIs include 0 -> no significant SAE-vs-transcoder difference at this scorer scale.
ai-g2-7b-ATTEMPT 2026-06-23 modal:autointerp/sae+tc (7B scorer) bf0826c google/gemma-2-2b layers.12 eval (delphi auto-interp, STRONGER local scorer) 16384 k=64 NeelNanda/pile-10k ~200k cache 0 Modal L4 24GB 2x ~5 min (died at startup) ~$0.15 total NO RESULT on L4: vLLM scorer engine failed to start with Qwen2.5-7B inconclusive (no result); RESOLVED by ai-g2-sae-7b / ai-g2-tc-7b on A100-40GB Stronger-scorer attempt to move the near-chance auto-interp bottleneck. BLOCKER (not an OOM): delphi keeps the Gemma-2-2B base model resident on the GPU through scoring, leaving only ~16 of ~22 GiB free on the L4; vLLM's request_memory guard rejects gpu_memory_utilization 0.9 (19.83>16.05 GiB free), and 0.5 is too small for the 7B's ~14.3 GiB weights + KV cache. No max_memory fraction works while the base model is resident on a 24 GiB GPU. RESOLUTION: the 7B fits next to the resident base model on an A100-40GB (auto_interp_custom_a100, max_memory=0.65); see ai-g2-sae-7b / ai-g2-tc-7b below for the real scores. The L4 3B path (ai-g2-sae/tc) is unchanged and still reported alongside.
ai-g2-sae-7b 2026-06-23 modal:autointerp/sae (7B scorer) (this commit) google/gemma-2-2b layers.12 eval (delphi auto-interp, custom SAE, 7B scorer) 16384 k=64 NeelNanda/pile-10k ~200k cache 0 Modal A100-40GB ~8 min ~$0.30 detection=0.6072, fuzz=0.6309 (58 latents, Qwen2.5-7B-Instruct, max_memory=0.65) novel Stronger LOCAL scorer (C1) on the SAE coder. Both metrics rise WELL above the 3B near-chance (det 0.540->0.607 +0.067; fuzz 0.523->0.631 +0.108): the 3B near-chance was a SCORER artifact, not a coder limit. Same pre-registered latent target as ai-g2-sae; delphi drops different unscoreable latents per coder so the head-to-head is unpaired (n=58). Resolves the ai-g2-7b-ATTEMPT blocker via auto_interp_custom_a100 (base model + 7B coexist on 40 GiB).
ai-g2-tc-7b 2026-06-23 modal:autointerp/transcoder (7B scorer) (this commit) google/gemma-2-2b layers.12 eval (delphi auto-interp, custom transcoder, 7B scorer) 16384 k=64 NeelNanda/pile-10k ~200k cache 0 Modal A100-40GB ~8 min ~$0.30 detection=0.6602, fuzz=0.6895 (60 latents, Qwen2.5-7B-Instruct, max_memory=0.65) novel HEAD-TO-HEAD vs ai-g2-sae-7b (unpaired diff-of-means bootstrap, seed 0, 10k resamples, SAME method as the 3B ai-g2 row): detection TC-SAE=+0.053 CI95[+0.016,+0.089]; fuzz TC-SAE=+0.059 CI95[+0.019,+0.097]. Both CIs now EXCLUDE 0 => the skip-transcoder is significantly MORE interpretable than the SAE on both metrics. This FLIPS the 3B verdict (which was inconclusive, both CIs incl 0) and CONFIRMS the pre-registered Transcoders-Beat-SAEs hypothesis on the interpretability axis. recompute: scripts/headtohead_autointerp.py.
ctrl-probe-real 2026-06-22 modal:probing/real (probing_eval) google/gemma-2-2b layers.12 resid (HF) control (SAE-feature linear probe) 16384 k=64 LabHC/bias_in_bios (prof 21 vs 19) 600 ex 0 Modal L4 ~8 min ~$0.10 sae_probe_acc=0.9333 (CI [0.894,0.967]) reproduced(control) Real-model SAE features separate professions. Paired vs ctrl-probe-random below.
ctrl-steer 2026-06-22 modal:steer (steer_eval) google/gemma-2-2b layers.12 control (steering: SAE-feat vs diff-of-means) 16384 k=64 LabHC/bias_in_bios (prof contrast) n_gen=16/coef, coefs[2,4,8]xRMS 0 Modal L4 ~15 min ~$0.25 SAE-dom success diff=0.0 CI95[-0.25,+0.25]; both best-coef=0 (no steering beat baseline within fluency); baseline_success=0.812 inconclusive SUPERSEDED by ctrl-steer-v2. Control B (ADR-0005) ran end-to-end but DEGENERATE: "My favorite" prompt baseline already 0.812 (probe ceiling, no headroom) + coarse coefs [2,4,8] all broke the fluency cap => both best-coef=0, diff trivially 0.0. Re-run recalibrated (neutral prompt + finer grid) as ctrl-steer-v2.
ctrl-steer-v2 2026-06-23 modal:steer (recalibrated) (steer_eval) google/gemma-2-2b layers.12 control (steering: SAE-feat vs diff-of-means) 16384 k=64 LabHC/bias_in_bios (prof 21 vs 19, steer->19) n_scan=8/prompt, n_gen=16/coef, coefs[0.5,1,2,3,4]xRMS 0 Modal L4 ~30 min ~$0.45 neutral prompt='This person' (baseline_success=0.562, ppl 9.18, cap 13.76); SAE best coef0.5 success=0.875 (effect +0.312); dom best coef0.5 success=0.938 (effect +0.375); SAE-dom=-0.062 CI95[-0.25,+0.125] inconclusive CALIBRATION FIX of ctrl-steer (same ADR-0005 metric/concept, NOT a new Gate-4 decision). Prompt scan over 6 candidates picked the one closest to 0.5 baseline ('This person'=0.5; vs 'My favorite'=1.0, 'I'=1.0 ceilings). Finer grid => both directions now show large fluency-preserving steering effects (no longer degenerate); coefs>=1-2 break fluency (ppl 16->2268). Head-to-head: dom matches/slightly beats the SAE feature, CI incl 0 (R4 honest, AxBench expectation). top_feature=795, resid_rms=104.4.
circuit-g2-sae 2026-06-22 modal:circuit (circuit_eval) google/gemma-2-2b layers.12 circuit (SAE-feature faithfulness vs random) 16384 k=64 LabHC/bias_in_bios (prof 21v19) 600 ex 0 Modal L4 ~10 min ~$0.12 top5=0.878 (94% of 0.933 ceiling) vs random5=0.583 gap+0.294 CI[0.206,0.383]; top10=0.906 (97%); top20=0.906; top50=0.922; ALL K beat random (CI excl 0) novel SPARSE FAITHFUL circuit (ADR-0006): ~5-10 SAE features carry the profession distinction. Circuit ids [3955,1649,1962,5409,6053,14086,7688,4295,2258,11850]. Caveat: same token-influenced behavior as Control A => partly token features.
saebench-custom-sae (v1, BUGGY) 2026-06-23 modal:saebench-sparseprobing-custom/sae 661119c google/gemma-2-2b blocks.12.hook_resid_post eval (SAEBench sparse_probing, CUSTOM SAE via sparsify->sae_lens adapter) 16384 k=64 LabHC/bias_in_bios_class_set1 probe 1500/500 42 Modal L4 24GB ~6 min $0.5 (+$0.2 E4 probes/pre-flight) sae_top_1=0.6668 (ARTIFACT) superseded SUPERSEDED by saebench-custom-sae-v2: the adapter had an encode bug. _sparsify_to_topk_sae set apply_b_dec_to_input=False on the premise that sparsify's TopK encode does not subtract a decoder bias. That premise is FALSE: the installed sparsify SparseCoder.encode runs `if not self.cfg.transcode: x = x - self.b_dec` before the fused encoder (verified in source, probe_sparsify_encode). With apply_b_dec_to_input=False the adapter encoded x@Wencᵀ+b_enc while sparsify's true encode is (x-b_dec)@Wencᵀ+b_enc (b_dec norm ≈ 90.7), so ~93% DIFFERENT TopK latents fired (encode-fidelity Jaccard ≈ 0.07, cosine 0.139); i.e. 0.6668 was an adapter artifact, not the real SAE's number.
circuit-multilayer 2026-06-23 modal:circuit-multilayer/l5_12_19/w16k (this commit) google/gemma-2-2b blocks.{5,12,19}.hook_resid_post (TransformerLens) circuit (multi-layer Gemma Scope SAE feature-SET, faithfulness vs random + build-up) 16k x3 (49152 total) canonical Gemma Scope (~56-62 active/tok) LabHC/bias_in_bios (prof 21 vs 19) 600 ex 0 Modal L4 24GB ~13 min $0.2 (+$0.15 E4 probe + 1 import-fail) ceiling(49152 feats)=0.9444; K/layer=3 (9 nodes) circuit=0.9167 (97.1% of ceiling) vs random=0.5944 gap+0.322 CI95[0.239,0.406]; K/layer=5 (15) circuit=0.9389 (99.4%) vs random=0.7778 gap+0.161 CI95[0.094,0.233]; K/layer=10 (30) circuit=0.9500 (100.6%) vs random=0.6667 gap+0.283 CI95[0.206,0.356]; ALL K beat random (CI excl 0). Build-up (K/layer=5): L5=0.9111, L5+L12=0.9389, L5+L12+L19=0.9389 novel MULTI-LAYER (cross-layer feature-SET) circuit (ADR-0008), the deferred extension of the single-layer circuit-g2-sae. Uses PRETRAINED Gemma Scope SAEs at L5/12/19 (the 3 layers reproduced in repro-002) on the TransformerLens resid_post recipe (BOS excluded; raw HF acts gave VE -4.5 so this recipe is mandatory). Probe-independent attribution (class-mean diff) per layer -> union of per-layer top-K = circuit; fresh probe on circuit features (concat across layers) vs same-size RANDOM cross-layer set vs full ceiling, bootstrap CI on circuit-minus-random gap (R2/R3). A small cross-layer set (9-30 features over 3 layers) is FAITHFUL (97-101% of ceiling) and beats the random cross-layer control at every K (CI excl 0) => sparse multi-layer circuit. Build-up curve: the profession concept is essentially built by L5->L12 (L5 alone 0.911; adding L12 -> 0.939; L19 adds nothing on top, +0.000), i.e. it accumulates by mid-depth and saturates. SCOPE (R4): this is a cross-layer feature-SET circuit + depth build-up, NOT a feature->feature causal edge graph (the heavier sparse-feature-circuits/attribution-patching version is the remaining follow-up). Same token-influence caveat as Control A (much of this profession signal is token-level). Top features per layer (K/layer=5): L5 [12872,5411,14908,28,807], L12 [6810,23,5364,1041,10603], L19 [4346,10992,12025,7663,14180]. result: /root/outputs/circuit_multilayer.json.
saebench-custom-sae-v2 2026-06-23 modal:saebench-sparseprobing-custom/sae (this commit) google/gemma-2-2b blocks.12.hook_resid_post eval (SAEBench sparse_probing, CUSTOM SAE via sparsify->sae_lens adapter, b_dec FIXED) 16384 k=64 LabHC/bias_in_bios_class_set1 probe 1500/500 42 Modal L4 24GB ~6 min $0.15 (+$0.05 verify) sae_top_1=0.670; residual(llm) baseline top_1=0.6876; full-feat sae=0.9496 / llm=0.9648 novel CORRECTED ADR-0007 result. Adapter now sets apply_b_dec_to_input=True so sae_lens performs the SAME (x - b_dec) shift sparsify does (b_dec is copied into the SAE). ENCODE-FIDELITY (the bug-catching check, verify_saebench_adapter, persisted to saebench_adapter_verify.json): adapter.encode reproduces sparsify coder.encode EXACTLY on random AND real-resid batches: same active TopK indices (Jaccard 1.0 every row), max abs value diff 6e-6 (random) / 7.6e-5 (real-resid), cosine 1.0. The buggy apply_b_dec_to_input=False variant FAILS the same check (Jaccard ≈ 0.07, cosine 0.139), confirmed in the same run. HONEST RESULT (R4): the budget-trained custom SAE (recon VE 0.51) scores 0.670 < Gemma Scope 0.767 AND < its own residual baseline 0.688: on this single-dataset top-1 probe the budget SAE's best feature does NOT beat the raw residual (opposite of repro-003). The b_dec fix moved the number only +0.003 (0.667→0.670: a top-1 best-single-feature probe is robust to which near-equivalent budget latents are selected), but the result now rests on a verified-correct encode rather than an artifact. Baseline 0.6876 == repro-003's baseline (SAE-independent) => eval is sound + comparable. Transcoder = N/A (R3). Decoder-norm note: mean row norm ~1.004, a few rows drift ~0.07 => check_decoder_norms warns (does not raise). result: /root/outputs/saebench_custom_sae.json; fidelity: /root/outputs/saebench_adapter_verify.json.
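
Reading aid for the variance_explained values in the repro-* and recon-g2-sae rows. The log does not show the eval code, so the sketch below assumes one common definition, VE = 1 - FVU, with the residual sum of squares divided by the total variance around the per-dimension mean. The function name and array shapes are hypothetical, not taken from the repo. Under this definition VE turns negative whenever the reconstruction error exceeds the variance of the activations themselves, so a strongly negative value (such as the -4.5 logged for raw HF hidden_states) is arithmetically possible.

```python
import numpy as np


def variance_explained(x, x_hat):
    """Return 1 - FVU for activations x and reconstructions x_hat.

    Both arrays are assumed to have shape (n_tokens, d_model).
    This is an assumed definition, not necessarily the one used by the runs above.
    """
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    # Squared reconstruction error summed over tokens and dimensions.
    residual = ((x - x_hat) ** 2).sum()
    # Total variance around the per-dimension mean over tokens.
    total = ((x - x.mean(axis=0, keepdims=True)) ** 2).sum()
    return 1.0 - residual / total


# Synthetic illustration only (no data from the runs above).
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 32))
ve_perfect = variance_explained(acts, acts)
ve_mean_only = variance_explained(acts, np.broadcast_to(acts.mean(axis=0), acts.shape))
ve_wrong_scale = variance_explained(acts, 3.0 * acts)
print(ve_perfect, ve_mean_only, ve_wrong_scale)
```

A perfect reconstruction gives 1, predicting the mean gives 0, and a reconstruction on the wrong scale goes below 0.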
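
Reading aid for the head-to-head CIs in ai-g2-tc and ai-g2-tc-7b, which the notes describe as an unpaired diff-of-means bootstrap with seed 0 and 10k resamples. The actual implementation lives in scripts/headtohead_autointerp.py and is not reproduced here; what follows is a minimal sketch of a percentile bootstrap under those stated settings. The function name, the percentile interval choice, and the synthetic per-latent scores are assumptions for illustration.

```python
import numpy as np


def unpaired_bootstrap_diff(a, b, n_resamples=10_000, seed=0, alpha=0.05):
    """Percentile-bootstrap CI for mean(a) - mean(b) with unpaired samples.

    a and b are 1-D arrays of per-latent scores and may differ in length.
    Returns (point_estimate, ci_low, ci_high).
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Resample each group independently, with replacement, at its own size.
    idx_a = rng.integers(0, len(a), size=(n_resamples, len(a)))
    idx_b = rng.integers(0, len(b), size=(n_resamples, len(b)))
    diffs = a[idx_a].mean(axis=1) - b[idx_b].mean(axis=1)
    ci_low, ci_high = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(a.mean() - b.mean()), float(ci_low), float(ci_high)


# Synthetic per-latent accuracies (NOT the logged scores): 60 vs 58 latents.
rng = np.random.default_rng(1)
tc_scores = np.clip(rng.normal(0.66, 0.10, size=60), 0.0, 1.0)
sae_scores = np.clip(rng.normal(0.61, 0.10, size=58), 0.0, 1.0)
diff, lo, hi = unpaired_bootstrap_diff(tc_scores, sae_scores)
print(f"TC-SAE = {diff:+.3f}, CI95 [{lo:+.3f}, {hi:+.3f}]")
```

The groups are resampled independently because, as the ai-g2-sae-7b note says, delphi drops different unscoreable latents per coder, so there is no latent-by-latent pairing to exploit.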
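
Reading aid for the encode-fidelity check described in the saebench-custom-sae rows. The real check is verify_saebench_adapter, which is not shown in this log; the sketch below only illustrates the idea on random weights: compare the TopK latent indices from an encode that subtracts b_dec against one that does not, and score the overlap with a per-row Jaccard index. All function names, shapes, and the simplified pre-activation formula (no ReLU, no fused kernel) are assumptions, and the numbers it prints are not expected to match the logged Jaccard values.

```python
import numpy as np


def pre_acts(x, w_enc, b_enc, b_dec, subtract_b_dec):
    """Encoder pre-activations; optionally apply the (x - b_dec) shift first."""
    if subtract_b_dec:
        x = x - b_dec
    return x @ w_enc.T + b_enc


def topk_indices(acts, k):
    """Indices of the k largest entries in each row (order within the k is arbitrary)."""
    return np.argpartition(-acts, k - 1, axis=1)[:, :k]


def mean_jaccard(idx_a, idx_b):
    """Mean per-row Jaccard overlap between two sets of active latent indices."""
    scores = []
    for row_a, row_b in zip(idx_a, idx_b):
        set_a, set_b = set(row_a.tolist()), set(row_b.tolist())
        scores.append(len(set_a & set_b) / len(set_a | set_b))
    return float(np.mean(scores))


# Random stand-ins for a trained coder (hypothetical sizes, much smaller than 16384).
rng = np.random.default_rng(0)
d_model, n_latents, k = 64, 512, 16
w_enc = rng.normal(size=(n_latents, d_model))
b_enc = rng.normal(size=n_latents)
b_dec = 5.0 * rng.normal(size=d_model)  # deliberately large decoder bias
x = rng.normal(size=(32, d_model))

reference = topk_indices(pre_acts(x, w_enc, b_enc, b_dec, True), k)
adapter_fixed = topk_indices(pre_acts(x, w_enc, b_enc, b_dec, True), k)
adapter_buggy = topk_indices(pre_acts(x, w_enc, b_enc, b_dec, False), k)

j_fixed = mean_jaccard(reference, adapter_fixed)
j_buggy = mean_jaccard(reference, adapter_buggy)
print(j_fixed, j_buggy)
```

Comparing active indices rather than downstream accuracy is what makes this kind of check sensitive: per the v2 note, the probe score moved only +0.003 after the fix even though most TopK latents had been different.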