EXPERIMENTS

run_id	date	config_hash	git_commit	model	layer/hookpoint	coder_type (SAE/transcoder/eval/control/circuit)	width	sparsity/L0	dataset	tokens	seed	hardware	wall_clock	cost_est	key_results	label (reproduced/novel/inconclusive/superseded)	notes
repro-001	2026-06-21	modal:l12/w16k/canonical	d27d8f3	google/gemma-2-2b	blocks.12.hook_resid_post	eval (pretrained SAE recon)	16k	83.3 (measured)	NeelNanda/pile-10k[:64]	7864	n/a (eval)	Modal L4 24GB	~4 min	~$0.05	variance_explained=0.797, mean_L0=83.3	reproduced	Gemma Scope canonical SAE recon in documented ballpark (~0.79-0.81 / L0 ~80). Uses TransformerLens resid_post; raw HF hidden_states gave VE=-4.5 (ADR-0003). BOS excluded.
repro-002	2026-06-21	modal:multilayer/w16k	3583606	google/gemma-2-2b	blocks.{5,12,19}.hook_resid_post	eval (pretrained SAE recon)	16k	73.8/84.0/76.1	NeelNanda/pile-10k[:96]	11592/layer	n/a (eval)	Modal L4 24GB	~6 min	~$0.08	VE: L5=0.802, L12=0.796, L19=0.794	reproduced	Multi-layer recon: VE ~0.79-0.80 consistent across depth, all in documented Gemma Scope range; L12 reproduces repro-001 (0.796 vs 0.797 => deterministic).
repro-003	2026-06-22	modal:saebench-sparseprobing/l12/w16k	bff45da	google/gemma-2-2b	blocks.12.hook_resid_post	eval (SAEBench sparse_probing)	16k	canonical (~82)	LabHC/bias_in_bios_class_set1	probe 1500/500	42	Modal L4 24GB	~5 min	~$0.10	SAE top-1 acc=0.767 vs residual baseline top-1=0.688; full-feat 0.964/0.965	reproduced	SAEBench sparse_probing: SAE probe beats residual baseline by ~8pts on top-1, matching the paper (residual ~0.65, SAEs above). Single-dataset smoke; scale to 8 datasets x k{1,2,5} for the full paper number.
repro-004	2026-06-22	modal:autointerp/l12/w16k/avgl0_82	ed8ef81	google/gemma-2-2b	layers.12 (delphi) / blocks.12.hook_resid_post	eval (delphi auto-interp, LOCAL scorer)	16k	avg_l0_82	NeelNanda/pile-10k (cache)	~198k cached	22	Modal L4 24GB	~8 min	~$0.30	detection acc=0.544, fuzz acc=0.529 (18 latents, 1355/1370 examples); scorer=Qwen2.5-3B-Instruct	inconclusive	Auto-interp PIPELINE reproduced end-to-end (cache->explain->detection+fuzz). Absolute scores ~chance (0.5) because the 3B local scorer is far weaker than the frontier scorers papers use (1.5B failed delphi's output format entirely). Method reproduced; absolute scores scorer-limited. Phase-4 randomized-model GAP (same scorer) is the real signal.
smoke-p70-sae	2026-06-22	modal:train_pythia70m_smoke/sae	4548cb1	EleutherAI/pythia-70m-deduped	layers.3	SAE (sparsify topk)	4096	k=32	NeelNanda/pile-10k	~500k	0	Modal L4	~5s train (30 steps)	~$0.05	trained + saved loadable dict (layers.3/sae.safetensors, 6 files); transcode/skip=F/F	novel	SMOKE only (convergence + save/load); undertrained, NO metric claim. Validates the sparsify wrapper end-to-end.
smoke-p70-tc	2026-06-22	modal:train_pythia70m_smoke/transcoder	4548cb1	EleutherAI/pythia-70m-deduped	layers.3	skip-transcoder (sparsify topk)	4096	k=32	NeelNanda/pile-10k	~500k	0	Modal L4	~5s train	~$0.05	trained + saved loadable dict; transcode/skip=T/T	novel	SMOKE only; validates the transcode+skip-connection path end-to-end.
train-g2-sae	2026-06-22	modal:train_gemma2_2b_l12/sae	114a873	google/gemma-2-2b	layers.12	SAE (sparsify topk)	16384	k=64 (exact L0)	NeelNanda/pile-10k	~10M (1220 steps x bs8 x ctx1024)	0	Modal L4 24GB	~25 min	~$0.33	trained + saved (layers.12/sae.safetensors); transcode/skip=F/F	novel	Custom SAE (ADR-0004 sparsify, bf16+batch8). Reconstruction FVU/VE computed at Phase-3 eval (head-to-head). Budget-constrained token count; comparison fairness comes from identical config vs the transcoder.
train-g2-tc	2026-06-22	modal:train_gemma2_2b_l12/transcoder	114a873	google/gemma-2-2b	layers.12 (MLP in->out)	skip-transcoder (sparsify topk)	16384	k=64 (exact L0)	NeelNanda/pile-10k	~10M (1220 steps x bs8 x ctx1024)	0	Modal L4 24GB	~25 min	~$0.33	trained + saved (layers.12/sae.safetensors); transcode/skip=T/T	novel	Custom skip-transcoder, SAME recipe as train-g2-sae (fair head-to-head). FVU/VE computed at Phase-3 eval.
recon-g2-sae	2026-06-22	modal:recon/sae	58b9f72	google/gemma-2-2b	layers.12 resid_post (HF)	eval (recon, custom SAE)	16384	k=64	NeelNanda/pile-10k[:96]	~12k	0	Modal L4	~5 min	~$0.07	variance_explained=0.514 (CI95 [0.507,0.519])	novel	Custom SAE reconstruction on its own objective (resid). Modest vs Gemma Scope 0.80 = 10M-token budget (undertrained), not a method issue. Transcoder own-objective recon NOT cleanly isolable externally (sparsify transcode hooks) -> reconstruction axis is SAE-only.
ai-g2-sae	2026-06-22	modal:autointerp/sae	2df597f	google/gemma-2-2b	layers.12	eval (delphi auto-interp, custom SAE)	16384	k=64	NeelNanda/pile-10k	~200k cache	0	Modal L4	~25 min	~$0.35	detection=0.540, fuzz=0.523 (58 latents, Qwen2.5-3B local scorer)	inconclusive	Near chance; matches repro-004 Gemma-Scope-SAE (0.544/0.529). Scorer-limited.
ai-g2-tc	2026-06-22	modal:autointerp/transcoder	2df597f	google/gemma-2-2b	layers.12	eval (delphi auto-interp, custom transcoder)	16384	k=64	NeelNanda/pile-10k	~200k cache	0	Modal L4	~25 min	~$0.35	detection=0.539, fuzz=0.546 (61 latents)	inconclusive	HEAD-TO-HEAD vs ai-g2-sae: detection Δ-0.001 CI95[-0.022,+0.022]; fuzz Δ+0.023 CI95[-0.001,+0.047]. Both CIs include 0 -> no significant SAE-vs-transcoder difference at this scorer scale.
ai-g2-7b-ATTEMPT	2026-06-23	modal:autointerp/sae+tc (7B scorer)	bf0826c	google/gemma-2-2b	layers.12	eval (delphi auto-interp, STRONGER local scorer)	16384	k=64	NeelNanda/pile-10k	~200k cache	0	Modal L4 24GB	2x ~5 min (died at startup)	~$0.15 total	NO RESULT on L4: vLLM scorer engine failed to start with Qwen2.5-7B	inconclusive (no result), RESOLVED by ai-g2-sae-7b / ai-g2-tc-7b on A100-40GB	Stronger-scorer attempt to move the near-chance auto-interp bottleneck. BLOCKER (not an OOM): delphi keeps the Gemma-2-2B base model resident on the GPU through scoring, leaving only ~16/22 GiB free on the L4; vLLM's request_memory guard rejects gpu_memory_utilization 0.9 (19.83>16.05 GiB free) and 0.5 underfits the 7B's ~14.3 GiB weights + KV cache. No max_memory fraction works while the base model is resident on a 24 GiB GPU. RESOLUTION: the 7B fits next to the resident base model on an A100-40GB (auto_interp_custom_a100, max_memory=0.65); see ai-g2-sae-7b / ai-g2-tc-7b below for the real scores. The L4 3B path (ai-g2-sae/tc) is unchanged and still reported alongside.
ai-g2-sae-7b	2026-06-23	modal:autointerp/sae (7B scorer)	(this commit)	google/gemma-2-2b	layers.12	eval (delphi auto-interp, custom SAE, 7B scorer)	16384	k=64	NeelNanda/pile-10k	~200k cache	0	Modal A100-40GB	~8 min	~$0.30	detection=0.6072, fuzz=0.6309 (58 latents, Qwen2.5-7B-Instruct, max_memory=0.65)	novel	Stronger LOCAL scorer (C1) on the SAE coder. Both metrics rise WELL above the 3B near-chance (det 0.540->0.607 +0.067; fuzz 0.523->0.631 +0.108): the 3B near-chance was a SCORER artifact, not a coder limit. Same pre-registered latent target as ai-g2-sae; delphi drops different unscoreable latents per coder so the head-to-head is unpaired (n=58). Resolves the ai-g2-7b-ATTEMPT blocker via auto_interp_custom_a100 (base model + 7B coexist on 40 GiB).
ai-g2-tc-7b	2026-06-23	modal:autointerp/transcoder (7B scorer)	(this commit)	google/gemma-2-2b	layers.12	eval (delphi auto-interp, custom transcoder, 7B scorer)	16384	k=64	NeelNanda/pile-10k	~200k cache	0	Modal A100-40GB	~8 min	~$0.30	detection=0.6602, fuzz=0.6895 (60 latents, Qwen2.5-7B-Instruct, max_memory=0.65)	novel	HEAD-TO-HEAD vs ai-g2-sae-7b (unpaired diff-of-means bootstrap, seed 0, 10k resamples, SAME method as the 3B ai-g2 row): detection TC-SAE=+0.053 CI95[+0.016,+0.089]; fuzz TC-SAE=+0.059 CI95[+0.019,+0.097]. Both CIs now EXCLUDE 0 => the skip-transcoder is significantly MORE interpretable than the SAE on both metrics. This FLIPS the 3B verdict (which was inconclusive, both CIs incl 0) and CONFIRMS the pre-registered Transcoders-Beat-SAEs hypothesis on the interpretability axis. recompute: scripts/headtohead_autointerp.py.
ctrl-probe-real	2026-06-22	modal:probing/real	(probing_eval)	google/gemma-2-2b	layers.12 resid (HF)	control (SAE-feature linear probe)	16384	k=64	LabHC/bias_in_bios (prof 21 vs 19)	600 ex	0	Modal L4	~8 min	~$0.10	sae_probe_acc=0.9333 (CI [0.894,0.967])	reproduced(control)	Real-model SAE features separate professions. Paired vs ctrl-probe-random below.
ctrl-steer	2026-06-22	modal:steer	(steer_eval)	google/gemma-2-2b	layers.12	control (steering: SAE-feat vs diff-of-means)	16384	k=64	LabHC/bias_in_bios (prof contrast)	n_gen=16/coef, coefs[2,4,8]xRMS	0	Modal L4	~15 min	~$0.25	SAE-dom success diff=0.0 CI95[-0.25,+0.25]; both best-coef=0 (no steering beat baseline within fluency); baseline_success=0.812	inconclusive	SUPERSEDED by ctrl-steer-v2. Control B (ADR-0005) ran end-to-end but DEGENERATE: "My favorite" prompt baseline already 0.812 (probe ceiling, no headroom) + coarse coefs [2,4,8] all broke the fluency cap => both best-coef=0, diff trivially 0.0. Re-run recalibrated (neutral prompt + finer grid) as ctrl-steer-v2.
ctrl-steer-v2	2026-06-23	modal:steer (recalibrated)	(steer_eval)	google/gemma-2-2b	layers.12	control (steering: SAE-feat vs diff-of-means)	16384	k=64	LabHC/bias_in_bios (prof 21 vs 19, steer->19)	n_scan=8/prompt, n_gen=16/coef, coefs[0.5,1,2,3,4]xRMS	0	Modal L4	~30 min	~$0.45	neutral prompt='This person' (baseline_success=0.562, ppl 9.18, cap 13.76); SAE best coef0.5 success=0.875 (effect +0.312); dom best coef0.5 success=0.938 (effect +0.375); SAE-dom=-0.062 CI95[-0.25,+0.125]	inconclusive	CALIBRATION FIX of ctrl-steer (same ADR-0005 metric/concept, NOT a new Gate-4 decision). Prompt scan over 6 candidates picked the one closest to 0.5 baseline ('This person'=0.5; vs 'My favorite'=1.0, 'I'=1.0 ceilings). Finer grid => both directions now show large fluency-preserving steering effects (no longer degenerate); coefs>=1-2 break fluency (ppl 16->2268). Head-to-head: dom matches/slightly beats the SAE feature, CI incl 0 (R4 honest, AxBench expectation). top_feature=795, resid_rms=104.4.
circuit-g2-sae	2026-06-22	modal:circuit	(circuit_eval)	google/gemma-2-2b	layers.12	circuit (SAE-feature faithfulness vs random)	16384	k=64	LabHC/bias_in_bios (prof 21v19)	600 ex	0	Modal L4	~10 min	~$0.12	top5=0.878 (94% of 0.933 ceiling) vs random5=0.583 gap+0.294 CI[0.206,0.383]; top10=0.906 (97%); top20=0.906; top50=0.922; ALL K beat random (CI excl 0)	novel	SPARSE FAITHFUL circuit (ADR-0006): ~5-10 SAE features carry the profession distinction. Circuit ids [3955,1649,1962,5409,6053,14086,7688,4295,2258,11850]. Caveat: same token-influenced behavior as Control A => partly token features.
saebench-custom-sae (v1, BUGGY)	2026-06-23	modal:saebench-sparseprobing-custom/sae	661119c	google/gemma-2-2b	blocks.12.hook_resid_post	eval (SAEBench sparse_probing, CUSTOM SAE via sparsify->sae_lens adapter)	16384	k=64	LabHC/bias_in_bios_class_set1	probe 1500/500	42	Modal L4 24GB	~6 min	~$0.5 (+~$0.2 E4 probes/pre-flight)	sae_top_1=0.6668 (ARTIFACT)	superseded	SUPERSEDED by saebench-custom-sae-v2: the adapter had an encode bug. `_sparsify_to_topk_sae` set `apply_b_dec_to_input=False` on the premise that sparsify's TopK encode does not subtract a decoder bias. That premise is FALSE: the installed sparsify `SparseCoder.encode` does `if not self.cfg.transcode: x = x - self.b_dec` before the fused encoder (verified in source, probe_sparsify_encode). With `=False` the adapter encoded `x@Wencᵀ+b_enc` while sparsify's true encode is `(x-b_dec)@Wencᵀ+b_enc` (b_dec norm ≈ 90.7), so ~93% DIFFERENT TopK latents fired (encode-fidelity Jaccard ≈ 0.07, cosine 0.139). The 0.6668 was therefore an adapter artifact, not the real SAE's number.
circuit-multilayer	2026-06-23	modal:circuit-multilayer/l5_12_19/w16k	(this commit)	google/gemma-2-2b	blocks.{5,12,19}.hook_resid_post (TransformerLens)	circuit (multi-layer Gemma Scope SAE feature-SET, faithfulness vs random + build-up)	16k x3 (49152 total)	canonical Gemma Scope (~56-62 active/tok)	LabHC/bias_in_bios (prof 21 vs 19)	600 ex	0	Modal L4 24GB	~13 min	~$0.2 (+~$0.15 E4 probe + 1 import-fail)	ceiling(49152 feats)=0.9444; K/layer=3 (9 nodes) circuit=0.9167 (97.1% of ceiling) vs random=0.5944 gap+0.322 CI95[0.239,0.406]; K/layer=5 (15) circuit=0.9389 (99.4%) vs random=0.7778 gap+0.161 CI95[0.094,0.233]; K/layer=10 (30) circuit=0.9500 (100.6%) vs random=0.6667 gap+0.283 CI95[0.206,0.356]; ALL K beat random (CI excl 0). Build-up (K/layer=5): L5=0.9111, L5+L12=0.9389, L5+L12+L19=0.9389	novel	MULTI-LAYER (cross-layer feature-SET) circuit (ADR-0008), the deferred extension of the single-layer circuit-g2-sae. Uses PRETRAINED Gemma Scope SAEs at L5/12/19 (the 3 layers reproduced in repro-002) on the TransformerLens resid_post recipe (BOS excluded; raw HF acts gave VE -4.5 so this recipe is mandatory). Probe-independent attribution (class-mean diff) per layer -> union of per-layer top-K = circuit; fresh probe on circuit features (concat across layers) vs same-size RANDOM cross-layer set vs full ceiling, bootstrap CI on circuit-minus-random gap (R2/R3). A small cross-layer set (9-30 features over 3 layers) is FAITHFUL (97-101% of ceiling) and beats the random cross-layer control at every K (CI excl 0) => sparse multi-layer circuit. Build-up curve: the profession concept is essentially built by L5->L12 (L5 alone 0.911; adding L12 -> 0.939; L19 adds nothing on top, +0.000), i.e. it accumulates by mid-depth and saturates. SCOPE (R4): this is a cross-layer feature-SET circuit + depth build-up, NOT a feature->feature causal edge graph (the heavier sparse-feature-circuits/attribution-patching version is the remaining follow-up). Same token-influence caveat as Control A (much of this profession signal is token-level). Top features per layer (K/layer=5): L5 [12872,5411,14908,28,807], L12 [6810,23,5364,1041,10603], L19 [4346,10992,12025,7663,14180]. result: /root/outputs/circuit_multilayer.json.
saebench-custom-sae-v2	2026-06-23	modal:saebench-sparseprobing-custom/sae	(this commit)	google/gemma-2-2b	blocks.12.hook_resid_post	eval (SAEBench sparse_probing, CUSTOM SAE via sparsify->sae_lens adapter, b_dec FIXED)	16384	k=64	LabHC/bias_in_bios_class_set1	probe 1500/500	42	Modal L4 24GB	~6 min	~$0.15 (+~$0.05 verify)	sae_top_1=0.670; residual(llm) baseline top_1=0.6876; full-feat sae=0.9496 / llm=0.9648	novel	CORRECTED ADR-0007 result. Adapter now sets `apply_b_dec_to_input=True` so sae_lens performs the SAME `(x - b_dec)` shift sparsify does (b_dec is copied into the SAE). ENCODE-FIDELITY (the bug-catching check, `verify_saebench_adapter`, persisted to saebench_adapter_verify.json): adapter.encode reproduces sparsify `coder.encode` EXACTLY on random AND real-resid batches, same active TopK indices (Jaccard 1.0 every row), max abs value diff 6e-6 (random) / 7.6e-5 (real-resid), cosine 1.0. The buggy `=False` variant FAILS the same check (Jaccard ≈ 0.07, cosine 0.139), confirmed in the same run. HONEST RESULT (R4): the budget-trained custom SAE (recon VE 0.51) scores 0.670 < Gemma Scope 0.767 AND < its own residual baseline 0.688; on this single-dataset top-1 probe the budget SAE's best feature does NOT beat the raw residual (opposite of repro-003). The b_dec fix moved the number only +0.003 (0.667→0.670: a top-1 best-single-feature probe is robust to which near-equivalent budget latents are selected), but the result now rests on a verified-correct encode rather than an artifact. Baseline 0.6876 == repro-003's baseline (SAE-independent) => eval is sound + comparable. Transcoder = N/A (R3). Decoder-norm note: mean row norm ~1.004, a few rows drift ~0.07 => check_decoder_norms warns (does not raise). result: /root/outputs/saebench_custom_sae.json; fidelity: /root/outputs/saebench_adapter_verify.json.
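
Sketch of the head-to-head statistic used in the ai-g2-tc and ai-g2-tc-7b rows (unpaired diff-of-means bootstrap, seed 0, 10k resamples). This is NOT the contents of scripts/headtohead_autointerp.py; the function name, the percentile-CI choice, and the per-latent scores below are assumptions and synthetic data for illustration only.

```python
import random


def bootstrap_diff_of_means(sae_scores, tc_scores, n_resamples=10_000, seed=0, alpha=0.05):
    """Unpaired percentile bootstrap for mean(tc_scores) - mean(sae_scores).

    Each group is resampled independently (with replacement), which is what
    "unpaired" means here: the two coders keep different scoreable latents,
    so there is no per-latent pairing to exploit.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        sae_resample = rng.choices(sae_scores, k=len(sae_scores))
        tc_resample = rng.choices(tc_scores, k=len(tc_scores))
        diffs.append(sum(tc_resample) / len(tc_resample) - sum(sae_resample) / len(sae_resample))
    diffs.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - lo_idx - 1
    point = sum(tc_scores) / len(tc_scores) - sum(sae_scores) / len(sae_scores)
    return point, diffs[lo_idx], diffs[hi_idx]


# SYNTHETIC per-latent accuracies (not the logged data): 58 SAE latents, 60 TC latents.
sae_scores = [0.50 + 0.01 * (i % 10) for i in range(58)]
tc_scores = [0.60 + 0.01 * (i % 10) for i in range(60)]

point, ci_lo, ci_hi = bootstrap_diff_of_means(sae_scores, tc_scores)
excludes_zero = ci_lo > 0 or ci_hi < 0
print(f"TC-SAE={point:+.3f} CI95[{ci_lo:+.3f},{ci_hi:+.3f}] excludes_zero={excludes_zero}")
```

Reading the output the same way as the table: a CI that includes 0 is logged as inconclusive (3B scorer rows), a CI that excludes 0 is logged as a significant difference (7B scorer rows).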

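Sketch of the encode-fidelity idea behind the saebench-custom-sae v1/v2 rows: compare the TopK latent sets of a reference encode `(x - b_dec) @ W_encᵀ + b_enc` against an adapter that does or does not subtract `b_dec`. This is NOT `verify_saebench_adapter`; the helper names and the 4-dimensional toy weights are assumptions chosen so the effect can be checked by hand.

```python
def encode_preacts(x, w_enc, b_enc, b_dec=None):
    """Pre-activations of a linear encoder; subtracts b_dec first when given."""
    if b_dec is not None:
        x = [xi - bi for xi, bi in zip(x, b_dec)]
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]


def topk_indices(pre_acts, k):
    """Indices of the k largest pre-activations, as a set."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    return set(order[:k])


def jaccard(a, b):
    """Overlap of two index sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# TOY weights (hypothetical): identity encoder, zero encoder bias, one large b_dec entry.
w_enc = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
b_enc = [0.0, 0.0, 0.0, 0.0]
b_dec = [0.0, 0.0, 0.0, 10.0]
x = [1.0, 2.0, 3.0, 4.0]
k = 2

reference = topk_indices(encode_preacts(x, w_enc, b_enc, b_dec), k)      # subtracts b_dec
adapter_fixed = topk_indices(encode_preacts(x, w_enc, b_enc, b_dec), k)  # same shift applied
adapter_buggy = topk_indices(encode_preacts(x, w_enc, b_enc, None), k)   # shift skipped

print("fixed vs reference:", jaccard(reference, adapter_fixed))
print("buggy vs reference:", jaccard(reference, adapter_buggy))
```

In this toy case the shifted input is [1, 2, 3, -6], so the reference selects latents {1, 2} while the unshifted encode selects {2, 3}; skipping the `b_dec` subtraction changes which latents fire even though the weights are identical.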