From 1b7f82e29de40bbb1b7c939eb069fbb7b1c40291 Mon Sep 17 00:00:00 2001
From: heznpc <heznpc@gmail.com>
Date: Fri, 29 May 2026 07:31:39 +0900
Subject: [PATCH] fix(z-gap): close all 15 findings from xhigh-recall code
 review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Critical fixes (paper claim integrity):
- V8: Strategy D/E/F sys.exit(2) on any failed model unless
      Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
      claim can no longer be silently invalidated by a model dropout.
- V1/V14: SentenceTransformerEmbedder.name now includes the full repo
      path (org/name → org__name) and an `@<sha8>` revision suffix
      (`@unpinned` when no revision is passed). The C3 closure (PR #7)
      now holds end-to-end against SHA bumps and across org/name
      basename collisions.
- V2:  run_cross_experiment_synthesis._normalize_results_envelope()
      unwraps {_meta, results} so legacy consumers keep working with the
      new D/E/F JSON shape; strategy_e and strategy_f added to known files.
- V3:  paper §5.5 / Limitations "20 cells / four models / 20/20" drift
      updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
      (L463 method, L516 P2-resolution, L663 Limitations).
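
Why V8 matters, as a minimal sketch: a standard Holm step-down
adjustment scales the smallest p-value by the family size m, so losing
one model (35 → 30 cells) can move a cell across 0.05. `holm_adjust`
and the p-values below are hypothetical illustrations; the repo's own
correction routine is not part of this diff.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k), capped at 1.
        adj = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted


# Hypothetical raw p-values, identical across cells for clarity.
full = holm_adjust([0.0015] * 35)     # 7 models x 5 languages
partial = holm_adjust([0.0015] * 30)  # one model dropped

print(max(full) > 0.05)     # True: not significant in the full family
print(max(partial) < 0.05)  # True: "significant" only because m shrank
```

The same raw p-values pass or fail depending on m, which is why a
partial run cannot back a claim stated "across 35 cells".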

Statistical fixes (correctness):
- V5:  compute_per_language_R_code substitutes NaN (not 1.0) when
      d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
      random_baseline_R aggregation. New null R range [1.0001, 1.0046]
      (tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
- V6:  permutation p-value uses the (k+1)/(n+1) convention. No cell
      reports literal 0.0 anymore (verified post-rerun: min p ≈ 0.0001
      across all 70 D+F cells). This removes a likely reviewer objection.
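
A minimal, stdlib-only sketch of the V5/V6 convention (the patched
function uses NumPy; `permutation_p_value` and the synthetic null
values here are hypothetical): degenerate permutations are dropped as
NaN instead of being imputed as 1.0, and the p-value is computed as
(k+1)/(n+1) over the valid draws.

```python
import math
import random


def permutation_p_value(observed, null_stats):
    """One-sided permutation p-value using the (k+1)/(n+1) convention."""
    # Drop NaN entries (degenerate permutations) rather than imputing 1.0.
    valid = [s for s in null_stats if not math.isnan(s)]
    if not valid:
        return float("nan")
    n_extreme = sum(1 for s in valid if s >= observed)
    return (n_extreme + 1) / (len(valid) + 1)


# Synthetic null distribution: 9,999 values near 1.0 plus one NaN draw.
rng = random.Random(0)
null = [1.0 + rng.uniform(-0.01, 0.01) for _ in range(9_999)]
null.append(float("nan"))

p = permutation_p_value(1.2, null)
print(p)  # 1/10000: bounded below by 1/(n_valid + 1), never 0.0
```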

Robustness / pitfall fixes:
- V7:  Strategy D/E/F save JSON BEFORE generating figures; figures in
      try/except. Multi-hour compute is no longer lost to a matplotlib
      failure.
- V9:  MistralEmbedder Retry sets respect_retry_after_header=False;
      bounded by backoff_factor=1 (~31s worst case), eliminating the
      multi-hour silent stall mode on server-sent Retry-After.
- V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
      regress on slow server-side processing.
- V11: Strategy E replaces categories[op_id] with categories.get() +
      _label() helper. Empty test sets produce {skip:true} cells with
      NaN accuracy instead of crashing on clf.predict(np.array([])).
- V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
      duplicate list in the error message. 50/50 unique today; future
      collision will fail loudly.
- V13: EmbeddingCache._key() switched from `|`-joined string to a
      JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
      to distinct keys.
- V18: SentenceTransformerEmbedder.dimension falls back to a single-text
      encode probe when the deprecated get_sentence_embedding_dimension()
      returns None. Nomic v1.5 no longer risks int(None) silent skip.
- V20: synthesis script counted the "aggregate" pseudo-key as a 6th
      language; fixed with an explicit `if lang == "aggregate": continue`
      in the per-language counter.
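
The V13 collision can be reproduced with a short sketch. `joined_key`
mirrors the old `|`-joined scheme and `json_key` the new JSON payload;
both helper names are hypothetical stand-ins for EmbeddingCache._key().

```python
import hashlib
import json


def joined_key(model_name, texts):
    # Old scheme: texts are joined with "|" before hashing.
    raw = f"{model_name}:{'|'.join(texts)}".encode()
    return hashlib.sha256(raw).hexdigest()[:16]


def json_key(model_name, texts):
    # New scheme: hash a JSON payload, which keeps list boundaries.
    payload = json.dumps(
        {"model": model_name, "texts": texts},
        ensure_ascii=False, separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


a, b = ["a|b", "c"], ["a", "b|c"]
print(joined_key("m", a) == joined_key("m", b))  # True: both hash "m:a|b|c"
print(json_key("m", a) == json_key("m", b))      # False: payloads differ
```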

Hygiene:
- V4:  Strategy D datetime.datetime.utcnow() →
      datetime.datetime.now(datetime.UTC), matching E/F. utcnow() is
      deprecated since Python 3.12 and slated for removal. Timestamps
      now end in "+00:00" instead of "Z".

Re-execution:
- D/E/F rerun successfully (7/7 models each, ~5 min wall time).
- 35/35 + multi-model P3 + 35/35 OOD all preserved.
- 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
  was printed 1.07).
- OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.

Decisions log:
- planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
  per-finding rationale and the re-execution outcome.
---
 .../scripts/run_cross_experiment_synthesis.py | 19 +++-
 .../scripts/run_strategy_d_code_alignment.py  | 36 +++++---
 .../run_strategy_e_multimodel_probing.py      | 90 ++++++++++++++++---
 .../scripts/run_strategy_f_ood_alignment.py   | 35 ++++++--
 experiments/src/code_alignment.py             | 39 ++++++--
 experiments/src/embeddings.py                 | 83 ++++++++++++++---
 paper/main.tex                                | 16 ++--
 planning/decisions.md                         | 42 +++++++++
 8 files changed, 301 insertions(+), 59 deletions(-)
diff --git a/experiments/scripts/run_cross_experiment_synthesis.py b/experiments/scripts/run_cross_experiment_synthesis.py
index 25eac53..865277a 100644
--- a/experiments/scripts/run_cross_experiment_synthesis.py
+++ b/experiments/scripts/run_cross_experiment_synthesis.py
@@ -42,6 +42,17 @@ def load_json(name: str) -> dict | list:
 # Load all results
 # ────────────────────────────────────────────────────────────
 
+def _normalize_results_envelope(payload):
+    """V2 (review-2026-05-21): unwrap the {_meta, results} envelope used by
+    Strategy D / E / F so legacy consumers expecting a plain list keep working.
+    Strategy D was originally a list of model_results; PR #4+ wraps it in
+    {"_meta": ..., "results": [...]}. This shim handles both shapes.
+    """
+    if isinstance(payload, dict) and "results" in payload and "_meta" in payload:
+        return payload["results"]
+    return payload
+
+
 def load_all_results():
     return {
         "prediction": load_json("prediction_results.json"),
@@ -52,7 +63,9 @@ def load_all_results():
         "strategy_a": load_json("strategy_a_vocab_mediation.json"),
         "strategy_2": load_json("strategy2_langpair_results.json"),
         "strategy_4": load_json("strategy4_prereq_results.json"),
-        "strategy_d": load_json("strategy_d_code_alignment.json"),
+        "strategy_d": _normalize_results_envelope(load_json("strategy_d_code_alignment.json")),
+        "strategy_e": _normalize_results_envelope(load_json("strategy_e_multimodel_probing.json")),
+        "strategy_f": _normalize_results_envelope(load_json("strategy_f_ood_alignment.json")),
         "strategy_6r": load_json("strategy_6r_dialect_results.json"),
         "rcode_token": load_json("rcode_token_control.json"),
     }
@@ -98,6 +111,10 @@ def build_master_summary(results: dict) -> list[dict]:
         for model_result in strat_d:
             per_lang = model_result.get("per_language", {})
             for lang, stats in per_lang.items():
+                # V20 (review-2026-05-21): skip the "aggregate" pseudo-key
+                # written by compute_per_language_R_code; it is not a cell.
+                if lang == "aggregate":
+                    continue
                 if isinstance(stats, dict) and not stats.get("skip"):
                     total_cells += 1
                     if stats.get("p_corrected", 1.0) < 0.05:
diff --git a/experiments/scripts/run_strategy_d_code_alignment.py b/experiments/scripts/run_strategy_d_code_alignment.py
index b48a5cb..d0334c5 100644
--- a/experiments/scripts/run_strategy_d_code_alignment.py
+++ b/experiments/scripts/run_strategy_d_code_alignment.py
@@ -218,7 +218,7 @@ def _build_run_meta() -> dict:
     except Exception:
         torch_version = "unknown"
     return {
-        "started_at_utc": datetime.datetime.utcnow().isoformat() + "Z",
+        "started_at_utc": datetime.datetime.now(datetime.UTC).isoformat(),
         "python": platform.python_version(),
         "platform": platform.platform(),
         "sentence_transformers": st_version,
@@ -281,6 +281,22 @@ def main():
         for (mi, lang), p_corr in zip(p_index, corrected):
             all_results[mi]["per_language"][lang]["p_corrected"] = p_corr
 
+    # V8 (review-2026-05-21): refuse to publish results if any model failed.
+    # Holm-Bonferroni's family-wise denominator depends on the full N; a
+    # partial run would silently invalidate the paper's "across 35 cells"
+    # claim. Set Z_GAP_ALLOW_PARTIAL_RESULTS=1 to override (e.g. debugging).
+    import os as _os
+    if failed_models and _os.environ.get("Z_GAP_ALLOW_PARTIAL_RESULTS") != "1":
+        print(
+            f"\n[FATAL] {len(failed_models)}/{len(MODELS)} model(s) failed; "
+            f"refusing to write partial results.\n"
+            f"        Failed: {[f['label'] for f in failed_models]}\n"
+            f"        Holm-Bonferroni denominator depends on full N.\n"
+            f"        Set Z_GAP_ALLOW_PARTIAL_RESULTS=1 to override.",
+            file=sys.stderr,
+        )
+        sys.exit(2)
+
     # Summary
     print(f"\n{'='*60}")
     print("CROSS-MODEL SUMMARY (Holm-Bonferroni corrected)")
@@ -311,10 +327,8 @@ def main():
 
     print(f"\n  R_code > 1 and significant: {n_supported}/{n_total} cells")
 
-    # Figures
-    make_figures(all_results)
-
-    # Save
+    # V7 (review-2026-05-21): save JSON BEFORE generating figures so a
+    # matplotlib failure does not discard hours of compute.
     RESULTS_DIR.mkdir(parents=True, exist_ok=True)
     out_path = RESULTS_DIR / "strategy_d_code_alignment.json"
 
@@ -325,7 +339,7 @@ def _convert(obj):
         if isinstance(obj, (np.bool_,)): return bool(obj)
         return obj
 
-    run_meta["finished_at_utc"] = datetime.datetime.utcnow().isoformat() + "Z"
+    run_meta["finished_at_utc"] = datetime.datetime.now(datetime.UTC).isoformat()
     run_meta["n_models_attempted"] = len(MODELS)
     run_meta["n_models_succeeded"] = len(all_results)
     run_meta["failed_models"] = failed_models
@@ -334,10 +348,12 @@ def _convert(obj):
     with open(out_path, "w") as f:
         json.dump(payload, f, indent=2, default=_convert)
     print(f"\n  Results saved: {out_path}")
-    if failed_models:
-        print(f"  [WARN] {len(failed_models)} model(s) skipped due to errors:")
-        for err in failed_models:
-            print(f"    - {err['label']}: {err['error_type']}")
+
+    # Figures last (best-effort, isolated from results JSON).
+    try:
+        make_figures(all_results)
+    except Exception as e:  # noqa: BLE001
+        print(f"  [WARN] make_figures failed: {type(e).__name__}: {e}", file=sys.stderr)
 
 
 if __name__ == "__main__":
diff --git a/experiments/scripts/run_strategy_e_multimodel_probing.py b/experiments/scripts/run_strategy_e_multimodel_probing.py
index b012f81..068e279 100644
--- a/experiments/scripts/run_strategy_e_multimodel_probing.py
+++ b/experiments/scripts/run_strategy_e_multimodel_probing.py
@@ -89,15 +89,39 @@ def run_model_probing(model_name: str, label: str, kwargs: dict) -> dict:
     embeddings = {k: embeddings_array[i] for i, k in enumerate(keys)}
     print(f"  {len(embeddings)} NL embeddings ready ({len(ops)} ops × {len(LANGUAGES)} langs)")
 
+    # V11 (review-2026-05-21): guard against missing categories. The
+    # original `categories[op_id]` raised KeyError on any op without a
+    # category field, which the outer try/except silently classified as a
+    # whole-model failure. We now skip the op explicitly and surface a
+    # warning so the failure mode is visible.
+    def _label(op_id: str) -> int | None:
+        cat = categories.get(op_id)
+        if cat is None:
+            return None
+        if cat not in ("computational", "judgment"):
+            return None
+        return 1 if cat == "computational" else 0
+
+    skipped_ops_train = []
     # --- Probe 1: category (chance 50%) ---
     X_train, y_train = [], []
     for op_id in all_ids:
         key = f"{op_id}_en"
         if key in embeddings:
+            lbl = _label(op_id)
+            if lbl is None:
+                skipped_ops_train.append(op_id)
+                continue
             X_train.append(embeddings[key])
-            y_train.append(1 if categories[op_id] == "computational" else 0)
+            y_train.append(lbl)
+    if skipped_ops_train:
+        print(f"  [WARN] skipped {len(skipped_ops_train)} train ops with unknown category: "
+              f"{skipped_ops_train[:5]}{'...' if len(skipped_ops_train) > 5 else ''}",
+              file=sys.stderr)
     X_train = np.array(X_train)
     y_train = np.array(y_train)
+    if len(X_train) == 0:
+        raise RuntimeError("no labeled training samples — every op had an unknown category")
 
     clf_cat = LogisticRegression(max_iter=2000, random_state=SEED, C=1.0)
     clf_cat.fit(X_train, y_train)
@@ -107,8 +131,21 @@ def run_model_probing(model_name: str, label: str, kwargs: dict) -> dict:
         for op_id in all_ids:
             key = f"{op_id}_{lang}"
             if key in embeddings:
+                lbl = _label(op_id)
+                if lbl is None:
+                    continue
                 X_test.append(embeddings[key])
-                y_test.append(1 if categories[op_id] == "computational" else 0)
+                y_test.append(lbl)
+        # V11: guard empty test set so the script reports it instead of
+        # crashing on `clf.predict(np.array([]))`.
+        if not X_test:
+            cat_results[lang] = {
+                "accuracy": float("nan"),
+                "n_correct": 0, "n_total": 0,
+                "p_value_vs_chance": float("nan"),
+                "skip": True,
+            }
+            continue
         X_test = np.array(X_test)
         y_test = np.array(y_test)
         preds = clf_cat.predict(X_test)
@@ -122,7 +159,9 @@ def run_model_probing(model_name: str, label: str, kwargs: dict) -> dict:
             "p_value_vs_chance": _binomial_p_vs_chance(n_correct, n_total, 0.5),
         }
 
-    cat_transfer = float(np.mean([r["accuracy"] for lang, r in cat_results.items() if lang != "en"]))
+    _non_en_cat = [r["accuracy"] for lang, r in cat_results.items()
+                   if lang != "en" and not r.get("skip")]
+    cat_transfer = float(np.nanmean(_non_en_cat)) if _non_en_cat else float("nan")
 
     # --- Probe 2: operation identity (chance 1%) ---
     op_to_idx = {op_id: i for i, op_id in enumerate(all_ids)}
@@ -146,6 +185,15 @@ def run_model_probing(model_name: str, label: str, kwargs: dict) -> dict:
             if key in embeddings:
                 X_test.append(embeddings[key])
                 y_test.append(op_to_idx[op_id])
+        # V11: guard empty test set.
+        if not X_test:
+            op_results[lang] = {
+                "accuracy": float("nan"),
+                "n_correct": 0, "n_total": 0,
+                "p_value_vs_chance": float("nan"),
+                "skip": True,
+            }
+            continue
         X_test = np.array(X_test)
         y_test = np.array(y_test)
         preds = clf_op.predict(X_test)
@@ -158,7 +206,9 @@ def run_model_probing(model_name: str, label: str, kwargs: dict) -> dict:
             "n_total": n_total,
             "p_value_vs_chance": _binomial_p_vs_chance(n_correct, n_total, chance_op),
         }
-    op_transfer = float(np.mean([r["accuracy"] for lang, r in op_results.items() if lang != "en"]))
+    _non_en_op = [r["accuracy"] for lang, r in op_results.items()
+                  if lang != "en" and not r.get("skip")]
+    op_transfer = float(np.nanmean(_non_en_op)) if _non_en_op else float("nan")
 
     # Print
     print(f"\n  Probe 1 (category, chance 50%):")
@@ -249,7 +299,11 @@ def make_heatmaps(all_results: list[dict]):
         for mi, res in enumerate(all_results):
             labels.append(res["label"])
             for li, lang in enumerate(LANGUAGES):
-                matrix[mi, li] = res[probe_key]["per_language"][lang]["accuracy"]
+                # V11: per_language may have been skipped (empty test set);
+                # fall back to NaN so seaborn shows a blank cell instead of
+                # KeyError on a missing key.
+                cell = res[probe_key]["per_language"].get(lang, {})
+                matrix[mi, li] = cell.get("accuracy", float("nan"))
         sns.heatmap(
             matrix, annot=True, fmt=".2f", cmap="YlGn",
             xticklabels=LANGUAGES, yticklabels=labels,
@@ -298,6 +352,18 @@ def main():
             )
             gc.collect()
 
+    # V8 (review-2026-05-21): same partial-success guard as Strategy D.
+    import os as _os
+    if failed_models and _os.environ.get("Z_GAP_ALLOW_PARTIAL_RESULTS") != "1":
+        print(
+            f"\n[FATAL] {len(failed_models)}/{len(MODELS)} model(s) failed; "
+            f"refusing to write partial Strategy E results.\n"
+            f"        Failed: {[f['label'] for f in failed_models]}\n"
+            f"        Set Z_GAP_ALLOW_PARTIAL_RESULTS=1 to override.",
+            file=sys.stderr,
+        )
+        sys.exit(2)
+
     # Summary
     print(f"\n{'='*60}")
     print("CROSS-MODEL P3 SUMMARY")
@@ -311,9 +377,8 @@ def main():
         op_xfer = res["operation_probe"]["mean_transfer"]
         print(f"{res['label']:<25s}  {cat_en:>7.3f}  {cat_xfer:>12.3f}  {op_en:>6.3f}  {op_xfer:>12.3f}")
 
-    make_heatmaps(all_results)
-
-    # Save
+    # V7 (review-2026-05-21): save BEFORE figures so a matplotlib failure
+    # does not lose the probing results.
     RESULTS_DIR.mkdir(parents=True, exist_ok=True)
     run_meta["finished_at_utc"] = datetime.datetime.now(datetime.UTC).isoformat()
     run_meta["n_models_attempted"] = len(MODELS)
@@ -333,10 +398,11 @@ def _convert(obj):
     with open(out_path, "w") as f:
         json.dump(payload, f, indent=2, default=_convert)
     print(f"\n  Results saved: {out_path}")
-    if failed_models:
-        print(f"  [WARN] {len(failed_models)} model(s) skipped:")
-        for err in failed_models:
-            print(f"    - {err['label']}: {err['error_type']}")
+
+    try:
+        make_heatmaps(all_results)
+    except Exception as e:  # noqa: BLE001
+        print(f"  [WARN] make_heatmaps failed: {type(e).__name__}: {e}", file=sys.stderr)
 
 
 if __name__ == "__main__":
diff --git a/experiments/scripts/run_strategy_f_ood_alignment.py b/experiments/scripts/run_strategy_f_ood_alignment.py
index df94f79..765b5e8 100644
--- a/experiments/scripts/run_strategy_f_ood_alignment.py
+++ b/experiments/scripts/run_strategy_f_ood_alignment.py
@@ -75,6 +75,14 @@ def load_ood_stimuli() -> tuple[list[dict], dict[str, str]]:
     with open(DATA_DIR / "tier3_compositional.json") as f:
         tier3 = json.load(f)
     ops = tier2 + tier3
+    # V12 (review-2026-05-21): assert op_id uniqueness across the two tiers
+    # so a future id collision does not silently double-count pairings in
+    # compute_per_language_R_code.
+    op_ids = [op["id"] for op in ops]
+    if len(set(op_ids)) != len(op_ids):
+        from collections import Counter
+        dups = [k for k, v in Counter(op_ids).items() if v > 1]
+        raise ValueError(f"tier2/tier3 op_id collision: {dups}")
     code_equivalents = {op["id"]: op["code"] for op in ops}
     return ops, code_equivalents
 
@@ -253,6 +261,19 @@ def main():
             )
             gc.collect()
 
+    # V8 (review-2026-05-21): refuse partial results so paper's "35/35 OOD
+    # cells" claim is never silently invalidated by a model dropout.
+    import os as _os
+    if failed and _os.environ.get("Z_GAP_ALLOW_PARTIAL_RESULTS") != "1":
+        print(
+            f"\n[FATAL] {len(failed)}/{len(MODELS)} model(s) failed; "
+            f"refusing to write partial Strategy F results.\n"
+            f"        Failed: {[f['label'] for f in failed]}\n"
+            f"        Set Z_GAP_ALLOW_PARTIAL_RESULTS=1 to override.",
+            file=sys.stderr,
+        )
+        sys.exit(2)
+
     # Holm-Bonferroni across all (model, language) cells
     all_p, p_index = [], []
     for mi, res in enumerate(all_results):
@@ -298,9 +319,7 @@ def main():
     print(f"\n  OOD R_code > 1 and significant: {n_sig}/{n_total} cells")
     print(f"  (Strategy D tier1 baseline: 35/35 cells)")
 
-    make_figure(all_results)
-
-    # Save
+    # V7 (review-2026-05-21): save BEFORE figures.
     RESULTS_DIR.mkdir(parents=True, exist_ok=True)
     run_meta["finished_at_utc"] = datetime.datetime.now(datetime.UTC).isoformat()
     run_meta["n_models_attempted"] = len(MODELS)
@@ -323,10 +342,12 @@ def _convert(obj):
     with open(out_path, "w") as f:
         json.dump(payload, f, indent=2, default=_convert)
     print(f"\n  Results saved: {out_path}")
-    if failed:
-        print(f"  [WARN] {len(failed)} model(s) skipped:")
-        for err in failed:
-            print(f"    - {err['label']}: {err['error_type']}")
+
+    # Figures last (best-effort).
+    try:
+        make_figure(all_results)
+    except Exception as e:  # noqa: BLE001
+        print(f"  [WARN] make_figure failed: {type(e).__name__}: {e}", file=sys.stderr)
 
 
 if __name__ == "__main__":
diff --git a/experiments/src/code_alignment.py b/experiments/src/code_alignment.py
index 0c26cd5..cf74d4b 100644
--- a/experiments/src/code_alignment.py
+++ b/experiments/src/code_alignment.py
@@ -172,7 +172,10 @@ def compute_per_language_R_code(
         d_mismatch_arr = np.array(d_mismatch)
         observed_R = float(np.mean(d_mismatch_arr) / np.mean(d_match_arr))
 
-        # Permutation test: shuffle which code each NL is "matched" to
+        # Permutation test: shuffle which code each NL is "matched" to.
+        # V5 (review-2026-05-21): substitute NaN (not 1.0) when a permutation
+        # produces an empty d_match_perm, then drop NaNs before computing
+        # the p-value so the null distribution is not biased toward 1.0.
         perm_Rs = np.empty(n_perm)
         for i in range(n_perm):
             shuffled = rng.permutation(valid_ids)
@@ -186,17 +189,32 @@ def compute_per_language_R_code(
             if d_match_perm:
                 perm_Rs[i] = np.mean(d_mismatch_arr) / np.mean(d_match_perm)
             else:
-                perm_Rs[i] = 1.0
-        p_value = float(np.mean(perm_Rs >= observed_R))
+                perm_Rs[i] = np.nan
+        valid_perm = perm_Rs[~np.isnan(perm_Rs)]
+        n_extreme = int(np.sum(valid_perm >= observed_R))
+        # V6 (review-2026-05-21): use the (k+1)/(n+1) convention so the
+        # reported p_value is bounded below by 1/(n_valid+1) and is never
+        # literal 0.0 — that lower bound is what reviewers expect from a
+        # permutation test with n_perm=10,000.
+        n_valid = int(len(valid_perm))
+        p_value = float((n_extreme + 1) / (n_valid + 1)) if n_valid > 0 else float("nan")
 
-        # Bootstrap CI for R_code
+        # Bootstrap CI for R_code.
+        # V5: NaN fallback for degenerate mean_m so the bootstrap CI is not
+        # silently pulled toward 1.0.
         boot_Rs = np.empty(n_boot)
         for i in range(n_boot):
             idx_m = rng.integers(0, len(d_match_arr), size=len(d_match_arr))
             idx_mm = rng.integers(0, len(d_mismatch_arr), size=len(d_mismatch_arr))
             mean_m = np.mean(d_match_arr[idx_m])
-            boot_Rs[i] = np.mean(d_mismatch_arr[idx_mm]) / mean_m if mean_m > 1e-10 else 1.0
-        ci_lo, ci_hi = float(np.percentile(boot_Rs, 2.5)), float(np.percentile(boot_Rs, 97.5))
+            boot_Rs[i] = np.mean(d_mismatch_arr[idx_mm]) / mean_m if mean_m > 1e-10 else np.nan
+        valid_boot = boot_Rs[~np.isnan(boot_Rs)]
+        if len(valid_boot) > 0:
+            ci_lo = float(np.percentile(valid_boot, 2.5))
+            ci_hi = float(np.percentile(valid_boot, 97.5))
+        else:
+            ci_lo = float("nan")
+            ci_hi = float("nan")
 
         # Cohen's d
         s_pooled = np.sqrt(
@@ -221,9 +239,12 @@ def compute_per_language_R_code(
             # pairings produce the same mean(d_mismatch)/mean(d_match) ratio as
             # matched pairings. Used in paper §5.5 to anchor R_code = 1 as the
             # null line rather than as an asserted-but-unmeasured baseline.
-            "random_baseline_R_mean": float(np.mean(perm_Rs)),
-            "random_baseline_R_std": float(np.std(perm_Rs)),
-            "random_baseline_R_p95": float(np.percentile(perm_Rs, 95)),
+            # V5: NaN-safe aggregation across the valid permutations.
+            "random_baseline_R_mean": float(np.nanmean(perm_Rs)) if n_valid > 0 else float("nan"),
+            "random_baseline_R_std": float(np.nanstd(perm_Rs)) if n_valid > 0 else float("nan"),
+            "random_baseline_R_p95": float(np.nanpercentile(perm_Rs, 95)) if n_valid > 0 else float("nan"),
+            "n_perm_valid": n_valid,
+            "n_boot_valid": int(len(valid_boot)),
         }
 
     # Aggregate (all languages pooled)
diff --git a/experiments/src/embeddings.py b/experiments/src/embeddings.py
index ea3c5b6..83a2a73 100644
--- a/experiments/src/embeddings.py
+++ b/experiments/src/embeddings.py
@@ -1,8 +1,22 @@
-"""Embedding model interfaces: sentence-transformers + OpenAI."""
+"""Embedding model interfaces: sentence-transformers + OpenAI + Mistral.
 
+Code-review-2026-05-21 fixes applied:
+- V1/V14: cache key / model.name now include revision SHA and full org path
+- V13: EmbeddingCache._key() uses JSON-encoded payload (delimiter-collision-free)
+- V18: dimension property falls back to encode-probe when sentence-transformers
+       deprecated `get_sentence_embedding_dimension()` returns None
+- V9:  Mistral retry no longer honors server `Retry-After` (bounded by backoff
+       only), avoiding multi-hour stalls
+- V10: OpenAI timeout 60s → 300s to avoid regressing legacy batch callers
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json as _json
 from abc import ABC, abstractmethod
 from pathlib import Path
-import hashlib
+
 import numpy as np
 
 
@@ -27,6 +41,9 @@ class SentenceTransformerEmbedder(EmbeddingModel):
     def __init__(self, model_name: str = "paraphrase-multilingual-MiniLM-L12-v2", **kwargs):
         from sentence_transformers import SentenceTransformer
         self._model_name = model_name
+        # V1: capture revision (if any) BEFORE forwarding kwargs so .name can
+        # encode it. The SentenceTransformer ctor itself also accepts it.
+        self._revision = kwargs.get("revision", None)
         self._model = SentenceTransformer(model_name, **kwargs)
 
     def encode(self, texts: list[str]) -> np.ndarray:
@@ -34,11 +51,29 @@ def encode(self, texts: list[str]) -> np.ndarray:
 
     @property
     def name(self) -> str:
-        return f"st_{self._model_name.split('/')[-1]}"
+        # V14: keep the full repo path so `intfloat/e5-large` and
+        # `sentence-transformers/e5-large` do not collide.
+        # V1: append revision short SHA so revision bumps produce a new cache key.
+        base = f"st_{self._model_name.replace('/', '__')}"
+        if self._revision:
+            return f"{base}@{self._revision[:8]}"
+        return f"{base}@unpinned"
 
     @property
     def dimension(self) -> int:
-        return self._model.get_sentence_embedding_dimension()
+        # V18: get_sentence_embedding_dimension() is deprecated in
+        # sentence-transformers ≥5.5 and returns None for some
+        # trust_remote_code custom modules. Fall back to a single-text
+        # encode probe so callers get an int.
+        try:
+            d = self._model.get_sentence_embedding_dimension()
+            if d is not None:
+                return int(d)
+        except Exception:
+            pass
+        # Fallback: probe shape from a 1-text encode.
+        probe = self._model.encode(["probe"], show_progress_bar=False, normalize_embeddings=False)
+        return int(np.asarray(probe).shape[-1])
 
 
 class OpenAIEmbedder(EmbeddingModel):
@@ -48,9 +83,10 @@ def __init__(self, model: str = "text-embedding-3-small"):
         import openai
         from dotenv import load_dotenv
         load_dotenv()
-        # max_retries=5 covers transient 429/5xx during multi-model sweeps.
-        # SDK uses exponential backoff with jitter internally.
-        self._client = openai.OpenAI(max_retries=5, timeout=60.0)
+        # V10: 60s previously regressed legacy batch callers whose ≥100-text
+        # batches occasionally cross 60s on server-side. 300s covers them
+        # while still bounded; SDK exponential-backoff retry handles 429/5xx.
+        self._client = openai.OpenAI(max_retries=5, timeout=300.0)
         self._model = model
         self._dim = 1536 if "small" in model else 3072
 
@@ -89,17 +125,23 @@ def __init__(self, model: str = "codestral-embed-2505"):
 
     @staticmethod
     def _make_session():
-        """Session with retry/backoff for 429 + 5xx (matches OpenAI SDK behavior)."""
+        """Session with retry/backoff for 429 + 5xx.
+
+        V9: ``respect_retry_after_header=False`` — a server-sent ``Retry-After``
+        of e.g. 3600s would otherwise block encode() for hours per attempt.
+        We rely on the exponential backoff only: 1s, 2s, 4s, 8s, 16s
+        (31s total worst case), bounded by total=5.
+        """
         import requests
         from requests.adapters import HTTPAdapter
         from urllib3.util.retry import Retry
 
         retry = Retry(
             total=5,
-            backoff_factor=1.0,  # 1s, 2s, 4s, 8s, 16s
+            backoff_factor=1.0,
             status_forcelist=(429, 500, 502, 503, 504),
             allowed_methods=frozenset(["POST"]),
-            respect_retry_after_header=True,
+            respect_retry_after_header=False,  # V9 fix
             raise_on_status=False,
         )
         session = requests.Session()
@@ -125,6 +167,13 @@ def encode(self, texts: list[str]) -> np.ndarray:
         norms = np.linalg.norm(arr, axis=1, keepdims=True)
         return arr / np.maximum(norms, 1e-8)
 
+    def close(self) -> None:
+        """Release the underlying requests Session (file descriptors)."""
+        try:
+            self._session.close()
+        except Exception:
+            pass
+
     @property
     def name(self) -> str:
         return f"mistral_{self._model}"
@@ -144,8 +193,18 @@ def __init__(self, cache_dir: Path):
         self.cache_dir.mkdir(parents=True, exist_ok=True)
 
     def _key(self, model_name: str, texts: list[str]) -> str:
-        h = hashlib.sha256(f"{model_name}:{'|'.join(texts)}".encode()).hexdigest()[:16]
-        return f"{model_name}_{h}"
+        # V13: JSON-encode the (model_name, texts) tuple before hashing so
+        # texts containing `|` (e.g. `s1 | s2`) cannot collide with split
+        # variants. JSON quoting and escaping delimit each string unambiguously.
+        payload = _json.dumps(
+            {"model": model_name, "texts": texts},
+            ensure_ascii=False, separators=(",", ":"),
+        ).encode("utf-8")
+        h = hashlib.sha256(payload).hexdigest()[:16]
+        # Filesystem-safe: model_name now may contain '__' (from SentenceTransformerEmbedder.name)
+        # and '@<sha>' — those are safe on POSIX. Replace any remaining '/' just in case.
+        safe = model_name.replace("/", "__")
+        return f"{safe}_{h}"
 
     def get(self, model_name: str, texts: list[str]) -> np.ndarray | None:
         path = self.cache_dir / f"{self._key(model_name, texts)}.npz"
diff --git a/paper/main.tex b/paper/main.tex
index d91383d..e17b2fd 100644
--- a/paper/main.tex
+++ b/paper/main.tex
@@ -460,7 +460,7 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 P7 is \textbf{strongly supported}: Korean spacing variants cluster ${\sim}3\times$ closer than semantically different operations. This holds across both models with tight bootstrap confidence intervals. Byte-level models (ByT5, CANINE) should achieve higher ratios.
 
 \paragraph{NL-Code Cross-Modal Alignment.}
-To directly test PRH for code, we embed 50 computational NL descriptions alongside their Python code equivalents through four models: UniXcoder~\cite{unixcoder} (code-trained, 768d), MiniLM-L12 (NL-only, 384d), Nomic Embed Text v1.5 (768d), and E5-large (1024d). We define $R_{\text{code}} = d_{\text{mismatch}} / d_{\text{match}}$, where $d_{\text{match}}$ is the distance between an NL description and its corresponding code, and $d_{\text{mismatch}}$ is the distance to a different operation's code. We compute per-language $R_{\text{code}}$ with permutation tests ($n=10{,}000$; shuffled NL-code pairings serve as the random-matching baseline with null $R \approx 1$) and bootstrap confidence intervals ($n=10{,}000$), corrected via Holm-Bonferroni across 20 cells. To our knowledge, this is the first per-language $\times$ per-model NL-code alignment matrix reported in the cross-lingual representation literature; concurrent work on omnilingual sentence-code embeddings extends the modality set at the model level rather than measuring the cross-lingual gradient within a fixed code-stimulus set.
+To directly test PRH for code, we embed 50 computational NL descriptions alongside their Python code equivalents through seven models: UniXcoder~\cite{unixcoder} (code-trained, 768d), MiniLM-L12 (NL-only, 384d), Nomic Embed Text v1.5 (768d), E5-small (384d), E5-base (768d), E5-large (1024d), and BGE-M3. We define $R_{\text{code}} = d_{\text{mismatch}} / d_{\text{match}}$, where $d_{\text{match}}$ is the distance between an NL description and its corresponding code, and $d_{\text{mismatch}}$ is the distance to a different operation's code. We compute per-language $R_{\text{code}}$ with permutation tests ($n=10{,}000$; shuffled NL-code pairings serve as the random-matching baseline with null $R \approx 1$) and bootstrap confidence intervals ($n=10{,}000$), corrected via Holm-Bonferroni across 35 cells (7 models $\times$ 5 languages). To our knowledge, this is the first per-language $\times$ per-model NL-code alignment matrix reported in the cross-lingual representation literature; concurrent work on omnilingual sentence-code embeddings extends the modality set at the model level rather than measuring the cross-lingual gradient within a fixed code-stimulus set.
 
 \begin{table}[h]
 \centering
@@ -469,7 +469,7 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 \toprule
 \textbf{Model} & \textbf{en} & \textbf{ko} & \textbf{zh} & \textbf{ar} & \textbf{es} & \textbf{agg} \\
 \midrule
-UniXcoder (code)   & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.07 \\
+UniXcoder (code)   & 1.22* & 1.01* & 1.08* & 1.01* & 1.05* & 1.06 \\
 MiniLM-L12 (NL)    & 1.23* & 1.12* & 1.18* & 1.10* & 1.19* & 1.16 \\
 Nomic v1.5         & 1.24* & 1.02* & 1.03* & 1.01* & 1.07* & 1.07 \\
 E5-small (NL)      & 1.22* & 1.09* & 1.13* & 1.09* & 1.14* & 1.13 \\
@@ -483,7 +483,7 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 
 All 35 language-model cells show $R_{\text{code}} > 1$ ($p < 0.05$ after Holm-Bonferroni correction across 35 cells): NL descriptions are closer to their corresponding code than to mismatched code in every language and every model. The permutation null mean falls in $R \in [1.000, 1.005]$ across all cells, confirming the effect is not a metric artifact. The result is robust across code-trained (UniXcoder, Nomic), hybrid (BGE-M3), and NL-only (MiniLM, E5 family) architectures.
 
-Three patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.21--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and BGE-M3 (1.16) surpass UniXcoder (1.07) and Nomic (1.07). Third, \textbf{the E5 family (same architecture, identical training recipe, varying dimension) shows partial scale-convergence}: aggregate $R_{\text{code}}$ rises $1.13$ (small, 384d) $\to 1.14$ (base, 768d) $\to 1.20$ (large, 1024d); the small-to-base jump is flat while base-to-large is steep, so P1's monotonic-with-scale prediction holds qualitatively but is non-linear in this regime.
+Three patterns emerge. First, \textbf{$\Dtrain$ modulates NL-code alignment}: English consistently shows the highest $R_{\text{code}}$ (1.21--1.28), while Korean and Arabic show the lowest (1.01--1.16), tracking language representation in code training corpora. Second, \textbf{NL-only models achieve higher $R_{\text{code}}$ than code-trained models}: E5-large (1.20 aggregate) and BGE-M3 (1.16) surpass UniXcoder (1.06) and Nomic (1.07). Third, \textbf{the E5 family (same architecture, identical training recipe, varying dimension) shows partial scale-convergence}: aggregate $R_{\text{code}}$ rises $1.13$ (small, 384d) $\to 1.14$ (base, 768d) $\to 1.20$ (large, 1024d); the small-to-base jump is flat while base-to-large is steep, so P1's monotonic-with-scale prediction holds qualitatively but is non-linear in this regime.
 
 \paragraph{Lexical overlap control.} A potential confound: NL descriptions share tokens with their code equivalents (``sort'' appears in both ``Sort the list'' and \texttt{sorted(lst)}). Token overlap correlates with $d_{\text{match}}$ (Spearman $\rho = -0.51$, $p < 0.001$ for MiniLM), confirming a lexical component. However, $R_{\text{code}} > 1$ survives two controls. First, for the 32/50 operations with \emph{zero} token overlap (after stemming), $R_{\text{code}}$ remains above 1 in all three models (1.06--1.18). Second, obfuscating variable names in code (\texttt{lst}$\to$\texttt{v0}, \texttt{s}$\to$\texttt{v0}) reduces $R_{\text{code}}$ by only 1.6--5.4\%, and all models retain $R_{\text{code}} > 1$. Lexical overlap inflates the effect but does not create it: the alignment is primarily semantic.
 
@@ -499,7 +499,7 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 \toprule
 \textbf{Model} & \textbf{en} & \textbf{ko} & \textbf{zh} & \textbf{ar} & \textbf{es} & \textbf{OOD agg} & \textbf{tier1 agg} \\
 \midrule
-UniXcoder (code)   & 1.54* & 1.02* & 1.21* & 1.02* & 1.13* & 1.15 & 1.07 \\
+UniXcoder (code)   & 1.54* & 1.02* & 1.21* & 1.02* & 1.13* & 1.15 & 1.06 \\
 MiniLM-L12 (NL)    & 1.44* & 1.26* & 1.29* & 1.24* & 1.32* & 1.31 & 1.16 \\
 Nomic v1.5         & 1.76* & 1.04* & 1.07* & 1.04* & 1.21* & 1.16 & 1.07 \\
 E5-small (NL)      & 1.68* & 1.20* & 1.22* & 1.14* & 1.34* & 1.28 & 1.13 \\
@@ -508,12 +508,12 @@ \subsection{Pilot Experiment and Results}\label{sec:pilot}
 BGE-M3 (NL+code)   & 1.59* & 1.31* & 1.30* & 1.26* & 1.38* & 1.36 & 1.16 \\
 \bottomrule
 \end{tabular}
-\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction across 35 OOD cells. All 35/35 OOD cells significant. Permutation-null mean $R \in [1.004, 1.008]$. Cohen's $d$ up to $4.12$ (en, E5-large). Strategy F (\texttt{run\_strategy\_f\_ood\_alignment.py}).}
+\caption*{\small * $p < 0.05$ after Holm-Bonferroni correction across 35 OOD cells. All 35/35 OOD cells significant. Permutation-null mean $R \in [1.004, 1.008]$. Cohen's $d$ up to $4.42$ (en, E5-base). Strategy F (\texttt{run\_strategy\_f\_ood\_alignment.py}).}
 \end{table}
 
-OOD aggregate $R_{\text{code}}$ exceeds tier-1 for every model (UniXcoder $1.07 \to 1.15$; MiniLM $1.16 \to 1.31$; Nomic $1.07 \to 1.16$; E5-small $1.13 \to 1.28$; E5-base $1.14 \to 1.31$; E5-large $1.20 \to 1.33$; BGE-M3 $1.16 \to 1.36$). Multi-step algorithm descriptions are longer and more distinctive than stdlib 1-liners, and multi-line function bodies provide more discriminating signal; the embedding alignment exploits this richer surface form rather than being damaged by reduced co-occurrence frequency. The contamination caveat for tier 1 stands as a methodological honesty point, but the OOD result demonstrates that NL-code alignment survives the most direct contamination control available within the embedding-only paradigm---the effect is not primarily memorization-driven.
+OOD aggregate $R_{\text{code}}$ exceeds tier-1 for every model (UniXcoder $1.06 \to 1.15$; MiniLM $1.16 \to 1.31$; Nomic $1.07 \to 1.16$; E5-small $1.13 \to 1.28$; E5-base $1.14 \to 1.31$; E5-large $1.20 \to 1.33$; BGE-M3 $1.16 \to 1.36$). Multi-step algorithm descriptions are longer and more distinctive than stdlib 1-liners, and multi-line function bodies provide more discriminating signal; the embedding alignment exploits this richer surface form rather than being damaged by reduced co-occurrence frequency. The contamination caveat for tier 1 stands as a methodological honesty point, but the OOD result demonstrates that NL-code alignment survives the most direct contamination control available within the embedding-only paradigm---the effect is not primarily memorization-driven.
 
-Critically, this resolves the P2 result. P2 measured NL-NL cross-lingual invariance and found computational operations \emph{less} invariant---a finding the vocabulary mediation and language-pair analyses explain as a property of domain-specific terminology. The NL-code experiment shows that despite this description-level divergence, NL-code alignment is positive across all 20 cells. The four results form a coherent picture: (i)~computational vocabulary drives cross-lingual description divergence (vocabulary mediation); (ii)~this divergence is uniform across language pairs (language-pair decomposition); (iii)~yet NL-code alignment holds in every language and model (20/20 significant); (iv)~the alignment is modulated by $\Dtrain$, not eliminated. Description-level invariance and execution-level convergence are distinct phenomena---\textbf{convergence $\neq$ communicability}.
+Critically, this resolves the P2 result. P2 measured NL-NL cross-lingual invariance and found computational operations \emph{less} invariant---a finding the vocabulary mediation and language-pair analyses explain as a property of domain-specific terminology. The NL-code experiment shows that despite this description-level divergence, NL-code alignment is positive across all 35 cells (7 models $\times$ 5 languages). The four findings form a coherent picture: (i)~computational vocabulary drives cross-lingual description divergence (vocabulary mediation); (ii)~this divergence is uniform across language pairs (language-pair decomposition); (iii)~yet NL-code alignment holds in every language and model (35/35 significant, plus the OOD extension above); (iv)~the alignment is modulated by $\Dtrain$, not eliminated. Description-level invariance and execution-level convergence are distinct phenomena---\textbf{convergence $\neq$ communicability}.
 
 \paragraph{P7 Extension: Punctuation Robustness.}
 We extend P7 to punctuation and formatting variants. For each of 100 English operations, we generate 10 variants: bare, period, question mark, exclamation, ellipsis, colon, lowercase, UPPERCASE, extra spaces, and article removal. $R_{\text{punct}} = d_{\text{semantic}} / d_{\text{punct}} = 13.6$---punctuation variants are ${\sim}14\times$ closer than semantically different operations, far exceeding spacing robustness ($R_{\text{spacing}} \approx 2.9$). Most variants drift minimally (period: 0.014, question mark: 0.013). The outlier is UPPERCASE (drift = 0.192), which acts as a pragmatic signal (emphasis, shouting)---evidence that $\Zprag$ is encoded in surface-form cues even when $\Zsem$ is unchanged.
@@ -660,7 +660,7 @@ \section*{Limitations}
 
 \textbf{PRH is a hypothesis, not a theorem.} PRH for code as a modality remains a conjecture. Recent evidence (language-independent code semantics~\cite{beyondsyntax2025}, cross-model transferability~\cite{chen2025crossmodel}) is consistent with convergence but not proof.
 
-\textbf{Pilot measures description-level, not execution-level, convergence.} The P2 failure highlights this gap: NL embedding similarity is a proxy, not a direct test, of $\Zsem$ convergence. Our follow-up analyses (vocabulary mediation and language-pair decomposition) explain the P2 failure as a description-level vocabulary phenomenon, and the NL-code alignment experiment confirms execution-level convergence across four models and five languages (20/20 cells significant). However, all models are sentence-level embedders; decoder-only LLM representations may behave differently.
+\textbf{Pilot measures description-level, not execution-level, convergence.} The P2 failure highlights this gap: NL embedding similarity is a proxy, not a direct test, of $\Zsem$ convergence. Our follow-up analyses (vocabulary mediation and language-pair decomposition) explain the P2 failure as a description-level vocabulary phenomenon, and the NL-code alignment experiment confirms execution-level convergence across seven models and five languages (35/35 cells significant, plus a parallel 35/35 result on OOD multi-step / compositional stimuli). However, all models are sentence-level embedders; decoder-only LLM representations may behave differently.
 
 \textbf{The $Z$ stratification is a conceptual framework.} While the pilot provides supporting evidence (P7 supported, P2 failure explained by vocabulary mediation, P3 supported on 7 models with model-class dependence---multilingual NL strong, code-trained / NL+code mixed weak), large-scale probing across decoder-only LLM families (e.g., Llama 3.1 hidden states) and operation-level OOD stimuli (\texttt{tier2\_multistep.json}, \texttt{tier3\_compositional.json} in the experiment repository) remains future work.
 
diff --git a/planning/decisions.md b/planning/decisions.md
index 31920c0..150e583 100644
--- a/planning/decisions.md
+++ b/planning/decisions.md
@@ -137,3 +137,45 @@ Format: `## YYYY-MM-DD -- <short title>` with **Context**, **Decision**, **Why**
   - `experiments/README.md` Reproducibility envelope bullet added: model-weight pinning policy + pointer to the registry's refresh snippet.
 
 **Why**: C3 was originally classified as a Minor TODO because the embedding cache covered the practical reproducibility need. Centralizing the registry now (rather than after another experiment lands) prevents future SHA drift between runners and gives reviewers a single auditable location for "which exact weights did this paper use?"
+
+---
+
+## 2026-05-21 -- Extra-high recall code review (15 findings, all fixed)
+
+**Context**: After PRs #1-#7 landed, `/code-review` ran at xhigh-effort recall mode (5 angles × ≤8 candidates → 1-vote verify → sweep, capped at 15 findings). Output was 15 confirmed/plausible defects spanning paper text drift, statistical method gaps, cache-poisoning vectors, and a partial-success silent-corruption hole in the FWE pipeline.
+
+**Decisions (all 15 fixed in a single review-closure PR)**:
+
+  - **V1 / V14 cache key drops revision + basename collision across orgs**: `SentenceTransformerEmbedder.name` now uses `f"st_{repo.replace('/', '__')}@{rev[:8]}"`. `EmbeddingCache._key` values for `intfloat/x` vs `sentence-transformers/x`, and for the same repo across SHA bumps, are now distinct. C3 closure actually holds end-to-end.
+
+  - **V13 cache key delimiter collision**: `EmbeddingCache._key` switched from `f"{m}:{'|'.join(texts)}"` to a JSON-encoded payload hash so texts containing `|` (e.g. `s1 | s2` in the union stimulus) cannot collide with split variants. Verified inline: `['a|b','c']` and `['a','b|c']` now hash to distinct keys.
+
+  - **V18 dimension None for trust_remote_code modules**: `SentenceTransformerEmbedder.dimension` falls back to a one-text encode probe when the deprecated `get_sentence_embedding_dimension()` returns None. Nomic v1.5 no longer risks a silent skip from `int(None)`.
+
+  - **V9 Mistral Retry-After hang**: `respect_retry_after_header=False` on the urllib3 Retry; backoff_factor=1 bounds total wait to ~31s instead of up to 5× server-sent `Retry-After`. Eliminates the multi-hour silent stall mode.
+
+  - **V10 OpenAI timeout regression**: 60s → 300s. Legacy batch callers that need >60s server-side processing no longer hit a spurious timeout under the new client.
+
+  - **V5 perm/bootstrap fallback to 1.0**: NaN is now substituted instead, and `random_baseline_R_mean` uses `np.nanmean`. New null R range: [1.0001, 1.0046] (tier1) / [1.0005, 1.0086] (OOD), still ≈1 as the paper claims, but no longer biased by silent 1.0 imputations.
+
+  - **V6 p_value floor**: `(n_extreme + 1) / (n_valid + 1)` convention adopted. Reported p-values are now bounded below by `1/(n_perm+1) ≈ 1e-4`; no cell reports literal `0.0` (verified post-rerun: min nonzero p = 0.0001 across all 70 D+F cells). Reviewer push-back surface closed.
+
+  - **V8 partial-success FWE silent invalidation**: Strategy D/E/F `main()` now calls `sys.exit(2)` on any failed model unless `Z_GAP_ALLOW_PARTIAL_RESULTS=1` is set. The "across 35 cells" claim in the paper can no longer be silently invalidated by a single OOM or `trust_remote_code` drift. The Nomic einops episode from PR #4 is the exact failure mode this guards against; the previous lenient behavior would have let it slip if `failed_models` had been ignored.
+
+  - **V7 figures-before-save**: Strategy D/E/F save JSON BEFORE generating figures, with figures in a try/except. Multi-hour compute is no longer lost to a matplotlib font-cache failure.
+
+  - **V11 Strategy E `categories[op_id]` KeyError**: replaced with `categories.get()` + explicit `_label` helper that returns None for unknown categories. Empty per-language test sets also produce `{skip: true}` cells with NaN accuracy instead of crashing on `clf.predict(np.array([]))`.
+
+  - **V12 tier2/tier3 op_id uniqueness**: `load_ood_stimuli()` now asserts uniqueness with the duplicate list surfaced in the error message. Today's stimuli pass (verified inline: 50/50 unique), but a future id collision will fail loudly.
+
+  - **V2 synthesis JSON envelope shim**: `_normalize_results_envelope()` unwraps `{_meta, results}` to a plain list so `run_cross_experiment_synthesis.py` keeps working with the new D/E/F JSON shape. Also added strategy_e and strategy_f to its known-files list.
+
+  - **V20 synthesis treats `aggregate` as a 6th language**: explicit `if lang == "aggregate": continue` in the per-language counter loop. The "n_significant / total_cells" rate is now denominated against the real 5-language × 7-model = 35 grid, not 42.
+
+  - **V4 datetime.utcnow deprecation in Strategy D**: replaced with `datetime.datetime.now(datetime.UTC)` to match Strategy E/F and survive the eventual removal of `utcnow()` (deprecated since Python 3.12).
+
+  - **V3 paper §5.5 / Limitations "20 cells / four models / 20/20" drift (3 locations)**: updated to "35 cells / seven models / 35/35 + OOD 35/35", matching the Strategy D/E/F tables already inserted in PR #4/#5/#6.
+
+**Re-execution**: Strategies D/E/F rerun after all fixes (~5 min, 7/7 models succeeded each). Cell-level R_code values unchanged at 2-decimal precision except UniXcoder tier1 aggregate (1.0649 ≈ 1.06 vs. previously printed 1.07 — rounding). Cohen's d_max for OOD shifted from E5-large (4.12) to E5-base (4.42); paper updated. All 35/35 + 35/35 + multi-model P3 conclusions hold.
+
+**Why**: Recall-mode review surfaces real bugs at the cost of some false positives. Of the 15 confirmed findings, V8 (silent FWE invalidation) and V1 (cache key drops revision) would have been the most damaging if discovered after EMNLP submission. Closing them all in a single review-closure PR keeps the paper-evidence chain (Strategy D 35/35 tier1, P3 7-model, Strategy F 35/35 OOD) sound under reviewer push-back.
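
The V5 and V6 conventions in the decision log above can be illustrated with a minimal sketch. This is not the repository's code: the helper name `permutation_p_value` and the toy null array are assumptions made for illustration only. It shows the `(n_extreme + 1) / (n_valid + 1)` p-value floor and NaN-aware aggregation of the null statistics.

```python
import numpy as np


def permutation_p_value(observed, null_stats):
    """One-sided permutation p-value with the (k+1)/(n+1) convention.

    Hypothetical helper: NaN entries (failed permutations) are dropped
    rather than imputed, and the +1 terms keep p strictly above zero.
    """
    null = np.asarray(null_stats, dtype=float)
    valid = null[~np.isnan(null)]
    n_extreme = int(np.sum(valid >= observed))
    return (n_extreme + 1) / (valid.size + 1)


# Toy null distribution centred on R = 1, plus two failed permutations as NaN.
null_R = np.concatenate([np.linspace(0.99, 1.01, 9_999), [np.nan, np.nan]])

p = permutation_p_value(1.20, null_R)  # no null value reaches 1.20
baseline = np.nanmean(null_R)          # NaN-aware mean of the null statistics

print(p, baseline)
```

With this convention the smallest reportable p-value is `1 / (n_valid + 1)`, so a cell can never report a literal `0.0`. A plain `np.mean` over the same array would return NaN, which is why the aggregation step needs `np.nanmean` once NaN replaces the old 1.0 fallback.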