fix(z-gap): close all 15 findings from xhigh-recall code review by heznpc · Pull Request #8 · heznpc/z-gap

heznpc · 2026-05-28T22:31:42Z

Critical fixes (paper claim integrity):

V8: Strategy D/E/F sys.exit(2) on any failed model unless
Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
claim can no longer be silently invalidated by a model dropout.
V1/V14: SentenceTransformerEmbedder.name now includes full repo path
(org/name → org__name) and @<sha8> revision suffix. C3 closure
(PR experiments(z-gap): frozen model registry with HuggingFace SHAs (closes C3) #7) now actually holds end-to-end against SHA bumps and across
org/name basename collisions.
V2: run_cross_experiment_synthesis._normalize_results_envelope()
unwraps {_meta, results} so legacy consumers keep working with the
new D/E/F JSON shape; strategy_e and strategy_f added to known files.
V3: paper §5.5 / Limitations "20 cells / four models / 20/20" drift
updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
(L463 method, L516 P2-resolution, L663 Limitations).

Statistical fixes (correctness):

V5: compute_per_language_R_code substitutes NaN (not 1.0) when
d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
random_baseline_R aggregation. New null R range [1.0001, 1.0046]
(tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
V6: permutation p-value uses (k+1)/(n+1) convention. No cell reports
literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
across all 70 D+F cells). Reviewer push-back surface closed.

Robustness / pitfall fixes:

V7: Strategy D/E/F save JSON BEFORE generating figures; figures in
try/except. Multi-hour compute no longer lost to matplotlib fail.
V9: MistralEmbedder Retry sets respect_retry_after_header=False;
bounded by backoff_factor=1 (~31s worst case), eliminating the
multi-hour silent stall mode on server-sent Retry-After.
V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
regress on slow server-side processing.
V11: Strategy E replaces categories[op_id] with categories.get() +
_label() helper. Empty test sets produce {skip:true} cells with
NaN accuracy instead of crashing on clf.predict(np.array([])).
V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
duplicate list in the error message. 50/50 unique today; future
collision will fail loudly.
V13: EmbeddingCache._key() switched from |-joined string to a
JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
to distinct keys.
V18: SentenceTransformerEmbedder.dimension falls back to a single-text
encode probe when the deprecated get_sentence_embedding_dimension()
returns None. Nomic v1.5 no longer risks int(None) silent skip.
V20: synthesis script counts aggregate as a 6th language → explicit
if lang == "aggregate": continue in the per-language counter.

Hygiene:

V4: Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC),
matching E/F and surviving future Python ≥3.13 removal.

Re-execution:

D/E/F rerun successfully (7/7 models each, ~5 min wall time).
35/35 + multi-model P3 + 35/35 OOD all preserved.
2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
was printed 1.07).
OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.

Decisions log:

planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
per-finding rationale and the re-execution outcome.

Critical fixes (paper claim integrity): - V8: Strategy D/E/F sys.exit(2) on any failed model unless Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells" claim can no longer be silently invalidated by a model dropout. - V1/V14: SentenceTransformerEmbedder.name now includes full repo path (org/name → org__name) and `@<sha8>` revision suffix. C3 closure (PR #7) now actually holds end-to-end against SHA bumps and across org/name basename collisions. - V2: run_cross_experiment_synthesis._normalize_results_envelope() unwraps {_meta, results} so legacy consumers keep working with the new D/E/F JSON shape; strategy_e and strategy_f added to known files. - V3: paper §5.5 / Limitations "20 cells / four models / 20/20" drift updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places (L463 method, L516 P2-resolution, L663 Limitations). Statistical fixes (correctness): - V5: compute_per_language_R_code substitutes NaN (not 1.0) when d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for random_baseline_R aggregation. New null R range [1.0001, 1.0046] (tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations. - V6: permutation p-value uses (k+1)/(n+1) convention. No cell reports literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001 across all 70 D+F cells). Reviewer push-back surface closed. Robustness / pitfall fixes: - V7: Strategy D/E/F save JSON BEFORE generating figures; figures in try/except. Multi-hour compute no longer lost to matplotlib fail. - V9: MistralEmbedder Retry sets respect_retry_after_header=False; bounded by backoff_factor=1 (~31s worst case), eliminating the multi-hour silent stall mode on server-sent Retry-After. - V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer regress on slow server-side processing. - V11: Strategy E replaces categories[op_id] with categories.get() + _label() helper. Empty test sets produce {skip:true} cells with NaN accuracy instead of crashing on clf.predict(np.array([])). - V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the duplicate list in the error message. 50/50 unique today; future collision will fail loudly. - V13: EmbeddingCache._key() switched from `|`-joined string to a JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash to distinct keys. - V18: SentenceTransformerEmbedder.dimension falls back to a single-text encode probe when the deprecated get_sentence_embedding_dimension() returns None. Nomic v1.5 no longer risks int(None) silent skip. - V20: synthesis script counts aggregate as a 6th language → explicit `if lang == "aggregate": continue` in the per-language counter. Hygiene: - V4: Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC), matching E/F and surviving future Python ≥3.13 removal. Re-execution: - D/E/F rerun successfully (7/7 models each, ~5 min wall time). - 35/35 + multi-model P3 + 35/35 OOD all preserved. - 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06, was printed 1.07). - OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated. Decisions log: - planning/decisions.md: 2026-05-21 entry covering all 15 fixes with per-finding rationale and the re-execution outcome.

heznpc merged commit 3dcd986 into main May 28, 2026

heznpc deleted the chore/code-review-fixes-2026-05-21 branch May 28, 2026 22:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(z-gap): close all 15 findings from xhigh-recall code review#8

fix(z-gap): close all 15 findings from xhigh-recall code review#8
heznpc merged 1 commit into
mainfrom
chore/code-review-fixes-2026-05-21

heznpc commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

heznpc commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant