fix(z-gap): close all 15 findings from xhigh-recall code review#8
Merged
Conversation
Critical fixes (paper claim integrity):
- V8: Strategy D/E/F sys.exit(2) on any failed model unless
Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
claim can no longer be silently invalidated by a model dropout.
- V1/V14: SentenceTransformerEmbedder.name now includes full repo path
(org/name → org__name) and `@<sha8>` revision suffix. C3 closure
(PR #7) now actually holds end-to-end against SHA bumps and across
org/name basename collisions.
- V2: run_cross_experiment_synthesis._normalize_results_envelope()
unwraps {_meta, results} so legacy consumers keep working with the
new D/E/F JSON shape; strategy_e and strategy_f added to known files.
- V3: paper §5.5 / Limitations "20 cells / four models / 20/20" drift
updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
(L463 method, L516 P2-resolution, L663 Limitations).
Statistical fixes (correctness):
- V5: compute_per_language_R_code substitutes NaN (not 1.0) when
d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
random_baseline_R aggregation. New null R range [1.0001, 1.0046]
(tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
- V6: permutation p-value uses (k+1)/(n+1) convention. No cell reports
literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
across all 70 D+F cells). Reviewer push-back surface closed.
Robustness / pitfall fixes:
- V7: Strategy D/E/F save JSON BEFORE generating figures; figures in
try/except. Multi-hour compute no longer lost to matplotlib fail.
- V9: MistralEmbedder Retry sets respect_retry_after_header=False;
bounded by backoff_factor=1 (~31s worst case), eliminating the
multi-hour silent stall mode on server-sent Retry-After.
- V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
regress on slow server-side processing.
- V11: Strategy E replaces categories[op_id] with categories.get() +
_label() helper. Empty test sets produce {skip:true} cells with
NaN accuracy instead of crashing on clf.predict(np.array([])).
- V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
duplicate list in the error message. 50/50 unique today; future
collision will fail loudly.
- V13: EmbeddingCache._key() switched from `|`-joined string to a
JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
to distinct keys.
- V18: SentenceTransformerEmbedder.dimension falls back to a single-text
encode probe when the deprecated get_sentence_embedding_dimension()
returns None. Nomic v1.5 no longer risks int(None) silent skip.
- V20: synthesis script counts aggregate as a 6th language → explicit
`if lang == "aggregate": continue` in the per-language counter.
Hygiene:
- V4: Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC),
matching E/F and surviving future Python ≥3.13 removal.
Re-execution:
- D/E/F rerun successfully (7/7 models each, ~5 min wall time).
- 35/35 + multi-model P3 + 35/35 OOD all preserved.
- 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
was printed 1.07).
- OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.
Decisions log:
- planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
per-finding rationale and the re-execution outcome.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Critical fixes (paper claim integrity):
Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
claim can no longer be silently invalidated by a model dropout.
(org/name → org__name) and
@<sha8>revision suffix. C3 closure(PR experiments(z-gap): frozen model registry with HuggingFace SHAs (closes C3) #7) now actually holds end-to-end against SHA bumps and across
org/name basename collisions.
unwraps {_meta, results} so legacy consumers keep working with the
new D/E/F JSON shape; strategy_e and strategy_f added to known files.
updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
(L463 method, L516 P2-resolution, L663 Limitations).
Statistical fixes (correctness):
d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
random_baseline_R aggregation. New null R range [1.0001, 1.0046]
(tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
across all 70 D+F cells). Reviewer push-back surface closed.
Robustness / pitfall fixes:
try/except. Multi-hour compute no longer lost to matplotlib fail.
bounded by backoff_factor=1 (~31s worst case), eliminating the
multi-hour silent stall mode on server-sent Retry-After.
regress on slow server-side processing.
_label() helper. Empty test sets produce {skip:true} cells with
NaN accuracy instead of crashing on clf.predict(np.array([])).
duplicate list in the error message. 50/50 unique today; future
collision will fail loudly.
|-joined string to aJSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
to distinct keys.
encode probe when the deprecated get_sentence_embedding_dimension()
returns None. Nomic v1.5 no longer risks int(None) silent skip.
if lang == "aggregate": continuein the per-language counter.Hygiene:
matching E/F and surviving future Python ≥3.13 removal.
Re-execution:
was printed 1.07).
Decisions log:
per-finding rationale and the re-execution outcome.