experiments(z-gap): frozen model registry with HuggingFace SHAs (closes C3) by heznpc · Pull Request #7 · heznpc/z-gap

heznpc · 2026-05-20T18:08:16Z

Added experiments/src/model_registry.py:

- MODELS_7_FROZEN: 7 (model_name, label, kwargs) tuples used by
  Strategy D / E / F, each pinned to the main commit SHA observed via
  huggingface_hub.HfApi on 2026-05-21.
- registry_sha_summary(): JSON-serializable mapping for run_meta blocks.

UniXcoder 5604afdc964f6c53782a6813140ade5216b99006
MiniLM-L12 e8f8c211226b894fcb81acc59f3b34ba3efd5f42
Nomic v1.5 e9b6763023c676ca8431644204f50c2b100d9aab
E5-small 614241f622f53c4eeff9890bdc4f31cfecc418b3
E5-base d128750597153bb5987e10b1c3493a34e5a4502a
E5-large 3d7cfbdacd47fdda877c5cd8a79fbcc4f2a574f3
BGE-M3 5617a9f61b028005a4858fdac845db406aefb181
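
As a rough illustration of the registry described above, the sketch below shows one way the tuples and the summary helper could fit together. The repo IDs under `example-org/` are placeholders (the description lists short labels only), and storing the SHA under a `revision` key in kwargs, as well as keying the summary by model name, are assumptions rather than the actual contents of model_registry.py.

```python
from typing import Any

# Placeholder repo IDs (hypothetical); the SHAs are the ones listed above.
# Assumption: each kwargs dict carries the pinned SHA under "revision".
MODELS_7_FROZEN: list[tuple[str, str, dict[str, Any]]] = [
    ("example-org/unixcoder", "UniXcoder",
     {"revision": "5604afdc964f6c53782a6813140ade5216b99006"}),
    ("example-org/minilm-l12", "MiniLM-L12",
     {"revision": "e8f8c211226b894fcb81acc59f3b34ba3efd5f42"}),
    ("example-org/nomic-v1.5", "Nomic v1.5",
     {"revision": "e9b6763023c676ca8431644204f50c2b100d9aab"}),
    ("example-org/e5-small", "E5-small",
     {"revision": "614241f622f53c4eeff9890bdc4f31cfecc418b3"}),
    ("example-org/e5-base", "E5-base",
     {"revision": "d128750597153bb5987e10b1c3493a34e5a4502a"}),
    ("example-org/e5-large", "E5-large",
     {"revision": "3d7cfbdacd47fdda877c5cd8a79fbcc4f2a574f3"}),
    ("example-org/bge-m3", "BGE-M3",
     {"revision": "5617a9f61b028005a4858fdac845db406aefb181"}),
]


def registry_sha_summary(models=MODELS_7_FROZEN) -> dict[str, str]:
    """Return a JSON-serializable {model_name: sha} mapping."""
    return {name: kwargs["revision"] for name, _label, kwargs in models}
```

Keeping the SHA inside kwargs would let a runner pass the whole dict straight to the model constructor, so the pin cannot be dropped by accident.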

Refactored 3 runners to:

    from src.model_registry import MODELS_7_FROZEN, registry_sha_summary
    MODELS = MODELS_7_FROZEN

Inline MODELS lists removed from Strategy D / E / F. Each runner's
run_meta now embeds `model_revisions: {model: sha}` so SHAs are recorded
in every results JSON for forensic reproducibility.
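
The run_meta embedding can be pictured with a minimal sketch like the one below. The helper names `build_run_meta` and `wrap_results`, the `strategy` and `timestamp_utc` fields, and the placement of run_meta under a `_meta` key are assumptions for illustration; only the `model_revisions: {model: sha}` field comes from the description above.

```python
import json
from datetime import datetime, timezone


def build_run_meta(strategy: str, model_revisions: dict[str, str]) -> dict:
    """Assemble a metadata block that records the pinned SHAs (hypothetical helper)."""
    return {
        "strategy": strategy,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "model_revisions": dict(model_revisions),
    }


def wrap_results(results: dict, run_meta: dict) -> dict:
    """Assumed envelope shape: metadata next to the results payload."""
    return {"_meta": run_meta, "results": results}


# Example with a single placeholder model name and a SHA from the list above.
revisions = {"example-org/e5-base": "d128750597153bb5987e10b1c3493a34e5a4502a"}
payload = wrap_results({"cells": []}, build_run_meta("D", revisions))
serialized = json.dumps(payload, indent=2)
```

Because the SHAs travel inside every results file, a later reader can tell which weights produced a number without consulting the repository history.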

sentence-transformers >=5.5 accepts revision= in SentenceTransformer.__init__
(confirmed via inspect.signature on the installed 5.5.1). Embedding-level
reproducibility is additionally guaranteed by EmbeddingCache (.npz keyed
by (model_name, text_hash)).
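
The inspect.signature check mentioned above could look roughly like this. `has_named_param` is a hypothetical helper, and the two stand-in classes exist only so the sketch runs without sentence-transformers installed; they are not part of any library.

```python
import inspect


def has_named_param(func, name: str) -> bool:
    """True if the callable declares `name` as an explicit parameter.

    A bare **kwargs catch-all is deliberately not counted: it would accept
    the keyword syntactically without guaranteeing that it is honored.
    """
    return name in inspect.signature(func).parameters


class _StandInEncoder:
    """Stand-in whose constructor declares revision explicitly."""

    def __init__(self, model_name_or_path, revision=None):
        self.model_name_or_path = model_name_or_path
        self.revision = revision


class _LegacyEncoder:
    """Stand-in for a constructor with no revision parameter."""

    def __init__(self, model_name_or_path):
        self.model_name_or_path = model_name_or_path


supports_pin = has_named_param(_StandInEncoder, "revision")
legacy_supports_pin = has_named_param(_LegacyEncoder, "revision")
```

Against the real library the same call would be `has_named_param(SentenceTransformer, "revision")`, run once at import time so that an unsupported version fails fast instead of silently loading unpinned weights.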

Docs:

- experiments/README.md: Reproducibility envelope bullet added with
  the model-weight pinning policy + pointer to the refresh snippet.
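
The README's refresh snippet is not reproduced in this description. As a hedged sketch of what refreshing the pins might involve, the code below asks an API object for the current `main` SHA of each repo and reports drift against the pinned values. `fetch_main_shas` and `diff_against_registry` are hypothetical names; the `api` argument is assumed to behave like `huggingface_hub.HfApi`, whose `model_info(repo_id, revision=...)` returns an object with a `sha` attribute.

```python
def fetch_main_shas(repo_ids, api) -> dict[str, str]:
    """Look up the commit SHA currently at `main` for each repo.

    `api` is expected to expose model_info(repo_id, revision=...) and
    return an object with a `.sha` attribute, as huggingface_hub.HfApi does.
    """
    return {
        repo_id: api.model_info(repo_id, revision="main").sha
        for repo_id in repo_ids
    }


def diff_against_registry(pinned: dict[str, str],
                          current: dict[str, str]) -> dict[str, tuple]:
    """Return {repo_id: (pinned_sha, current_sha)} for every repo that moved."""
    return {
        repo_id: (pinned.get(repo_id), sha)
        for repo_id, sha in current.items()
        if pinned.get(repo_id) != sha
    }


# Usage against the live Hub (requires network access and huggingface_hub):
#   from huggingface_hub import HfApi
#   current = fetch_main_shas(list(pinned), HfApi())
#   print(diff_against_registry(pinned, current))
```

Passing the API object in, rather than constructing it inside the function, keeps the drift check testable offline.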

Decisions log:

- planning/decisions.md: 2026-05-21 model registry entry.

Closes C3 from the 2026-05-21 pre-experiment review.


Critical fixes (paper claim integrity):
- V8: Strategy D/E/F sys.exit(2) on any failed model unless Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells" claim can no longer be silently invalidated by a model dropout.
- V1/V14: SentenceTransformerEmbedder.name now includes full repo path (org/name → org__name) and `@<sha8>` revision suffix. C3 closure (PR #7) now actually holds end-to-end against SHA bumps and across org/name basename collisions.
- V2: run_cross_experiment_synthesis._normalize_results_envelope() unwraps {_meta, results} so legacy consumers keep working with the new D/E/F JSON shape; strategy_e and strategy_f added to known files.
- V3: paper §5.5 / Limitations "20 cells / four models / 20/20" drift updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places (L463 method, L516 P2-resolution, L663 Limitations).

Statistical fixes (correctness):
- V5: compute_per_language_R_code substitutes NaN (not 1.0) when d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for random_baseline_R aggregation. New null R range [1.0001, 1.0046] (tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
- V6: permutation p-value uses (k+1)/(n+1) convention. No cell reports literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001 across all 70 D+F cells). Reviewer push-back surface closed.

Robustness / pitfall fixes:
- V7: Strategy D/E/F save JSON BEFORE generating figures; figures in try/except. Multi-hour compute no longer lost to matplotlib fail.
- V9: MistralEmbedder Retry sets respect_retry_after_header=False; bounded by backoff_factor=1 (~31s worst case), eliminating the multi-hour silent stall mode on server-sent Retry-After.
- V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer regress on slow server-side processing.
- V11: Strategy E replaces categories[op_id] with categories.get() + _label() helper. Empty test sets produce {skip:true} cells with NaN accuracy instead of crashing on clf.predict(np.array([])).
- V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the duplicate list in the error message. 50/50 unique today; future collision will fail loudly.
- V13: EmbeddingCache._key() switched from `|`-joined string to a JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash to distinct keys.
- V18: SentenceTransformerEmbedder.dimension falls back to a single-text encode probe when the deprecated get_sentence_embedding_dimension() returns None. Nomic v1.5 no longer risks int(None) silent skip.
- V20: synthesis script counts aggregate as a 6th language → explicit `if lang == "aggregate": continue` in the per-language counter.

Hygiene:
- V4: Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC), matching E/F and surviving future Python ≥3.13 removal.

Re-execution:
- D/E/F rerun successfully (7/7 models each, ~5 min wall time).
- 35/35 + multi-model P3 + 35/35 OOD all preserved.
- 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06, was printed 1.07).
- OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.

Decisions log:
- planning/decisions.md: 2026-05-21 entry covering all 15 fixes with per-finding rationale and the re-execution outcome.
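
Two of the fixes above, V6 and V13, are easy to picture in code. The sketch below is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, the p-value is assumed to be one-sided (upper tail), and SHA-256 plus the payload field names are arbitrary choices.

```python
import hashlib
import json


def permutation_p_value(observed: float, null_stats) -> float:
    """(k + 1) / (n + 1) convention: k null statistics are at least as extreme.

    With n permutations the smallest attainable value is 1 / (n + 1),
    so the result is never a literal 0.0.
    """
    null_stats = list(null_stats)
    k = sum(1 for s in null_stats if s >= observed)
    return (k + 1) / (len(null_stats) + 1)


def pipe_joined_key(model_name: str, texts) -> str:
    """Old-style key: ambiguous when a text itself contains '|'."""
    raw = model_name + "|" + "|".join(texts)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def json_payload_key(model_name: str, texts) -> str:
    """JSON-encoded payload: list boundaries survive, so keys stay distinct."""
    payload = json.dumps(
        {"model": model_name, "texts": list(texts)},
        ensure_ascii=False,
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The collision named in V13: both lists join to the same string "a|b|c".
collides = pipe_joined_key("m", ["a|b", "c"]) == pipe_joined_key("m", ["a", "b|c"])
distinct = json_payload_key("m", ["a|b", "c"]) != json_payload_key("m", ["a", "b|c"])
```

JSON encoding quotes each list element separately, which is why the delimiter ambiguity disappears without having to pick a character that can never occur in the input.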

heznpc merged commit 638a9a0 into main May 20, 2026

heznpc deleted the chore/model-registry-sha-pin-2026-05-21 branch May 20, 2026 18:08

heznpc mentioned this pull request May 28, 2026

fix(z-gap): close all 15 findings from xhigh-recall code review #8

Merged
