experiments(z-gap): frozen model registry with HuggingFace SHAs (closes C3)#7
Merged
Merged
Conversation
…es C3)
Added experiments/src/model_registry.py:
- MODELS_7_FROZEN: 7 (model_name, label, kwargs) tuples used by
Strategy D / E / F, each pinned to the main commit SHA observed via
huggingface_hub.HfApi on 2026-05-21.
- registry_sha_summary(): JSON-serializable mapping for run_meta blocks.
UniXcoder 5604afdc964f6c53782a6813140ade5216b99006
MiniLM-L12 e8f8c211226b894fcb81acc59f3b34ba3efd5f42
Nomic v1.5 e9b6763023c676ca8431644204f50c2b100d9aab
E5-small 614241f622f53c4eeff9890bdc4f31cfecc418b3
E5-base d128750597153bb5987e10b1c3493a34e5a4502a
E5-large 3d7cfbdacd47fdda877c5cd8a79fbcc4f2a574f3
BGE-M3 5617a9f61b028005a4858fdac845db406aefb181
Refactored 3 runners to:
from src.model_registry import MODELS_7_FROZEN, registry_sha_summary
MODELS = MODELS_7_FROZEN
Inline MODELS lists removed from Strategy D / E / F. Each runner's
run_meta now embeds `model_revisions: {model: sha}` so SHAs are recorded
in every results JSON for forensic reproducibility.
sentence-transformers >=5.5 accepts revision= in SentenceTransformer.__init__
(confirmed via inspect.signature on the installed 5.5.1). Embedding-level
reproducibility is additionally guaranteed by EmbeddingCache (.npz keyed
by (model_name, text_hash)).
Docs:
- experiments/README.md Reproducibility envelope bullet added with
the model-weight pinning policy + pointer to the refresh snippet.
Decisions log:
- planning/decisions.md: 2026-05-21 model registry entry.
Closes C3 from the 2026-05-21 pre-experiment review.
heznpc
added a commit
that referenced
this pull request
May 28, 2026
Critical fixes (paper claim integrity):
- V8: Strategy D/E/F sys.exit(2) on any failed model unless
Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
claim can no longer be silently invalidated by a model dropout.
- V1/V14: SentenceTransformerEmbedder.name now includes full repo path
(org/name → org__name) and `@<sha8>` revision suffix. C3 closure
(PR #7) now actually holds end-to-end against SHA bumps and across
org/name basename collisions.
- V2: run_cross_experiment_synthesis._normalize_results_envelope()
unwraps {_meta, results} so legacy consumers keep working with the
new D/E/F JSON shape; strategy_e and strategy_f added to known files.
- V3: paper §5.5 / Limitations "20 cells / four models / 20/20" drift
updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
(L463 method, L516 P2-resolution, L663 Limitations).
Statistical fixes (correctness):
- V5: compute_per_language_R_code substitutes NaN (not 1.0) when
d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
random_baseline_R aggregation. New null R range [1.0001, 1.0046]
(tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
- V6: permutation p-value uses (k+1)/(n+1) convention. No cell reports
literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
across all 70 D+F cells). Reviewer push-back surface closed.
Robustness / pitfall fixes:
- V7: Strategy D/E/F save JSON BEFORE generating figures; figures in
try/except. Multi-hour compute no longer lost to matplotlib fail.
- V9: MistralEmbedder Retry sets respect_retry_after_header=False;
bounded by backoff_factor=1 (~31s worst case), eliminating the
multi-hour silent stall mode on server-sent Retry-After.
- V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
regress on slow server-side processing.
- V11: Strategy E replaces categories[op_id] with categories.get() +
_label() helper. Empty test sets produce {skip:true} cells with
NaN accuracy instead of crashing on clf.predict(np.array([])).
- V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
duplicate list in the error message. 50/50 unique today; future
collision will fail loudly.
- V13: EmbeddingCache._key() switched from `|`-joined string to a
JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
to distinct keys.
- V18: SentenceTransformerEmbedder.dimension falls back to a single-text
encode probe when the deprecated get_sentence_embedding_dimension()
returns None. Nomic v1.5 no longer risks int(None) silent skip.
- V20: synthesis script counts aggregate as a 6th language → explicit
`if lang == "aggregate": continue` in the per-language counter.
Hygiene:
- V4: Strategy D datetime.datetime.utcnow() → datetime.now(datetime.UTC),
matching E/F and surviving future Python ≥3.13 removal.
Re-execution:
- D/E/F rerun successfully (7/7 models each, ~5 min wall time).
- 35/35 + multi-model P3 + 35/35 OOD all preserved.
- 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
was printed 1.07).
- OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.
Decisions log:
- planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
per-finding rationale and the re-execution outcome.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Added experiments/src/model_registry.py:
Strategy D / E / F, each pinned to the main commit SHA observed via
huggingface_hub.HfApi on 2026-05-21.
UniXcoder 5604afdc964f6c53782a6813140ade5216b99006
MiniLM-L12 e8f8c211226b894fcb81acc59f3b34ba3efd5f42
Nomic v1.5 e9b6763023c676ca8431644204f50c2b100d9aab
E5-small 614241f622f53c4eeff9890bdc4f31cfecc418b3
E5-base d128750597153bb5987e10b1c3493a34e5a4502a
E5-large 3d7cfbdacd47fdda877c5cd8a79fbcc4f2a574f3
BGE-M3 5617a9f61b028005a4858fdac845db406aefb181
Refactored 3 runners to:
from src.model_registry import MODELS_7_FROZEN, registry_sha_summary
MODELS = MODELS_7_FROZEN
Inline MODELS lists removed from Strategy D / E / F. Each runner's
run_meta now embeds
model_revisions: {model: sha}so SHAs are recordedin every results JSON for forensic reproducibility.
sentence-transformers >=5.5 accepts revision= in SentenceTransformer.init
(confirmed via inspect.signature on the installed 5.5.1). Embedding-level
reproducibility is additionally guaranteed by EmbeddingCache (.npz keyed
by (model_name, text_hash)).
Docs:
the model-weight pinning policy + pointer to the refresh snippet.
Decisions log:
Closes C3 from the 2026-05-21 pre-experiment review.