experiments(z-gap): frozen model registry with HuggingFace SHAs (closes C3)#7

Merged
heznpc merged 1 commit into main from chore/model-registry-sha-pin-2026-05-21
May 20, 2026
@heznpc heznpc commented May 20, 2026

Owner

Added experiments/src/model_registry.py:

  • MODELS_7_FROZEN: 7 (model_name, label, kwargs) tuples used by
    Strategy D / E / F, each pinned to the main commit SHA observed via
    huggingface_hub.HfApi on 2026-05-21.
  • registry_sha_summary(): JSON-serializable mapping for run_meta blocks.

UniXcoder 5604afdc964f6c53782a6813140ade5216b99006
MiniLM-L12 e8f8c211226b894fcb81acc59f3b34ba3efd5f42
Nomic v1.5 e9b6763023c676ca8431644204f50c2b100d9aab
E5-small 614241f622f53c4eeff9890bdc4f31cfecc418b3
E5-base d128750597153bb5987e10b1c3493a34e5a4502a
E5-large 3d7cfbdacd47fdda877c5cd8a79fbcc4f2a574f3
BGE-M3 5617a9f61b028005a4858fdac845db406aefb181
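A minimal sketch of what such a registry could look like, assuming the `(model_name, label, kwargs)` tuple layout described above. The two repo IDs and the name `MODELS_SKETCH` are illustrative assumptions (the description lists only labels and SHAs); the labels and SHAs are copied from the list above.

```python
import json

# Hypothetical two-entry excerpt; the real MODELS_7_FROZEN holds seven tuples.
# Repo IDs are assumed for illustration only.
MODELS_SKETCH = [
    ("microsoft/unixcoder-base", "UniXcoder",
     {"revision": "5604afdc964f6c53782a6813140ade5216b99006"}),
    ("BAAI/bge-m3", "BGE-M3",
     {"revision": "5617a9f61b028005a4858fdac845db406aefb181"}),
]


def registry_sha_summary(models=MODELS_SKETCH):
    """Return a JSON-serializable {model_name: sha} mapping."""
    return {name: kwargs["revision"] for name, _label, kwargs in models}


if __name__ == "__main__":
    print(json.dumps(registry_sha_summary(), indent=2))
```

Keeping the revision inside `kwargs` means a runner can forward the same dictionary to the model loader, so the SHA that is recorded is the SHA that was requested.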

Refactored 3 runners to use:

  from src.model_registry import MODELS_7_FROZEN, registry_sha_summary
  MODELS = MODELS_7_FROZEN

Inline MODELS lists removed from Strategy D / E / F. Each runner's
run_meta now embeds `model_revisions: {model: sha}` so SHAs are recorded
in every results JSON for forensic reproducibility.
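As a rough illustration of how a runner might embed the revisions, the sketch below wraps results together with a `run_meta` block. The function name `attach_run_meta`, the top-level key names, and the `strategy` field are assumptions; only `model_revisions: {model: sha}` comes from the description above.

```python
import json


def attach_run_meta(results, model_revisions, strategy):
    """Bundle results with a run_meta block that records the pinned SHAs.

    The envelope layout here is assumed, not taken from the runners.
    """
    run_meta = {
        "strategy": strategy,
        "model_revisions": dict(model_revisions),  # {model: sha}
    }
    return {"run_meta": run_meta, "results": results}


if __name__ == "__main__":
    payload = attach_run_meta(
        results={"example_cell": 0.0},  # placeholder, not real output
        model_revisions={"example-org/example-model": "0" * 40},
        strategy="D",
    )
    print(json.dumps(payload, indent=2))
```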

sentence-transformers >=5.5 accepts revision= in SentenceTransformer.__init__
(confirmed via inspect.signature on the installed 5.5.1). Embedding-level
reproducibility is additionally guaranteed by EmbeddingCache (.npz keyed
by (model_name, text_hash)).
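The `inspect.signature` check mentioned above can be sketched as follows. The helper name `accepts_parameter` and the stand-in class are hypothetical; the stand-in only exists so the snippet runs without sentence-transformers installed.

```python
import inspect


def accepts_parameter(cls, name):
    """Return True if cls.__init__ explicitly declares a parameter `name`."""
    return name in inspect.signature(cls.__init__).parameters


class StandInEncoder:
    """Hypothetical stand-in for a model class whose __init__ takes revision=."""

    def __init__(self, model_name, revision=None):
        self.model_name = model_name
        self.revision = revision


if __name__ == "__main__":
    print(accepts_parameter(StandInEncoder, "revision"))
```

With sentence-transformers installed, the same helper could be called as `accepts_parameter(SentenceTransformer, "revision")` to repeat the check against the installed version.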

Docs:

  • experiments/README.md Reproducibility envelope bullet added with
    the model-weight pinning policy + pointer to the refresh snippet.

Decisions log:

  • planning/decisions.md: 2026-05-21 model registry entry.

Closes C3 from the 2026-05-21 pre-experiment review.

@heznpc heznpc merged commit 638a9a0 into main May 20, 2026
@heznpc heznpc deleted the chore/model-registry-sha-pin-2026-05-21 branch May 20, 2026 18:08
heznpc added a commit that referenced this pull request May 28, 2026
Critical fixes (paper claim integrity):
- V8: Strategy D/E/F sys.exit(2) on any failed model unless
      Z_GAP_ALLOW_PARTIAL_RESULTS=1. Holm-Bonferroni's "across 35 cells"
      claim can no longer be silently invalidated by a model dropout.
- V1/V14: SentenceTransformerEmbedder.name now includes full repo path
      (org/name → org__name) and `@<sha8>` revision suffix. C3 closure
      (PR #7) now actually holds end-to-end against SHA bumps and across
      org/name basename collisions.
- V2:  run_cross_experiment_synthesis._normalize_results_envelope()
      unwraps {_meta, results} so legacy consumers keep working with the
      new D/E/F JSON shape; strategy_e and strategy_f added to known files.
- V3:  paper §5.5 / Limitations "20 cells / four models / 20/20" drift
      updated to "35 cells / seven models / 35/35 + OOD 35/35" in 3 places
      (L463 method, L516 P2-resolution, L663 Limitations).
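The naming scheme from V1/V14 can be sketched as below. The function `embedder_name` is a hypothetical stand-in for the `SentenceTransformerEmbedder.name` logic, assuming the `org/name → org__name` mapping and an 8-character `@<sha8>` suffix as described; the repo IDs in the usage lines are placeholders.

```python
from typing import Optional


def embedder_name(repo_id: str, revision: Optional[str] = None) -> str:
    """Build a cache-safe name: org/name -> org__name, plus @<sha8> if pinned."""
    name = repo_id.replace("/", "__")
    if revision:
        name = f"{name}@{revision[:8]}"
    return name


if __name__ == "__main__":
    # Same basename under two orgs no longer collides.
    print(embedder_name("org-a/encoder", "0123456789abcdef"))
    print(embedder_name("org-b/encoder", "0123456789abcdef"))
```

Because the name feeds the cache key, a SHA bump changes the name and therefore cannot silently reuse embeddings computed with the old weights.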

Statistical fixes (correctness):
- V5:  compute_per_language_R_code substitutes NaN (not 1.0) when
      d_match_perm is empty or bootstrap mean_m ≤ 1e-10; np.nanmean for
      random_baseline_R aggregation. New null R range [1.0001, 1.0046]
      (tier1) / [1.0005, 1.0086] (OOD), unbiased by silent 1.0 imputations.
- V6:  permutation p-value uses (k+1)/(n+1) convention. No cell reports
      literal 0.0 anymore (verified post-rerun: min nonzero p = 0.0001
      across all 70 D+F cells). Reviewer push-back surface closed.
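A minimal sketch of the `(k+1)/(n+1)` convention from V6, assuming a one-sided upper-tail test in which `k` counts permuted statistics at least as large as the observed one. The function name and the tail direction are assumptions, not taken from the runners.

```python
def permutation_p_value(observed, null_stats):
    """One-sided permutation p-value using the (k+1)/(n+1) convention.

    k = number of permuted statistics >= observed, n = number of permutations.
    The +1 terms count the observed statistic as one draw from the null,
    so the result is never exactly 0.
    """
    n = len(null_stats)
    k = sum(1 for s in null_stats if s >= observed)
    return (k + 1) / (n + 1)


if __name__ == "__main__":
    null = [0.1, 0.2, 0.3]
    print(permutation_p_value(0.5, null))  # k = 0, so (0 + 1) / (3 + 1)
    print(permutation_p_value(0.2, null))  # k = 2, so (2 + 1) / (3 + 1)
```

Under this convention the smallest attainable value is `1/(n+1)`; for example, 9,999 permutations would give a floor of 1/10,000 = 0.0001.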

Robustness / pitfall fixes:
- V7:  Strategy D/E/F save JSON BEFORE generating figures; figures in
      try/except. Multi-hour compute no longer lost to matplotlib fail.
- V9:  MistralEmbedder Retry sets respect_retry_after_header=False;
      bounded by backoff_factor=1 (~31s worst case), eliminating the
      multi-hour silent stall mode on server-sent Retry-After.
- V10: OpenAI client timeout 60s → 300s; legacy batch callers no longer
      regress on slow server-side processing.
- V11: Strategy E replaces categories[op_id] with categories.get() +
      _label() helper. Empty test sets produce {skip:true} cells with
      NaN accuracy instead of crashing on clf.predict(np.array([])).
- V12: load_ood_stimuli() asserts tier2/tier3 op_id uniqueness with the
      duplicate list in the error message. 50/50 unique today; future
      collision will fail loudly.
- V13: EmbeddingCache._key() switched from `|`-joined string to a
      JSON-encoded payload hash. ['a|b','c'] and ['a','b|c'] now hash
      to distinct keys.
- V18: SentenceTransformerEmbedder.dimension falls back to a single-text
      encode probe when the deprecated get_sentence_embedding_dimension()
      returns None. Nomic v1.5 no longer risks int(None) silent skip.
- V20: synthesis script counted aggregate as a 6th language; fixed with an
      explicit `if lang == "aggregate": continue` in the per-language counter.
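The V13 collision can be reproduced with a short sketch. Both key functions are hypothetical reconstructions: the use of SHA-256 and the exact payload layout are assumptions, and only the `|`-joined versus JSON-encoded distinction comes from the note above.

```python
import hashlib
import json


def pipe_key(model_name, texts):
    """Old-style key: fields joined with '|', so '|' inside a text is ambiguous."""
    joined = "|".join([model_name, *texts])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def json_key(model_name, texts):
    """New-style key: JSON encoding preserves the boundaries between texts."""
    payload = json.dumps(
        {"model": model_name, "texts": list(texts)},
        ensure_ascii=False,
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a, b = ["a|b", "c"], ["a", "b|c"]
    print(pipe_key("m", a) == pipe_key("m", b))  # both join to "m|a|b|c"
    print(json_key("m", a) == json_key("m", b))
```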

Hygiene:
- V4:  Strategy D datetime.datetime.utcnow() → datetime.datetime.now(datetime.UTC),
      matching E/F and avoiding utcnow(), which is deprecated since Python 3.12.
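For reference, a timezone-aware replacement for `utcnow()` can be sketched as below. It uses `timezone.utc`, which is the same object as the `datetime.UTC` alias added in Python 3.11; the function name `utc_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def utc_timestamp():
    """Return the current time as an ISO 8601 string with an explicit UTC offset."""
    return datetime.now(timezone.utc).isoformat()


if __name__ == "__main__":
    print(utc_timestamp())
```

Unlike `utcnow()`, the returned value is timezone-aware, so the serialized string carries a `+00:00` offset.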

Re-execution:
- D/E/F rerun successfully (7/7 models each, ~5 min wall time).
- 35/35 + multi-model P3 + 35/35 OOD all preserved.
- 2-decimal R_code values unchanged except UniXcoder tier1 (1.0649 ≈ 1.06,
  was printed 1.07).
- OOD Cohen's d_max E5-large 4.12 → E5-base 4.42; paper updated.

Decisions log:
- planning/decisions.md: 2026-05-21 entry covering all 15 fixes with
  per-finding rationale and the re-execution outcome.