docs+experiments(z-gap): pre-experiment review fixes (C1/C2/C3 + M1/M2/M3 + M4) #3
Merged
…2/M3 + M4)

Critical (paper integrity):
- C1 pretraining contamination caveat: new paragraph in paper §5.5 NL-Code
  Alignment + Limitations bullet. R_code > 1 reframed as "at least as strong
  as pretraining co-occurrence statistics would predict", not as independent
  evidence for Z_sem convergence beyond training-data overlap. Decisive
  separation deferred to tier2/tier3 OOD stimuli.
- C2 random-matching baseline framing: §5.5 protocol sentence now explicitly
  identifies permutation test (n=10,000) as the random-matching baseline with
  null R ≈ 1. compute_per_language_R_code() now exports the null distribution
  mean/std/p95 to results JSON.
- C3 HuggingFace revision policy: documented in
  run_strategy_d_code_alignment.py header. Pilot accepts floating-main risk
  and relies on EmbeddingCache for embedding-level reproducibility; explicit
  revision= pin deferred as Minor TODO.

Major:
- M1 stimulus complexity: new Limitations paragraph stating conclusions apply
  to stdlib-idiom-level operations only.
- M2 translation provenance: new Limitations paragraph stating no formal IAA;
  translations were first-author + LLM-assisted + bilingual review.
- M3 model robustness wrap: per-model try/except in run loop; OOM /
  trust-remote-code / network failure of one model skips that cell instead of
  aborting the 7-model sweep.
- M4 prior art: web-search confirmed no per-language × per-model NL-code
  matrix exists; "to our knowledge, first" qualifier added to §5.5.

Strategy D extension (this session's experiment):
- MODELS extended 4 -> 7: + E5-small, E5-base, BGE-M3. M5 (P3 multi-model
  probing) deferred to follow-up PR. M6 (Codestral Embed) excluded: no
  MISTRAL_API_KEY in this session.
- Run meta block (started/finished UTC, Python/torch/sentence-transformers
  versions, seed, n_perm, n_boot, failed_models) written to results JSON.

Decisions log:
- planning/decisions.md: 2026-05-21 entry documenting all C/M fixes and the
  scope choices for M5/M6.
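For readers who want a picture of the M3 wrap and the run meta block, here is a minimal sketch, not the code in run_strategy_d_code_alignment.py. The names `run_sweep` and `fake_evaluate`, the error-record layout, and the keys `started_utc` / `finished_utc` / `python` are assumptions made for illustration; only `seed`, `n_perm`, `n_boot`, and `failed_models` are taken from the commit message. The torch and sentence-transformers versions are left out so that the sketch runs without those packages.

```python
import json
import platform
from datetime import datetime, timezone


def run_sweep(models, evaluate, seed=0, n_perm=10_000, n_boot=10_000):
    """Evaluate each model independently; a failure skips only that model."""
    started = datetime.now(timezone.utc).isoformat()
    results, failed = {}, []
    for name in models:
        try:
            results[name] = evaluate(name)
        except Exception as exc:
            # Deliberately broad: OOM, trust-remote-code refusals, and network
            # errors all surface as Exception subclasses. Record and move on.
            failed.append({"model": name, "error": f"{type(exc).__name__}: {exc}"})
    meta = {
        "started_utc": started,  # hypothetical key name
        "finished_utc": datetime.now(timezone.utc).isoformat(),  # hypothetical key name
        "python": platform.python_version(),
        "seed": seed,
        "n_perm": n_perm,
        "n_boot": n_boot,
        "failed_models": failed,
    }
    return {"meta": meta, "results": results}


def fake_evaluate(name):
    """Hypothetical stand-in for the real per-model evaluation."""
    if name == "broken-model":
        raise RuntimeError("simulated out-of-memory")
    return {"status": "ok"}  # placeholder, not a real result


out = run_sweep(["model-a", "broken-model", "model-b"], fake_evaluate)
print(json.dumps(out["meta"]["failed_models"], indent=2))
```

Catching `Exception` rather than a narrower type is a trade-off: it keeps the sweep alive for any single-model failure, at the cost of also swallowing genuine bugs, which is why each failure is written to `failed_models` instead of being silently dropped.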
heznpc added a commit that referenced this pull request on May 20, 2026
… falsified (#6)

C1 deferred portion (OOD test for the contamination caveat from PR #3) now
closed. The pre-registered prediction was that multi-step / compositional
OOD operations should show LOWER R_code than tier-1 stdlib 1-liners if the
tier-1 effect was primarily pretraining memorization. Observed direction is
the opposite: every model shows STRONGER alignment on OOD.

New runner: experiments/scripts/run_strategy_f_ood_alignment.py
- 50 OOD ops: 30 tier-2 multi-step (binary_search, BFS, merge_sort, ...) +
  20 tier-3 compositional (bellman_ford, topological_sort, A*, ...)
- Same 7-model set, same statistics (permutation n=10k + bootstrap n=10k +
  Holm-Bonferroni) as Strategy D.

Results (OOD aggregate vs tier-1 aggregate):

  Model               tier1   OOD     Δ
  ──────────────────────────────────────
  UniXcoder (code)    1.07    1.15   +0.08
  MiniLM-L12 (NL)     1.16    1.31   +0.15
  Nomic v1.5          1.07    1.16   +0.09
  E5-small (NL)       1.13    1.28   +0.15
  E5-base (NL)        1.14    1.31   +0.17
  E5-large (NL)       1.20    1.33   +0.13
  BGE-M3 (NL+code)    1.16    1.36   +0.20

  35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)
  Cohen's d up to 4.12 (en, E5-large)
  Permutation-null R in [1.004, 1.008]

Interpretation: multi-step algorithm NL descriptions are longer and more
distinctive (mean 180 chars vs 55 for tier-1), and multi-line function
bodies are stronger signal carriers than 1-liners. The embedding alignment
exploits this richer surface form rather than being damaged by reduced
co-occurrence frequency. NL-code alignment is NOT primarily
memorization-driven.

Paper:
- §5.5 contamination caveat: "left to future work" framing removed; now
  points to OOD experiment below.
- §5.5 new "Out-of-distribution NL-code alignment" paragraph + 7×5 OOD
  table + tier1↔OOD aggregate comparison + interpretation.
- Limitations "Pretraining contamination" bullet renamed to "(partially
  addressed)" with summary of OOD result; residual matched-perplexity work
  remains future.

Decisions log:
- planning/decisions.md: 2026-05-21 Strategy F entry covering the
  pre-registered hypothesis structure (recorded in the runner docstring
  before running, not post-hoc), the observed direction, and the resulting
  paper revisions.

Closes C1 deferred portion. C3 (revision SHA pin) + Minor TODOs remain.
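The "35/35 OOD cells significant (p < 0.05 Holm-Bonferroni)" line relies on the Holm step-down correction. The sketch below shows the standard procedure in plain Python so the criterion is easy to check by hand. It is not taken from run_strategy_f_ood_alignment.py: the function name `holm_bonferroni` is an assumption, and the p-values are toy numbers, not results from this experiment.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-values, reject flags) using Holm's step-down method."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value may not fall below
        # the adjusted value of any smaller raw p-value.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Toy p-values for illustration only (not experiment output).
toy_p = [0.01, 0.04, 0.03, 0.005]
adjusted, reject = holm_bonferroni(toy_p)
print(adjusted)
print(reject)
```

With the toy input, the two smallest p-values survive the correction and the other two do not, even though all four are below 0.05 uncorrected. That gap is what the correction guards against across a 7×5 grid of cells.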