fix(smoke): template the --probe-smoke seed; fixes false BrokenPunctLoop on Gemma4 unified 12B 4-bit (#121)#124

Merged
Pushkinist merged 2 commits into main from fix/121-gemma4-unified-4bit-degenerate
Jun 17, 2026
@Pushkinist

Owner

Closes #121.

Finding: the model was never broken — the smoke probe was

Issue #121 reported Gemma4 unified 12B emitting degenerate output
(BrokenPunctLoop repeating '1', id 236770) under 4-bit weight quant, while
the mxfp8 build of the same model passed. The "4-bit dequant mis-handles a
unified weight" premise was falsified by an mlx-lm oracle bisect:

  • On the identical bare seed prompt ("What is the capital of France?", no
    chat turn markers), the mlx-lm reference loader produces the same step-0
    top token — id 236770 '1', logit 26.38 — as rMLX. So rMLX's 4-bit forward is
    numerically faithful; there is no divergence.
  • The mxfp8 build also degenerates on that bare prompt — it just lands on ./_
    filler that slips past the BrokenPunctLoop classifier, whereas the 4-bit
    variants land on '1' which trips it. Pure quant-noise token selection.
  • Served via the chat template, all variants (4bit / mxfp4 / mxfp8) answer
    "The capital of France is Paris." correctly — and always did.

Root cause: --probe-smoke seeded an out-of-distribution bare instruction
(no turn structure) into an instruction-tuned chat model. The verdict, not the
model, was wrong.

Fix (general)

Render the smoke seed through the snapshot's own chat_template.jinja when
present (production-shaped, turn-structured input — byte-identical to the real
chat path: add_generation_prompt: true, same BOS source, add_special_tokens=false),
falling back to the bare seed for base/non-chat snapshots. The classifier is
unchanged, so genuinely broken models are still rejected.

  • New rmlx_server::chat_template::smoke_prompt_ids → Option<Vec<u32>>
    (templated path only; None when no usable template).
  • run_smoke_probe gained an optional caller-supplied prompt-ids override so
    rmlx-models stays free of the template-engine dep.
  • Both probe entry points (CLI info, server --require-smoke-probe) feed the
    templated prompt; on None each uses its own canonical BOS resolver
    (bos_token → <bos> → <|im_start|> → eos_token → <|endoftext|>) — no hardcoded
    <bos>/magic id (review fix).
  • Silent template-render fallback now emits a tracing::debug! with the reason
    (traceability hard rule).
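
The BOS fallback order listed above can be sketched as a first-match lookup. This is a minimal illustration, not the code from this PR: the function name matches the `resolve_bos_id` mentioned in the commit message, but its signature, the `HashMap` vocabulary, and the way the configured `bos_token` / `eos_token` strings are passed in are all assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of a canonical BOS resolver.
/// Candidate order follows the PR description:
/// bos_token -> <bos> -> <|im_start|> -> eos_token -> <|endoftext|>.
/// `bos_token` / `eos_token` are the strings configured for the snapshot, if any
/// (how they are obtained is an assumption and is not shown here).
fn resolve_bos_id(
    vocab: &HashMap<String, u32>,
    bos_token: Option<&str>,
    eos_token: Option<&str>,
) -> Option<u32> {
    let candidates = [
        bos_token,
        Some("<bos>"),
        Some("<|im_start|>"),
        eos_token,
        Some("<|endoftext|>"),
    ];
    candidates
        .iter()
        .copied()
        .flatten() // skip candidates that are not configured
        .find_map(|tok| vocab.get(tok).copied()) // first one present in the vocab wins
}
```

Returning `Option<u32>` rather than a default id keeps the "no magic id" property: if nothing matches, the caller decides what to do instead of silently receiving a made-up token.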

Proof (real models, single-MLX)

rmlx info --probe-smoke:

| snapshot | arch | quant | before | after |
|---|---|---|---|---|
| gemma-4-12B-it-qat-4bit | unified | affine-4bit | BrokenPunctLoop{1} | Ok → "…Paris." |
| gemma-4-12B-it-qat-mxfp4 | unified | mxfp4 | BrokenPunctLoop{1} | Ok → "…Paris." |
| gemma-4-12B-it-mxfp8 | unified | mxfp8 | Ok (benign token) | Ok → "…Paris" |
| gemma-4-E4B-it-qat-4bit | gemma4 | affine-4bit | Ok | Ok → "…Paris" |
| gemma-4-26B-A4B-it-qat-4bit | gemma4 MoE | affine-4bit | Ok | Ok → "…Paris" |

All controls now produce coherent Paris-class output (stronger than the prior
benign-token-by-luck pass). No regression. make lint (-D warnings) and
make test green; reviewed by the rust-reviewer agent (BOS + traceability
findings fixed in the second commit).

🤖 Generated with Claude Code

… on gemma4-unified 4-bit

The --probe-smoke heuristic fed the model a bare instruction prompt with no
turn markers. Instruction-tuned snapshots can degenerate into a repeated filler
token on such input — the mlx-lm reference loader reproduces this identically,
and the mxfp8 gemma-4-12B build degenerates the same way (to '.'/'_'). The QAT
4-bit unified 12B snapshots happen to land on '1', which the classifier flags
as BrokenPunctLoop, while the served chat-template path generates correctly
("The capital of France is Paris.").

This was a false positive in the probe, not a 4-bit dequant bug: rMLX's step-0
logits match the oracle to within quant noise (top id 236770 '1', |max|~28.4 vs
28.5). The per-tensor mixed 4/8-bit quant overrides are resolved correctly.

Fix generally: render the fixed smoke seed through the snapshot's real
chat_template.jinja when present (production-shaped, turn-structured input),
falling back to the bare seed for base/non-chat snapshots. New
chat_template::smoke_prompt_ids owns this; both the rmlx info probe and the
server --require-smoke-probe gate use it. run_smoke_probe gains an optional
caller-supplied prompt-ids override so rmlx-models stays free of the template
engine dep.

- crates/rmlx-server/src/chat_template.rs: smoke_prompt_ids + templated builder
- crates/rmlx-models/src/arch/loader.rs: run_smoke_probe prompt-ids override
- crates/rmlx-server/src/openai/state.rs: render templated probe prompt at gate
- crates/rmlx-cli/src/commands/info.rs: use templated smoke_prompt_ids
- chat_template_tests.rs: template-path + bare-seed fallback unit tests
- docs/CLI.md, docs/MODELS.md: document the templated probe + 4-bit text status
…fallback

The server --require-smoke-probe bare-seed fallback computed BOS as
token_to_id("<bos>").unwrap_or(2), a hardcoded string plus a magic id that
seeded the probe incorrectly for chat-template-less vocabs lacking <bos>.
Because the result was passed as Some(ids), it also suppressed
run_smoke_probe's own resolve_bos_id.

smoke_prompt_ids now returns Option<Vec<u32>> from the chat-template render
path only; when no usable template exists it returns None so each entry
point uses its own canonical BOS resolver: run_smoke_probe::resolve_bos_id
(server) and load_bos_id + arch::smoke_prompt_ids (CLI info). No token id is
invented in chat_template.rs.
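
A rough sketch of the resulting control flow, under stated assumptions: `choose_probe_prompt` is a hypothetical helper that does not exist in the repository, and prepending the resolved BOS id to the bare seed is an assumption about how the fallback prompt is assembled. It only illustrates why returning None, rather than Some(ids) built around an invented id, leaves BOS resolution with the entry point.

```rust
/// Hypothetical helper: pick the token ids fed to the smoke probe.
/// `templated` is what a chat-template render produced (None when there is no
/// usable template); `resolved_bos` comes from the entry point's own resolver.
fn choose_probe_prompt(
    templated: Option<Vec<u32>>,
    resolved_bos: Option<u32>,
    bare_seed: &[u32],
) -> Vec<u32> {
    match templated {
        // Templated path: use the rendered ids as-is; the template owns BOS.
        Some(ids) => ids,
        // Fallback path: bare seed, prefixed with BOS only if one was resolved.
        None => {
            let mut ids = Vec::with_capacity(bare_seed.len() + 1);
            ids.extend(resolved_bos); // Option<u32> yields zero or one id
            ids.extend_from_slice(bare_seed);
            ids
        }
    }
}
```

With this shape there is no branch in which a default such as 2 can appear: either the template supplies the ids, or the resolver supplies a real vocabulary id, or no BOS is added.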

Per the traceability rule, the templated->bare-seed fallback now emits a
debug! event carrying the failure reason (load/compile/render/encode/empty),
so a run's .jsonl shows whether the probe ran templated or fell back and why.
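
One way to carry the failure reason is a small enum, sketched below. The five variants mirror the reasons named above (load/compile/render/encode/empty), but the type name `TemplateFallback`, the function `smoke_ids_or_fallback`, and the message format are hypothetical. To stay dependency-free the sketch takes a logging closure; in the real code path the commit describes a `debug!` event instead.

```rust
use std::fmt;

/// Hypothetical reason codes for falling back from the templated probe prompt.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TemplateFallback {
    Load,
    Compile,
    Render,
    Encode,
    Empty,
}

impl fmt::Display for TemplateFallback {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let s = match self {
            TemplateFallback::Load => "load",
            TemplateFallback::Compile => "compile",
            TemplateFallback::Render => "render",
            TemplateFallback::Encode => "encode",
            TemplateFallback::Empty => "empty",
        };
        f.write_str(s)
    }
}

/// Turn a render result into Option<Vec<u32>>, reporting why a fallback happened.
/// `log` stands in for a debug-level tracing event.
fn smoke_ids_or_fallback(
    rendered: Result<Vec<u32>, TemplateFallback>,
    mut log: impl FnMut(&str),
) -> Option<Vec<u32>> {
    // An empty encoding is treated as a failure too, so it gets its own reason.
    let outcome = rendered.and_then(|ids| {
        if ids.is_empty() {
            Err(TemplateFallback::Empty)
        } else {
            Ok(ids)
        }
    });
    match outcome {
        Ok(ids) => Some(ids),
        Err(reason) => {
            log(&format!("smoke probe: falling back to bare seed (reason={})", reason));
            None
        }
    }
}
```

The point of the design is that the None returned to callers is never silent: every path that produces it passes through the single logging site with a concrete reason.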

Also fix a stale doc reference (ChatTemplate::from_template_string ->
ChatTemplate::new) and update the None-path unit test to assert no magic id
2 leaks.
@Pushkinist Pushkinist merged commit e1bf8a3 into main Jun 17, 2026
2 checks passed
@Pushkinist Pushkinist deleted the fix/121-gemma4-unified-4bit-degenerate branch June 17, 2026 11:09
Development

Successfully merging this pull request may close these issues.

Gemma4 unified 12B: degenerate output under 4-bit weight quant (affine-4bit + mxfp4); mxfp8 OK
