fix(smoke): template the --probe-smoke seed; fixes false BrokenPunctLoop on Gemma4 unified 12B 4-bit (#121) #124
Merged
… on gemma4-unified 4-bit
The --probe-smoke heuristic fed the model a bare instruction prompt with no
turn markers. Instruction-tuned snapshots can degenerate into a repeated filler
token on such input — the mlx-lm reference loader reproduces this identically,
and the mxfp8 gemma-4-12B build degenerates the same way (to '.'/'_'). The QAT
4-bit unified 12B snapshots happen to land on '1', which the classifier flags
as BrokenPunctLoop, while the served chat-template path generates correctly
("The capital of France is Paris.").
This was a false positive in the probe, not a 4-bit dequant bug: rMLX's step-0
logits match the oracle to within quant noise (top id 236770 '1', |max|~28.4 vs
28.5). The per-tensor mixed 4/8-bit quant overrides are resolved correctly.
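
The oracle comparison described above boils down to two checks on the step-0 logits: both runs pick the same top token id, and their peak magnitudes agree closely. A minimal sketch follows; the function names and the tolerance parameter are assumptions for illustration, not rMLX code, and the actual bisect may have compared more than this.

```rust
/// Index of the largest logit (the step-0 "top id"), or None for empty input.
pub fn top_id(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Largest absolute logit value (0.0 for empty input).
pub fn max_abs(logits: &[f32]) -> f32 {
    logits.iter().fold(0.0_f32, |m, x| m.max(x.abs()))
}

/// True when both runs pick the same top token and their peak magnitudes
/// differ by at most `tol`. `tol` is an assumed quant-noise budget chosen by
/// the caller, not a value taken from the project.
pub fn matches_oracle(ours: &[f32], oracle: &[f32], tol: f32) -> bool {
    top_id(ours).is_some()
        && top_id(ours) == top_id(oracle)
        && (max_abs(ours) - max_abs(oracle)).abs() <= tol
}
```

With peaks of 28.4 and 28.5 at the same index, a budget of a few tenths would accept the pair; a different top index would reject it regardless of magnitude.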
Fix generally: render the fixed smoke seed through the snapshot's real
chat_template.jinja when present (production-shaped, turn-structured input),
falling back to the bare seed for base/non-chat snapshots. New
chat_template::smoke_prompt_ids owns this; both the rmlx info probe and the
server --require-smoke-probe gate use it. run_smoke_probe gains an optional
caller-supplied prompt-ids override so rmlx-models stays free of the template
engine dep.
- crates/rmlx-server/src/chat_template.rs: smoke_prompt_ids + templated builder
- crates/rmlx-models/src/arch/loader.rs: run_smoke_probe prompt-ids override
- crates/rmlx-server/src/openai/state.rs: render templated probe prompt at gate
- crates/rmlx-cli/src/commands/info.rs: use templated smoke_prompt_ids
- chat_template_tests.rs: template-path + bare-seed fallback unit tests
- docs/CLI.md, docs/MODELS.md: document the templated probe + 4-bit text status
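
The split described in this commit (the probe accepts optional prompt ids, while template rendering lives on the server side) might look roughly like the sketch below. Everything here is a toy stand-in: `probe_input`, `templated_ids`, the byte-level `encode`, and the `{seed}` placeholder are hypothetical and only illustrate the shape of the override, not the real `run_smoke_probe` signature or Jinja rendering.

```rust
/// Bare seed the probe uses when nothing else is supplied.
pub const SMOKE_SEED: &str = "What is the capital of France?";

/// Toy tokenizer stand-in: one id per byte. A real snapshot tokenizer goes here.
pub fn encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// "Models-side" probe input: knows nothing about templates. A caller-supplied
/// id list wins; otherwise the bare seed is encoded.
pub fn probe_input(override_ids: Option<Vec<u32>>) -> Vec<u32> {
    override_ids.unwrap_or_else(|| encode(SMOKE_SEED))
}

/// "Server-side" helper: wraps the seed in turn markers if a template exists.
/// The `{seed}` placeholder is a toy substitute for real template rendering;
/// a template without it is treated as unusable.
pub fn templated_ids(template: Option<&str>) -> Option<Vec<u32>> {
    let t = template?;
    if !t.contains("{seed}") {
        return None;
    }
    Some(encode(&t.replace("{seed}", SMOKE_SEED)))
}
```

Calling `probe_input(templated_ids(template))` gives turn-structured input when a template is usable and the bare seed otherwise, while the probe side never needs a template-engine dependency.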
…fallback
The server --require-smoke-probe bare-seed fallback computed BOS as
token_to_id("<bos>").unwrap_or(2), a hardcoded string + magic id that
seeded the probe wrong for chat-template-less vocabs lacking <bos> (and,
passed as Some(ids), suppressed run_smoke_probe's own resolve_bos_id).
smoke_prompt_ids now returns Option<Vec<u32>> from the chat-template render
path only; when no usable template exists it returns None so each entry
point uses its own canonical BOS resolver: run_smoke_probe::resolve_bos_id
(server) and load_bos_id + arch::smoke_prompt_ids (CLI info). No token id is
invented in chat_template.rs.
Per the traceability rule, the templated->bare-seed fallback now emits a
debug! event carrying the failure reason (load/compile/render/encode/empty),
so a run's .jsonl shows whether the probe ran templated or fell back and why.
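
One way to carry the failure reason through the fallback is a small enum whose variants mirror the reasons listed above. The sketch below is hypothetical: the enum, the function names, and the message text are illustrative, and the log line is returned as a string so the example stays dependency-free, whereas the real code emits it through `debug!`.

```rust
use std::fmt;

/// Why the templated probe prompt could not be built. Variant names mirror
/// the reasons listed in the commit message; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FallbackReason {
    Load,
    Compile,
    Render,
    Encode,
    Empty,
}

impl fmt::Display for FallbackReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            Self::Load => "load",
            Self::Compile => "compile",
            Self::Render => "render",
            Self::Encode => "encode",
            Self::Empty => "empty",
        })
    }
}

/// Illustrative event text; the real message format is not shown in the source.
fn fallback_event(reason: FallbackReason) -> String {
    format!("smoke probe: templated prompt unavailable, using bare seed (reason={reason})")
}

/// Collapse a templating result into optional ids plus an optional log line.
/// An empty id list counts as a failure with reason `Empty`.
pub fn ids_or_fallback(
    result: Result<Vec<u32>, FallbackReason>,
) -> (Option<Vec<u32>>, Option<String>) {
    match result {
        Ok(ids) if !ids.is_empty() => (Some(ids), None),
        Ok(_) => (None, Some(fallback_event(FallbackReason::Empty))),
        Err(reason) => (None, Some(fallback_event(reason))),
    }
}
```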
Also fix a stale doc reference (ChatTemplate::from_template_string ->
ChatTemplate::new) and update the None-path unit test to assert no magic id
2 leaks.
Closes #121.
Finding: the model was never broken — the smoke probe was
Issue #121 reported Gemma4 unified 12B emitting degenerate output (`BrokenPunctLoop` repeating `'1'`, id 236770) under 4-bit weight quant, while the mxfp8 build of the same model passed. The "4-bit dequant mis-handles a unified weight" premise was falsified by an mlx-lm oracle bisect:

- On the probe's bare seed (`"What is the capital of France?"`, no chat turn markers), the mlx-lm reference loader produces the same step-0 top token (id 236770 `'1'`, logit 26.38) as rMLX. So rMLX's 4-bit forward is numerically faithful; there is no divergence.
- The mxfp8 build degenerates the same way, but to a `.`/`_` filler that slips past the `BrokenPunctLoop` classifier, whereas the 4-bit variants land on `'1'`, which trips it. Pure quant-noise token selection.
- The served chat-template path generates "The capital of France is Paris." correctly, and always did.
Root cause: `--probe-smoke` seeded an out-of-distribution bare instruction (no turn structure) into an instruction-tuned chat model. The verdict, not the model, was wrong.
Fix (general)
Render the smoke seed through the snapshot's own `chat_template.jinja` when present (production-shaped, turn-structured input, byte-identical to the real chat path: `add_generation_prompt: true`, same BOS source, `add_special_tokens=false`), falling back to the bare seed for base/non-chat snapshots. The classifier is unchanged, so genuinely broken models are still rejected.
- `rmlx_server::chat_template::smoke_prompt_ids` → `Option<Vec<u32>>` (templated path only; `None` when no usable template).
- `run_smoke_probe` gained an optional caller-supplied prompt-ids override so `rmlx-models` stays free of the template-engine dep.
- Both entry points (`rmlx info`, server `--require-smoke-probe`) feed the templated prompt; on `None` each uses its own canonical BOS resolver (`bos_token → <bos> → <|im_start|> → eos_token → <|endoftext|>`), with no hardcoded `<bos>`/magic id (review fix).
- The templated-to-bare-seed fallback emits a `tracing::debug!` with the reason (traceability hard rule).
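
The resolver order quoted above can be read as a first-match search over candidate tokens. The sketch below is one interpretation under stated assumptions: `Vocab` and its fields are hypothetical stand-ins for the tokenizer, `bos_token`/`eos_token` are taken to mean the configured special-token strings, and what the real resolvers do when nothing matches is not stated in the source (here it is simply `None`).

```rust
use std::collections::HashMap;

/// Toy tokenizer view; all fields are hypothetical stand-ins.
pub struct Vocab {
    /// Configured BOS token string, if the tokenizer config names one.
    pub bos_token: Option<String>,
    /// Configured EOS token string, if the tokenizer config names one.
    pub eos_token: Option<String>,
    /// Token string to id mapping.
    pub ids: HashMap<String, u32>,
}

/// First candidate that exists in the vocab wins, tried in the order
/// bos_token, "<bos>", "<|im_start|>", eos_token, "<|endoftext|>".
/// No id is invented: if nothing matches, the result is None.
pub fn resolve_bos_id(v: &Vocab) -> Option<u32> {
    let candidates = [
        v.bos_token.as_deref(),
        Some("<bos>"),
        Some("<|im_start|>"),
        v.eos_token.as_deref(),
        Some("<|endoftext|>"),
    ];
    candidates
        .iter()
        .flatten() // skip unset configured tokens
        .find_map(|tok| v.ids.get(*tok).copied())
}
```

Returning `Option<u32>` keeps the "no magic id" property explicit: the caller has to decide what an unresolved BOS means instead of silently receiving a default.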
Proof (real models, single-MLX)
`rmlx info --probe-smoke`:

- `gemma-4-12B-it-qat-4bit`: `BrokenPunctLoop{1}`
- `gemma-4-12B-it-qat-mxfp4`: `BrokenPunctLoop{1}`
- `gemma-4-12B-it-mxfp8`
- `gemma-4-E4B-it-qat-4bit`
- `gemma-4-26B-A4B-it-qat-4bit`

All controls now produce coherent Paris-class output (stronger than the prior benign-token-by-luck pass). No regression.
`make lint` (`-D warnings`) and `make test` green; reviewed by the rust-reviewer agent (BOS + traceability findings fixed in the second commit).
🤖 Generated with Claude Code