Fix detokenizer for byte-level tokenizers with an SPM-style decoder (fixes #1041) by robertlangdonn · Pull Request #1329 · ml-explore/mlx-lm

robertlangdonn · 2026-05-30T09:40:32Z

Summary

mlx_lm.server emits the byte-level space marker Ġ (U+0120) instead of spaces for Mistral's tekken v13 tokenizers (Devstral-Small-2, etc.):

before: Hello!ĠHowĠcanĠIĠassistĠyouĠtoday?
after:  Hello! How can I assist you today?

Root cause

tokenizer_utils.load() selects the streaming detokenizer by inspecting only the decoder field of tokenizer.json. The tekken v13 tokenizer is a hybrid:

its decoder is the classic SentencePiece sequence (Replace ▁→space, ByteFallback, Fuse, Strip), so it matches _is_spm_decoder and is routed to SPMStreamingDetokenizer;
but its pre_tokenizer is ByteLevel, so the vocabulary actually uses GPT-2-style byte markers (Ġ for spaces).

SPMStreamingDetokenizer only strips the ▁ marker, so the Ġ markers pass through untouched. (Notably, transformers' own tokenizer.decode() also mis-decodes this tokenizer; BPEStreamingDetokenizer produces correct output where HF does not.)
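To make the failure concrete, here is a minimal sketch of the two decoding strategies. It does not reproduce the mlx-lm classes. `spm_style_decode` and `byte_level_decode` are hypothetical helpers, and the token split in `pieces` is invented for illustration. The byte table follows GPT-2's published `bytes_to_unicode` scheme, under which byte 0x20 (space) is rendered as Ġ (U+0120).

```python
def bytes_to_unicode():
    """GPT-2's reversible table from raw bytes to printable characters."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {}
    n = 0
    for b in range(256):
        if b in printable:
            table[b] = chr(b)
        else:
            # Non-printable bytes are shifted above U+00FF, in order.
            table[b] = chr(256 + n)
            n += 1
    return table


# Inverse table: printable character -> original byte value.
BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}


def spm_style_decode(pieces):
    # Simplified SPM handling: only the SentencePiece marker is replaced.
    return "".join(pieces).replace("\u2581", " ")


def byte_level_decode(pieces):
    # Map every character back to its byte, then decode the bytes as UTF-8.
    text = "".join(pieces)
    return bytes(BYTE_DECODER[c] for c in text).decode("utf-8", errors="replace")


# Hypothetical token pieces from a byte-level vocabulary (\u0120 is Ġ).
pieces = ["Hello", "!", "\u0120How", "\u0120can", "\u0120I",
          "\u0120assist", "\u0120you", "\u0120today", "?"]

print(spm_style_decode(pieces))
print(byte_level_decode(pieces))
```

The SPM-style path leaves every Ġ in place because it never looks for that marker, while the byte-level path recovers the spaces.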

Fix

A ByteLevel pre-tokenizer is authoritative: the vocabulary is byte-level and must use BPEStreamingDetokenizer, regardless of the decoder shape. This PR adds _has_byte_level_pretokenizer() and makes load() prefer the BPE detokenizer when a ByteLevel pre-tokenizer (or a ByteLevel decoder) is present. Genuine SPM tokenizers (no ByteLevel pre-tokenizer) are unaffected.
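The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `has_byte_level`, `is_spm_decoder`, and `choose_detokenizer` are hypothetical names, the function returns class names as plain strings, and the two tokenizer.json fragments are hand-written stand-ins rather than the real tekken v13 file. The only structural assumption is the Hugging Face tokenizers JSON layout, where a `Sequence` pre-tokenizer lists its children under `pretokenizers` and a `Sequence` decoder under `decoders`.

```python
def has_byte_level(component):
    """True if a tokenizer.json component is ByteLevel, directly or inside a Sequence."""
    if not isinstance(component, dict):
        return False
    if component.get("type") == "ByteLevel":
        return True
    children = component.get("pretokenizers") or component.get("decoders") or []
    return any(has_byte_level(child) for child in children)


def is_spm_decoder(decoder):
    """True for the Sequence(Replace, ByteFallback, Fuse, Strip) decoder shape."""
    if not isinstance(decoder, dict) or decoder.get("type") != "Sequence":
        return False
    types = [d.get("type") for d in decoder.get("decoders", [])]
    return types == ["Replace", "ByteFallback", "Fuse", "Strip"]


def choose_detokenizer(tokenizer_json):
    pre = tokenizer_json.get("pre_tokenizer")
    dec = tokenizer_json.get("decoder")
    # ByteLevel anywhere wins, even if the decoder looks like SentencePiece.
    if has_byte_level(pre) or has_byte_level(dec):
        return "BPEStreamingDetokenizer"
    if is_spm_decoder(dec):
        return "SPMStreamingDetokenizer"
    return "default"  # placeholder label for any other fallback


SPM_DECODER = {
    "type": "Sequence",
    "decoders": [
        {"type": "Replace"},
        {"type": "ByteFallback"},
        {"type": "Fuse"},
        {"type": "Strip"},
    ],
}

# Hand-written hybrid: SPM-shaped decoder plus a ByteLevel pre-tokenizer.
tekken_like = {
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [{"type": "Split"}, {"type": "ByteLevel"}],
    },
    "decoder": SPM_DECODER,
}

# Hand-written genuine SPM tokenizer: same decoder, no ByteLevel pre-tokenizer.
genuine_spm = {"pre_tokenizer": None, "decoder": SPM_DECODER}

print(choose_detokenizer(tekken_like))
print(choose_detokenizer(genuine_spm))
```

Checking for ByteLevel before checking the decoder shape is what keeps the hybrid case out of the SPM branch while leaving genuine SPM tokenizers on their existing path.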

Testing

test_byte_level_vocab_with_spm_decoder: regression test for #1041, "Devstral (tekken v13 tokenizer) produces Ġ characters instead of spaces in output". It asserts BPE routing and exact, Ġ-free output for the tekken v13 tokenizer (downloads tokenizer files only).
test_has_byte_level_pretokenizer: unit test for the new helper.
Verified the existing test_tokenizers detokenizer-class expectations still hold: the SPM tokenizers (Mistral-7B-v0.2/v0.3, Phi-3.5) stay SPMStreamingDetokenizer and the BPE ones stay BPEStreamingDetokenizer; only the tekken case changes.

Mistral tekken v13 tokenizers (Devstral-Small-2, etc.) carry an SPM-style decoder ("Replace ▁->space, ByteFallback, Fuse, Strip") but a ByteLevel pre-tokenizer, so their vocabulary uses GPT-2 byte markers. load() selected the detokenizer from the decoder field alone, matching _is_spm_decoder and routing them to SPMStreamingDetokenizer, which only strips the "▁" marker and left the byte-level space marker "Ġ" (U+0120) in the output. Trust the pre-tokenizer: a ByteLevel vocabulary must use BPEStreamingDetokenizer regardless of the decoder shape. Genuine SPM tokenizers (no ByteLevel pre-tokenizer) are unaffected. Fixes ml-explore#1041.

robertlangdonn wants to merge 1 commit into ml-explore:main from robertlangdonn:fix-tekken-detokenizer-spaces