Skip to content

Fix detokenizer for byte-level tokenizers with an SPM-style decoder (fixes #1041)#1329

Open
robertlangdonn wants to merge 1 commit into
ml-explore:mainfrom
robertlangdonn:fix-tekken-detokenizer-spaces
Open

Fix detokenizer for byte-level tokenizers with an SPM-style decoder (fixes #1041)#1329
robertlangdonn wants to merge 1 commit into
ml-explore:mainfrom
robertlangdonn:fix-tekken-detokenizer-spaces

Conversation

@robertlangdonn
Copy link
Copy Markdown

Fixes #1041.

Summary

mlx_lm.server emits the byte-level space marker Ġ (U+0120) instead of spaces for Mistral's tekken v13 tokenizers (Devstral-Small-2, etc.):

before: Hello!ĠHowĠcanĠIĠassistĠyouĠtoday?
after:  Hello! How can I assist you today?

Root cause

tokenizer_utils.load() selects the streaming detokenizer by inspecting only the decoder field of tokenizer.json. The tekken v13 tokenizer is a hybrid:

  • its decoder is the classic SentencePiece sequence (Replace ▁→space, ByteFallback, Fuse, Strip), so it matches _is_spm_decoder and is routed to SPMStreamingDetokenizer;
  • but its pre_tokenizer is ByteLevel, so the vocabulary actually uses GPT-2-style byte markers (Ġ for spaces).

SPMStreamingDetokenizer only strips the marker, so the Ġ markers pass through untouched. (Notably, transformers' own tokenizer.decode() also mis-decodes this tokenizer; BPEStreamingDetokenizer produces correct output where HF does not.)

Fix

A ByteLevel pre-tokenizer is authoritative: the vocabulary is byte-level and must use BPEStreamingDetokenizer, regardless of the decoder shape. Added _has_byte_level_pretokenizer() and prefer the BPE detokenizer when it (or a ByteLevel decoder) is present. Genuine SPM tokenizers (no ByteLevel pre-tokenizer) are unaffected.

Testing

  • test_byte_level_vocab_with_spm_decoder — regression for Devstral (tekken v13 tokenizer) produces Ġ characters instead of spaces in output #1041: asserts BPE routing and exact, Ġ-free output for the tekken v13 tokenizer (downloads tokenizer files only).
  • test_has_byte_level_pretokenizer — unit test for the new helper.
  • Verified the existing test_tokenizers detokenizer-class expectations still hold: the SPM tokenizers (Mistral-7B-v0.2/v0.3, Phi-3.5) stay SPMStreamingDetokenizer and the BPE ones stay BPEStreamingDetokenizer; only the tekken case changes.

Mistral tekken v13 tokenizers (Devstral-Small-2, etc.) carry an SPM-style
decoder ("Replace ▁->space, ByteFallback, Fuse, Strip") but a ByteLevel
pre-tokenizer, so their vocabulary uses GPT-2 byte markers. load() selected
the detokenizer from the decoder field alone, matching _is_spm_decoder and
routing them to SPMStreamingDetokenizer, which only strips the "▁" marker and
left the byte-level space marker "Ġ" (U+0120) in the output.

Trust the pre-tokenizer: a ByteLevel vocabulary must use BPEStreamingDetokenizer
regardless of the decoder shape. Genuine SPM tokenizers (no ByteLevel
pre-tokenizer) are unaffected.

Fixes ml-explore#1041.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Devstral (tekken v13 tokenizer) produces Ġ characters instead of spaces in output

1 participant