fix: force Qwen2TokenizerFast for qwen2 model types by umran666 · Pull Request #369 · p-e-w/heretic

umran666 · 2026-06-10T15:20:43Z

This PR fixes a bug where Qwen2-based models (specifically distilled models like deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) output raw byte-level BPE marker characters (Ġ for a space, Ċ for a newline) and lose spacing when using the CLI chat or evaluation loop.

Some upstream model repositories specify LlamaTokenizerFast as the tokenizer_class in their tokenizer_config.json. When such a model is loaded via AutoTokenizer, Hugging Face applies Llama-style decoding to Qwen's byte-level BPE vocabulary, leading to broken character offsets and missing spaces.
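To illustrate where Ġ and Ċ come from, here is a minimal sketch of the GPT-2 style byte-to-character table used by byte-level BPE tokenizers. It assumes Qwen2's vocabulary follows that convention, which the marker characters in the report suggest. The helper names `bytes_to_unicode` and `decode_byte_level` are illustrative and are not part of heretic.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild the GPT-2 style byte-to-character table used by byte-level BPE."""
    # Printable bytes keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte (space, newline, control bytes, ...) is shifted
    # to the first unused code points starting at U+0100.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_byte_level(tokens: list[str]) -> str:
    """Map marker characters back to raw bytes, then decode the bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


# A decoder that skips this reverse mapping would emit the markers verbatim.
print(repr(decode_byte_level(["Hello", "Ġworld", "Ċ"])))
```

Under this assumption, a decoder that treats the vocabulary as Llama-style never performs the reverse mapping, so the markers reach the output unchanged.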

Intercept tokenizer loading by reading the model's configuration dictionary.
If model_type is "qwen2", force the loader to initialize Qwen2TokenizerFast directly.
Standard models continue to fall back to AutoTokenizer unmodified.
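The three steps above can be sketched roughly as follows. This is not the code from the PR: `pick_tokenizer_class` and `load_tokenizer` are hypothetical helpers, and `config_dict` is assumed to be the already-loaded model configuration dictionary. The lookup by name assumes the installed transformers version exports both `Qwen2TokenizerFast` and `AutoTokenizer`.

```python
from typing import Any


def pick_tokenizer_class(config_dict: dict[str, Any]) -> str:
    """Return the name of the transformers class the loader should use."""
    # Exact match on model_type, as in the original version of the PR.
    if config_dict.get("model_type") == "qwen2":
        return "Qwen2TokenizerFast"
    # Every other model keeps the default AutoTokenizer path.
    return "AutoTokenizer"


def load_tokenizer(model_id: str, config_dict: dict[str, Any], **tokenizer_kwargs: Any):
    """Load the tokenizer, bypassing tokenizer_class from tokenizer_config.json for qwen2."""
    # Imported lazily so the selection logic above can be tested offline.
    import transformers

    tokenizer_cls = getattr(transformers, pick_tokenizer_class(config_dict))
    return tokenizer_cls.from_pretrained(model_id, **tokenizer_kwargs)


print(pick_tokenizer_class({"model_type": "qwen2"}))
print(pick_tokenizer_class({"model_type": "llama"}))
```

Keeping the selection in a pure function separates the model-specific decision from the network-dependent load, which makes the workaround easy to test or remove.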


gemini-code-assist

Code Review

This pull request updates the model initialization in src/heretic/model.py to load the configuration dictionary and explicitly use Qwen2TokenizerFast when the model type is "qwen2" to prevent upstream configuration issues. The reviewer suggested making this check more robust by matching any model type starting with "qwen2" (e.g., "qwen2_moe", "qwen2_vl") instead of performing an exact match.


gemini-code-assist · 2026-06-10T15:31:43Z

+        if config_dict.get("model_type") == "qwen2":
+            from transformers import Qwen2TokenizerFast  # ty:ignore[unresolved-import]
+
+            self.tokenizer = Qwen2TokenizerFast.from_pretrained(
+                settings.model,
+                **tokenizer_kwargs,


Checking only for "qwen2" might miss other Qwen2-family model types (such as "qwen2_moe", "qwen2_vl", or "qwen2_audio") which also use the Qwen2 tokenizer and could suffer from similar upstream configuration issues. Checking if the model_type starts with "qwen2" is more robust and future-proof.

Suggested change

-        if config_dict.get("model_type") == "qwen2":
-            from transformers import Qwen2TokenizerFast  # ty:ignore[unresolved-import]
-
-            self.tokenizer = Qwen2TokenizerFast.from_pretrained(
-                settings.model,
-                **tokenizer_kwargs,
+        model_type = config_dict.get("model_type")
+        if isinstance(model_type, str) and model_type.startswith("qwen2"):
+            from transformers import Qwen2TokenizerFast
+
+            self.tokenizer = Qwen2TokenizerFast.from_pretrained(
+                settings.model,
+                **tokenizer_kwargs,
+            )
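The difference between the original check and the reviewer's suggestion can be seen in a small standalone sketch. The function names `is_qwen2_exact` and `is_qwen2_family` are hypothetical and exist only for this comparison.

```python
from typing import Any


def is_qwen2_exact(config_dict: dict[str, Any]) -> bool:
    """Original check: only the literal model type "qwen2" matches."""
    return config_dict.get("model_type") == "qwen2"


def is_qwen2_family(config_dict: dict[str, Any]) -> bool:
    """Suggested check: any string model type beginning with "qwen2" matches."""
    model_type = config_dict.get("model_type")
    # The isinstance guard avoids an AttributeError when model_type is missing or None.
    return isinstance(model_type, str) and model_type.startswith("qwen2")


for model_type in ("qwen2", "qwen2_moe", "qwen2_vl", "llama", None):
    config = {"model_type": model_type}
    print(model_type, is_qwen2_exact(config), is_qwen2_family(config))
```

The prefix match is broader by design: it also accepts any future model type whose name begins with "qwen2", whether or not that model shares the Qwen2 tokenizer, so it trades precision for coverage.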

p-e-w · 2026-06-10T15:47:11Z

I really think this should be fixed in the model config, and I certainly don't think we should hardcode such hacks into our code.

It sucks that they got it wrong, but we're not going to riddle our code with hacks to clean up their mess.

umran666 · 2026-06-10T15:51:29Z

Fair point. Keeping the codebase clean of model-specific workarounds is definitely the right priority. I'll close this PR and see if we can get it resolved upstream on the Hugging Face repository instead.

fix: load Qwen2TokenizerFast for qwen2 model types to prevent character corruption (0ae4dbc)

umran666 force-pushed the fix/qwen-tokenizer-decoding branch from d7c7c66 to 0ae4dbc on June 10, 2026 15:25

gemini-code-assist Bot reviewed Jun 10, 2026


umran666 closed this Jun 10, 2026

umran666 wants to merge 1 commit into p-e-w:master from umran666:fix/qwen-tokenizer-decoding
