fix: force Qwen2TokenizerFast for qwen2 model types #369

Closed
umran666 wants to merge 1 commit into p-e-w:master from umran666:fix/qwen-tokenizer-decoding

@umran666 umran666 commented Jun 10, 2026

Contributor

This PR fixes a bug where Qwen2-based models (specifically distilled models like deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) output raw byte-level BPE marker characters (Ġ in place of a space, Ċ in place of a newline) and lose spacing when using the CLI chat or evaluation loop.

Some upstream model repositories specify LlamaTokenizerFast as the tokenizer_class in their tokenizer_config.json. When the tokenizer is loaded via AutoTokenizer, this causes Hugging Face Transformers to apply Llama-style token decoding to Qwen's byte-level BPE vocabulary, leading to broken character offsets and missing spaces.
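For context, Ġ and Ċ are what a space and a newline look like in the GPT-2-style byte-to-unicode table that byte-level BPE vocabularies use. The following is a minimal, plain-Python sketch of that table and of the decoding step that turns the markers back into text. The names `bytes_to_unicode`, `BYTE_TO_CHAR`, `CHAR_TO_BYTE`, and `decode_tokens` are illustrative and are not taken from this PR.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Build a GPT-2-style table mapping every byte to a printable character."""
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte (space, newline, control bytes, ...) is shifted
    # to the code points starting at U+0100, in ascending byte order.
    next_code = 256
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(next_code)
            next_code += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def decode_tokens(tokens: list[str]) -> str:
    """Map each marker character back to its byte, then decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for token in tokens for ch in token)
    return raw.decode("utf-8", errors="replace")


tokens = ["Hello", "Ġworld", "Ċ"]
print("".join(tokens))              # markers left in place, as in the bug
print(repr(decode_tokens(tokens)))  # markers mapped back to space and newline
```

If the byte-level decoding step is skipped or replaced by a decoder written for a different vocabulary format, the marker characters reach the output unchanged, which matches the symptom described above.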

  • Intercept tokenizer loading by reading the model's configuration dictionary.
  • If model_type is "qwen2", force the loader to initialize Qwen2TokenizerFast directly.
  • Standard models continue to fall back to AutoTokenizer unmodified.
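
The three steps above can be sketched roughly as follows. This is not the code from the PR: `choose_tokenizer_class_name` and `load_tokenizer` are hypothetical helpers, reading the configuration through `PretrainedConfig.get_config_dict` is an assumption about how the dictionary is obtained, and the sketch assumes the installed transformers version exports `Qwen2TokenizerFast`.

```python
from typing import Any


def choose_tokenizer_class_name(config_dict: dict[str, Any]) -> str:
    """Return the name of the tokenizer class the loader should use.

    Hypothetical helper: only the exact model type "qwen2" is overridden,
    everything else keeps the AutoTokenizer path.
    """
    if config_dict.get("model_type") == "qwen2":
        return "Qwen2TokenizerFast"
    return "AutoTokenizer"


def load_tokenizer(model_id: str, **tokenizer_kwargs: Any):
    """Load a tokenizer, bypassing tokenizer_class for qwen2 models."""
    # Imported lazily so the selection logic above can be exercised
    # without transformers installed.
    import transformers

    # Assumption: the raw config dictionary is read this way; the PR text
    # only says that the configuration dictionary is read.
    config_dict, _ = transformers.PretrainedConfig.get_config_dict(model_id)
    tokenizer_cls = getattr(transformers, choose_tokenizer_class_name(config_dict))
    return tokenizer_cls.from_pretrained(model_id, **tokenizer_kwargs)


print(choose_tokenizer_class_name({"model_type": "qwen2"}))
print(choose_tokenizer_class_name({"model_type": "llama"}))
```

Keeping the decision in a small pure function makes the override easy to test in isolation from any model download.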

@umran666 umran666 force-pushed the fix/qwen-tokenizer-decoding branch from d7c7c66 to 0ae4dbc on June 10, 2026 15:25

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the model initialization in src/heretic/model.py to load the configuration dictionary and explicitly use Qwen2TokenizerFast when the model type is "qwen2" to prevent upstream configuration issues. The reviewer suggested making this check more robust by matching any model type starting with "qwen2" (e.g., "qwen2_moe", "qwen2_vl") instead of performing an exact match.

Comment thread src/heretic/model.py
Comment on lines +90 to +95
if config_dict.get("model_type") == "qwen2":
    from transformers import Qwen2TokenizerFast  # ty:ignore[unresolved-import]

    self.tokenizer = Qwen2TokenizerFast.from_pretrained(
        settings.model,
        **tokenizer_kwargs,

medium

Checking only for "qwen2" might miss other Qwen2-family model types (such as "qwen2_moe", "qwen2_vl", or "qwen2_audio") which also use the Qwen2 tokenizer and could suffer from similar upstream configuration issues. Checking if the model_type starts with "qwen2" is more robust and future-proof.

Suggested change
- if config_dict.get("model_type") == "qwen2":
-     from transformers import Qwen2TokenizerFast # ty:ignore[unresolved-import]
-     self.tokenizer = Qwen2TokenizerFast.from_pretrained(
-         settings.model,
-         **tokenizer_kwargs,
+ model_type = config_dict.get("model_type")
+ if isinstance(model_type, str) and model_type.startswith("qwen2"):
+     from transformers import Qwen2TokenizerFast
+     self.tokenizer = Qwen2TokenizerFast.from_pretrained(
+         settings.model,
+         **tokenizer_kwargs,
+     )
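
To make the difference between the two checks concrete, here is a small standalone comparison. The helper names `is_qwen2_exact` and `is_qwen2_family` are illustrative and do not appear in the PR. The candidate list includes the model types named in the review comment plus a few others chosen for contrast.

```python
def is_qwen2_exact(model_type: object) -> bool:
    """Original check: only the literal model type "qwen2" matches."""
    return model_type == "qwen2"


def is_qwen2_family(model_type: object) -> bool:
    """Suggested check: any string model type that starts with "qwen2"."""
    return isinstance(model_type, str) and model_type.startswith("qwen2")


candidates = ("qwen2", "qwen2_moe", "qwen2_vl", "qwen2_audio", "qwen3", "llama", None)
for candidate in candidates:
    print(f"{candidate!r}: exact={is_qwen2_exact(candidate)}, family={is_qwen2_family(candidate)}")
```

The prefix check is broader by design: it would also match any future model type whose name begins with "qwen2", whether or not that model shares the tokenizer, which is the trade-off against the exact match.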

p-e-w commented Jun 10, 2026

Owner

I really think this should be fixed in the model config, and I certainly don't think we should hardcode such hacks into our code.

It sucks that they got it wrong, but we're not going to riddle our code with hacks to clean up their mess.

@umran666

Contributor Author

Fair point. Keeping the codebase clean of model-specific workarounds is definitely the right priority. I'll close this PR and see if we can get it resolved upstream on the Hugging Face repository instead.

@umran666 umran666 closed this Jun 10, 2026