fix: force Qwen2TokenizerFast for qwen2 model types#369
Conversation
d7c7c66 to
0ae4dbc
Compare
There was a problem hiding this comment.
Code Review
This pull request updates the model initialization in src/heretic/model.py to load the configuration dictionary and explicitly use Qwen2TokenizerFast when the model type is "qwen2" to prevent upstream configuration issues. The reviewer suggested making this check more robust by matching any model type starting with "qwen2" (e.g., "qwen2_moe", "qwen2_vl") instead of performing an exact match.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if config_dict.get("model_type") == "qwen2": | ||
| from transformers import Qwen2TokenizerFast # ty:ignore[unresolved-import] | ||
|
|
||
| self.tokenizer = Qwen2TokenizerFast.from_pretrained( | ||
| settings.model, | ||
| **tokenizer_kwargs, |
There was a problem hiding this comment.
Checking only for "qwen2" might miss other Qwen2-family model types (such as "qwen2_moe", "qwen2_vl", or "qwen2_audio") which also use the Qwen2 tokenizer and could suffer from similar upstream configuration issues. Checking if the model_type starts with "qwen2" is more robust and future-proof.
| if config_dict.get("model_type") == "qwen2": | |
| from transformers import Qwen2TokenizerFast # ty:ignore[unresolved-import] | |
| self.tokenizer = Qwen2TokenizerFast.from_pretrained( | |
| settings.model, | |
| **tokenizer_kwargs, | |
| model_type = config_dict.get("model_type") | |
| if isinstance(model_type, str) and model_type.startswith("qwen2"): | |
| from transformers import Qwen2TokenizerFast | |
| self.tokenizer = Qwen2TokenizerFast.from_pretrained( | |
| settings.model, | |
| **tokenizer_kwargs, | |
| ) |
|
I really think this should be fixed in the model config, and I certainly don't think we should hardcode such hacks into our code. It sucks that they got it wrong, but we're not going to riddle our code with hacks to clean up their mess. |
|
Fair point. Keeping the codebase clean of model-specific workarounds is definitely the right priority. I'll close this PR and see if we can get it resolved upstream on the Hugging Face repository instead. |
This PR fixes a bug where Qwen2-based models (specifically distilled models like
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) output byte-fallback characters (Ġ,Ċ) and miss spacing when using the CLI chat or evaluation loop.Some upstream model repositories specify
LlamaTokenizerFastas thetokenizer_classin theirtokenizer_config.jsonconfiguration. When loaded viaAutoTokenizer, this causes Hugging Face to apply Llama-style token decoding boundaries on Qwen's tiktoken BPE structure, leading to broken character offsets and missing spaces.model_typeis"qwen2", force the loader to initializeQwen2TokenizerFastdirectly.AutoTokenizerunmodified.