Add native \p{M} (Unicode Mark) regex support for Qwen3.5 tokenizer #1063

Merged: sayanshaw24 merged 1 commit into main from asonawane/processing on May 15, 2026

@apsonawane
Contributor

Summary

Adds hand-coded regex matchers for Qwen3.5's tokenizer pre-tokenization patterns that use \p{M} (Unicode Mark category). Without this change, these patterns fall through to std::regex, which does not support Unicode property escapes and crashes at runtime.

Problem

Qwen3.5 is the only model family whose tokenizer regex includes \p{M}. Its full pre-tokenizer regex:

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+|\p{N}| ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+

Two sub-patterns contain \p{M}:

| Sub-pattern | Purpose |
|---|---|
| [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+ | Word matching (letters + combining marks like diacritics) |
|  ?[^\s\p{L}\p{M}\p{N}]+[\r\n]* | Punctuation/symbol matching, with an optional leading space (excludes marks from the punctuation class) |

The remaining 4 sub-patterns already have existing matchers (LLAMA3, GPT2).
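To make the word sub-pattern concrete, here is a minimal sketch of how [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+ could be matched by hand over decoded code points. It is not the code from this PR: the name MatchWordWithMarks and the Toy* classifiers are hypothetical, and the classifiers cover only ASCII, part of Latin-1, and the Combining Diacritical Marks block (U+0300 to U+036F) instead of the full Unicode categories.

```cpp
#include <cstddef>
#include <string>

// Toy classifiers (assumption: illustrative ranges only, not full Unicode tables).
inline bool ToyIsL(char32_t c) {
  return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') ||
         (c >= 0x00C0 && c <= 0x00FF && c != 0x00D7 && c != 0x00F7);
}
inline bool ToyIsM(char32_t c) { return c >= 0x0300 && c <= 0x036F; }
inline bool ToyIsN(char32_t c) { return c >= U'0' && c <= U'9'; }
inline bool ToyIsLM(char32_t c) { return ToyIsL(c) || ToyIsM(c); }

// Returns the number of code points matched by
//   [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+
// starting at s[pos], or 0 if the pattern does not match there.
inline std::size_t MatchWordWithMarks(const std::u32string& s, std::size_t pos) {
  // Greedy attempt: consume the optional one-character prefix first.
  if (pos < s.size() && s[pos] != U'\r' && s[pos] != U'\n' &&
      !ToyIsL(s[pos]) && !ToyIsN(s[pos])) {
    std::size_t j = pos + 1;
    while (j < s.size() && ToyIsLM(s[j])) ++j;
    if (j > pos + 1) return j - pos;  // prefix plus at least one L/M character
    // Otherwise backtrack and retry without the prefix. This matters when
    // s[pos] is itself a mark: it fits the prefix class and the body class.
  }
  std::size_t j = pos;
  while (j < s.size() && ToyIsLM(s[j])) ++j;
  return j - pos;
}
```

With these toy tables, the decomposed word "cafe" followed by U+0301 is consumed as one five-code-point match, because the combining mark belongs to the body class instead of ending the word.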

Changes

operators/tokenizer/bpe_utils.hpp

  • Added IsLM() helper — matches [\p{L}\p{M}]
  • Added NotLMNZ() helper — matches [^\s\p{L}\p{M}\p{N}]
  • Added Match_Qwen35_Pattern_1() — implements [^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+
  • Added Match_Qwen35_Pattern_2() — implements  ?[^\s\p{L}\p{M}\p{N}]+[\r\n]* (the pattern starts with an optional space)
  • Registered both patterns in the Compile() lookup table (before shorter LLAMA3 patterns to avoid shadowing)
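The punctuation sub-pattern can be sketched in the same spirit. The block below is an illustration under assumptions, not the PR's implementation: MatchPunctuationRun and the Toy* predicates are hypothetical names, whitespace is limited to ASCII whitespace, and the letter, mark, and digit tables cover only a few ranges.

```cpp
#include <cstddef>
#include <string>

// Toy classifiers (assumption: illustrative ranges only, not full Unicode tables).
inline bool ToyIsSpace(char32_t c) {
  return c == U' ' || c == U'\t' || c == U'\n' || c == U'\r' ||
         c == U'\f' || c == U'\v';
}
inline bool ToyIsLetter(char32_t c) {
  return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}
inline bool ToyIsMark(char32_t c) { return c >= 0x0300 && c <= 0x036F; }
inline bool ToyIsDigit(char32_t c) { return c >= U'0' && c <= U'9'; }

// Toy counterpart of the class [^\s\p{L}\p{M}\p{N}].
inline bool ToyNotLMNZ(char32_t c) {
  return !ToyIsSpace(c) && !ToyIsLetter(c) && !ToyIsMark(c) && !ToyIsDigit(c);
}

// Returns the number of code points matched by
//    ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*
// starting at s[pos], or 0 if the pattern does not match there.
inline std::size_t MatchPunctuationRun(const std::u32string& s, std::size_t pos) {
  std::size_t i = pos;
  if (i < s.size() && s[i] == U' ') ++i;  // optional single leading space
  const std::size_t body_start = i;
  while (i < s.size() && ToyNotLMNZ(s[i])) ++i;
  // No backtracking is needed: a space can never start the body class,
  // so dropping the optional space cannot turn a failure into a match.
  if (i == body_start) return 0;
  while (i < s.size() && (s[i] == U'\r' || s[i] == U'\n')) ++i;  // trailing newlines
  return i - pos;
}
```

Because marks are excluded from the body class, a stray U+0301 is never absorbed into a punctuation token; it is left for the word sub-pattern.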

test/pp_api_test/test_tokenizer_impl.cc

  • Added Qwen35RegexTest — compiles the full Qwen3.5 regex and verifies tokenization of text with combining marks (e.g., café using U+0301), punctuation, digits, and newlines
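For readers unfamiliar with combining marks, the sketch below shows what a test input like café with U+0301 looks like at the byte and code-point level. DecodeUtf8, kDecomposed, and kPrecomposed are hypothetical names introduced for illustration; the decoder assumes well-formed UTF-8 and performs no validation.

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 decoder (assumption: input is well-formed; no error handling).
inline std::u32string DecodeUtf8(const std::string& in) {
  std::u32string out;
  for (std::size_t i = 0; i < in.size();) {
    const unsigned char b = static_cast<unsigned char>(in[i]);
    char32_t cp = 0;
    std::size_t n = 1;
    if (b < 0x80) { cp = b; n = 1; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 2; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 3; }
    else { cp = b & 0x07; n = 4; }
    // Fold in the continuation bytes, six payload bits each.
    for (std::size_t k = 1; k < n && i + k < in.size(); ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += n;
  }
  return out;
}

// "cafe" followed by U+0301 COMBINING ACUTE ACCENT (bytes 0xCC 0x81).
const std::string kDecomposed = "cafe\xCC\x81";
// "caf" followed by the precomposed U+00E9 (bytes 0xC3 0xA9).
const std::string kPrecomposed = "caf\xC3\xA9";
```

The decomposed form decodes to five code points, the last of which is the mark U+0301, while the precomposed form decodes to four code points that are all letters. Only the decomposed form exercises the \p{M} part of the pattern.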

Contributor

Copilot AI left a comment

Pull request overview

Adds native pre-tokenizer support for Qwen3.5 regex alternatives that use Unicode mark (\p{M}), avoiding fallback to std::regex for unsupported Unicode property escapes.

Changes:

  • Added Qwen3.5-specific regex matcher functions and Unicode helper predicates.
  • Registered the new Qwen3.5 matcher patterns in PreTokenizerWithRegEx::Compile.
  • Added a regression test covering combining marks, punctuation, digits, whitespace, and newlines.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| operators/tokenizer/bpe_utils.hpp | Adds Qwen3.5 regex matcher implementations and pattern registration. |
| test/pp_api_test/test_tokenizer_impl.cc | Adds a tokenizer regex test for Qwen3.5-style \p{M} handling. |

Collaborator

@sayanshaw24 left a comment

looks great, thanks for adding this!

@sayanshaw24 enabled auto-merge (squash) May 15, 2026 16:20
@sayanshaw24 merged commit f29716e into main May 15, 2026
42 checks passed
@sayanshaw24 deleted the asonawane/processing branch May 15, 2026 17:15