Add native \p{M} (Unicode Mark) regex support for Qwen3.5 tokenizer by apsonawane · Pull Request #1063 · microsoft/onnxruntime-extensions

apsonawane · 2026-05-15T15:57:18Z

Summary

Adds hand-coded regex matchers for Qwen3.5's tokenizer pre-tokenization patterns that use \p{M} (Unicode Mark category). Without this change, these patterns fall through to std::regex, which does not support Unicode property escapes and crashes at runtime.

Problem

Qwen3.5 is the only model family whose tokenizer regex includes \p{M}. Its full pre-tokenizer regex:

(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+|\p{N}| ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+

Two sub-patterns contain \p{M}:

Sub-pattern	Purpose
`[^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+`	Word matching (letters + combining marks like diacritics)
` ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*`	Punctuation/symbol matching (excludes marks from punctuation class)

The remaining sub-patterns are already covered by existing matchers (LLAMA3, GPT2).

Changes

operators/tokenizer/bpe_utils.hpp

Added IsLM() helper: matches `[\p{L}\p{M}]`
Added NotLMNZ() helper: matches `[^\s\p{L}\p{M}\p{N}]`
Added Match_Qwen35_Pattern_1(): implements `[^\r\n\p{L}\p{N}]?[\p{L}\p{M}]+`
Added Match_Qwen35_Pattern_2(): implements ` ?[^\s\p{L}\p{M}\p{N}]+[\r\n]*`
Registered both patterns in the Compile() lookup table (ahead of the shorter LLAMA3 patterns, so that those do not shadow them)

test/pp_api_test/test_tokenizer_impl.cc

Added Qwen35RegexTest: compiles the full Qwen3.5 regex and verifies tokenization of text containing combining marks (e.g., café written with the combining acute accent U+0301), punctuation, digits, and newlines

Copilot

Pull request overview

Adds native pre-tokenizer support for Qwen3.5 regex alternatives that use Unicode mark (\p{M}), avoiding fallback to std::regex for unsupported Unicode property escapes.

Changes:

Added Qwen3.5-specific regex matcher functions and Unicode helper predicates.
Registered the new Qwen3.5 matcher patterns in PreTokenizerWithRegEx::Compile.
Added a regression test covering combining marks, punctuation, digits, whitespace, and newlines.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`operators/tokenizer/bpe_utils.hpp`	Adds Qwen3.5 regex matcher implementations and pattern registration.
`test/pp_api_test/test_tokenizer_impl.cc`	Adds a tokenizer regex test for Qwen3.5-style `\p{M}` handling.

sayanshaw24

looks great, thanks for adding this!

Add native \p{M} (Unicode Mark) regex support for Qwen3.5 tokenizer

27e861d

Copilot AI review requested due to automatic review settings May 15, 2026 15:57

apsonawane requested a review from a team as a code owner May 15, 2026 15:57

apsonawane mentioned this pull request May 15, 2026

Add text-only mode support for Qwen 3.5 model builder microsoft/onnxruntime-genai#2157

Open

Copilot started reviewing on behalf of apsonawane May 15, 2026 15:58

Copilot AI reviewed May 15, 2026

sayanshaw24 approved these changes May 15, 2026

sayanshaw24 enabled auto-merge (squash) May 15, 2026 16:20

sayanshaw24 merged commit f29716e into main May 15, 2026
42 checks passed

sayanshaw24 deleted the asonawane/processing branch May 15, 2026 17:15
