Skip to content

fix(offload): preserve non-BMP characters in sanitizeText#31

Open
akhilesharora wants to merge 1 commit into
Tencent:mainfrom
akhilesharora:fix/sanitize-text-preserve-non-bmp
Open

fix(offload): preserve non-BMP characters in sanitizeText#31
akhilesharora wants to merge 1 commit into
Tencent:mainfrom
akhilesharora:fix/sanitize-text-preserve-non-bmp

Conversation

@akhilesharora
Copy link
Copy Markdown

UNSAFE_CHAR_RE at src/offload/storage.ts:165 includes the full surrogate range [\uD800-\uDFFF] but is not declared with the u flag. Without u, JS regexes treat strings as UTF-16 code units, so every well-formed non-BMP code point (emoji, CJK Extension B, math bold, etc.) is a surrogate pair whose halves get stripped individually. sanitizeText, sanitizeJsonLine, and parseJsonlSafe therefore silently destroy these characters in tool params, tool results, and ref-md archives.

Reproduction with the exact regex copied from the file:

$ node -e '
const re = /[��-���-��-�\uD800-\uDFFF​-‏

]/g;
console.log(JSON.stringify("CJK ext-B \u{20BB7} here".replace(re, "")));'
"CJK ext-B  here"

The 𠮷 (U+20BB7) is gone. Same problem for 🎉, 𝐀, and every other supplementary character.

Fix: add the u flag. With u, the surrogate range only matches lone (malformed) surrogates, which is the original intent. Well-formed pairs are treated as single code points and pass through.

Other entries in the class (replacement char, C0/C1 controls, zero-width chars, line separators, BOM) keep their behavior.

Added src/offload/storage.test.ts with six cases:

  • plain ASCII survives
  • emoji, CJK Extension B, math bold survive
  • lone surrogates still stripped
  • C0 and C1 controls still stripped
  • zero-width and BOM still stripped
  • non-string input passes through (existing typeof guard)

On main the non-BMP test fails (expected 'emoji here' to be 'emoji 🎉 here'); with the fix all six pass.

Files touched:

  • src/offload/storage.ts: one-character change (/g -> /gu).
  • src/offload/storage.test.ts: new file, no runtime/network/db dependency.

Closes #30

UNSAFE_CHAR_RE included the surrogate range [\uD800-\uDFFF] without the `u`
flag, so JS treated strings as UTF-16 code units and stripped each half of
every well-formed non-BMP code point. sanitizeText and sanitizeJsonLine
therefore destroyed emoji, CJK Extension B, math bold, etc. in tool params,
tool results, and ref-md archives.

Adding the `u` flag makes paired surrogates combine into a single code point
before matching, so the [\uD800-\uDFFF] entry now matches only lone (malformed)
surrogates, which is the original intent. All other entries (replacement char,
C0/C1 controls, zero-width chars, line separators, BOM) keep their behavior.

Added a vitest suite covering the preserved and stripped cases.

Closes Tencent#30
@Maxwell-Code07
Copy link
Copy Markdown
Collaborator

Hi @akhilesharora, thanks for the fix!

This directly addresses #30 — adding the u flag is the correct minimal fix, and the test coverage (emoji, CJK Extension B, isolated surrogates, control chars) is thorough. We'll review and merge.

Thanks! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] sanitizeText silently strips emoji and CJK Extension B characters

2 participants