fix(offload): preserve non-BMP characters in sanitizeText by akhilesharora · Pull Request #31 · Tencent/TencentDB-Agent-Memory

akhilesharora · 2026-05-16T20:24:24Z

UNSAFE_CHAR_RE at src/offload/storage.ts:165 includes the full surrogate range [\uD800-\uDFFF] but is not declared with the u flag. Without u, JS regexes treat strings as UTF-16 code units, so every well-formed non-BMP code point (emoji, CJK Extension B, math bold, etc.) is a surrogate pair whose halves get stripped individually. sanitizeText, sanitizeJsonLine, and parseJsonlSafe therefore silently destroy these characters in tool params, tool results, and ref-md archives.

Reproduction with the exact regex copied from the file:

$ node -e '
const re = /[��-���-��-�\uD800-\uDFFF-‏  ]/g;
console.log(JSON.stringify("CJK ext-B \u{20BB7} here".replace(re, "")));'
"CJK ext-B  here"

The 𠮷 (U+20BB7) is gone. Same problem for 🎉, 𝐀, and every other supplementary character.

Fix: add the u flag. With u, the surrogate range only matches lone (malformed) surrogates, which is the original intent. Well-formed pairs are treated as single code points and pass through.

Other entries in the class (replacement char, C0/C1 controls, zero-width chars, line separators, BOM) keep their behavior.

Added src/offload/storage.test.ts with six cases:

plain ASCII survives
emoji, CJK Extension B, math bold survive
lone surrogates still stripped
C0 and C1 controls still stripped
zero-width and BOM still stripped
non-string input passes through (existing typeof guard)

On main the non-BMP test fails (expected 'emoji here' to be 'emoji 🎉 here'); with the fix all six pass.

Files touched:

src/offload/storage.ts: one-character change (/g -> /gu).
src/offload/storage.test.ts: new file, no runtime/network/db dependency.

Closes #30

UNSAFE_CHAR_RE included the surrogate range [\uD800-\uDFFF] without the `u` flag, so JS treated strings as UTF-16 code units and stripped each half of every well-formed non-BMP code point. sanitizeText and sanitizeJsonLine therefore destroyed emoji, CJK Extension B, math bold, etc. in tool params, tool results, and ref-md archives. Adding the `u` flag makes paired surrogates combine into a single code point before matching, so the [\uD800-\uDFFF] entry now matches only lone (malformed) surrogates, which is the original intent. All other entries (replacement char, C0/C1 controls, zero-width chars, line separators, BOM) keep their behavior. Added a vitest suite covering the preserved and stripped cases. Closes Tencent#30

Maxwell-Code07 · 2026-05-17T11:00:41Z

Hi @akhilesharora, thanks for the fix!

This directly addresses #30 — adding the u flag is the correct minimal fix, and the test coverage (emoji, CJK Extension B, isolated surrogates, control chars) is thorough. We'll review and merge.

Thanks! 🙏

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(offload): preserve non-BMP characters in sanitizeText#31

fix(offload): preserve non-BMP characters in sanitizeText#31
akhilesharora wants to merge 1 commit into
Tencent:mainfrom
akhilesharora:fix/sanitize-text-preserve-non-bmp

akhilesharora commented May 16, 2026

Uh oh!

Maxwell-Code07 commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

akhilesharora commented May 16, 2026

Uh oh!

Maxwell-Code07 commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants