OpenClaw Version
N/A - code-trace bug, no runtime required.
Plugin Version
0.3.4 (current main, package.json)
Operating System
Any. The bug is in a JS regex and is OS/platform independent.
Describe the bug
UNSAFE_CHAR_RE at src/offload/storage.ts:165 includes the full surrogate range [\uD800-\uDFFF] but the regex has no u flag. Because JS strings are UTF-16, every non-BMP code point (emoji, CJK Extension B, math bold, etc.) is stored as a surrogate PAIR, and the regex strips each half independently. sanitizeText and sanitizeJsonLine therefore destroy any non-BMP character in tool params, tool results, and ref-md archives written by the offload pipeline.
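To make the mechanism concrete, here is a minimal sketch of how one supplementary character looks to a regex without the u flag. The regex below is a reduced stand-in containing only the surrogate range, not the full UNSAFE_CHAR_RE, and the variable names are illustrative only.

```javascript
// U+1F389 lies outside the BMP, so UTF-16 stores it as two code units.
const party = "\u{1F389}";

// List the UTF-16 code units that make up the string.
const codeUnits = [];
for (let i = 0; i < party.length; i++) {
  codeUnits.push(party.charCodeAt(i).toString(16).toUpperCase());
}
console.log(party.length, codeUnits.join(" ")); // 2 D83C DF89

// Reduced stand-in for UNSAFE_CHAR_RE: surrogate range only, no u flag.
const surrogateRangeNoU = /[\uD800-\uDFFF]/g;

// Without the u flag the regex walks code units, so each half matches on its own.
const matches = party.match(surrogateRangeNoU) ?? [];
console.log(matches.length); // 2

// Both halves are removed, so nothing of the character survives.
const stripped = party.replace(surrogateRangeNoU, "");
console.log(stripped.length); // 0
```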
To Reproduce
node -e '
// Only the surrogate range of UNSAFE_CHAR_RE is shown; it is enough to trigger the bug.
const re = /[\uD800-\uDFFF]/g;
console.log(JSON.stringify("CJK ext-B \u{20BB7} here".replace(re, "")));'
// prints: "CJK ext-B  here" (the 𠮷 character is gone)
Same problem for 🎉 (U+1F389), 𝐀 (U+1D400, math bold A), and every other supplementary character.
Expected behavior
CJK Extension B 𠮷 and other non-BMP characters should pass through sanitizeText unchanged. The original intent of including [\uD800-\uDFFF] in the class is to strip LONE (malformed) surrogates, not to destroy well-formed supplementary characters.
Error Logs / Screenshots
Silent data corruption - there is no log, the characters just disappear from the offloaded JSONL entries and ref-md files.
Additional context
Affected callers (all in src/offload/):
sanitizeText: index.ts:429-430, 459-460 (every tool-call params/result)
sanitizeJsonLine: storage.ts:174 (every JSONL line written via safeStringifyEntry)
parseJsonlSafe: storage.ts:232 (second-pass strip on read)
Introduced in commit db8f3e5 (v0.3.3 release).
Suggested fix: add the u flag. With u, [\uD800-\uDFFF] matches only lone surrogates because paired surrogates have already been combined into a single code point.
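A rough sketch of the suggested change, assuming only the surrogate range matters for this comparison. Both regexes are reduced stand-ins for UNSAFE_CHAR_RE, and stripWith is a hypothetical helper, not a function in the plugin.

```javascript
// Reduced stand-ins for UNSAFE_CHAR_RE: only the surrogate range is shown.
const withoutU = /[\uD800-\uDFFF]/g; // current behaviour: matches code units
const withU = /[\uD800-\uDFFF]/gu; // suggested fix: matches code points

// Hypothetical helper mirroring what a sanitizer would do with either regex.
function stripWith(re, text) {
  return text.replace(re, "");
}

const wellFormed = "CJK ext-B \u{20BB7} here"; // valid surrogate pair
const loneHigh = "bad \uD83C tail"; // high surrogate with no partner
const loneLow = "bad \uDF89 tail"; // low surrogate with no partner

// Without u, the well-formed character is destroyed.
console.log(stripWith(withoutU, wellFormed) === "CJK ext-B  here"); // true

// With u, the well-formed character passes through unchanged...
console.log(stripWith(withU, wellFormed) === wellFormed); // true

// ...while lone (malformed) surrogates are still stripped.
console.log(stripWith(withU, loneHigh) === "bad  tail"); // true
console.log(stripWith(withU, loneLow) === "bad  tail"); // true
```

If the real character class contains other escapes, it is worth confirming they are still valid under the u flag, since Unicode mode rejects some escape forms that the legacy mode tolerates.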