[Bug] sanitizeText silently strips emoji and CJK Extension B characters #30

@akhilesharora

OpenClaw Version

N/A - code-trace bug, no runtime required.

Plugin Version

0.3.4 (current main, package.json)

Operating System

Any. The bug is in a JS regex and is OS/platform independent.

Describe the bug

UNSAFE_CHAR_RE at src/offload/storage.ts:165 includes the full surrogate range [\uD800-\uDFFF] but the regex has no u flag. Because JS strings are UTF-16, every non-BMP code point (emoji, CJK Extension B, math bold, etc.) is stored as a surrogate PAIR, and the regex strips each half independently. sanitizeText and sanitizeJsonLine therefore destroy any non-BMP character in tool params, tool results, and ref-md archives written by the offload pipeline.
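To make the mechanism concrete, here is a minimal sketch of how one supplementary character is stored and how a surrogate-range class treats it without the u flag. Only the surrogate range is used; it stands in for the full UNSAFE_CHAR_RE, and the variable names are illustrative, not taken from the plugin.

```javascript
// Illustrative only: the surrogate range stands in for the full UNSAFE_CHAR_RE.
const ch = "\u{20BB7}"; // 𠮷, a CJK Extension B character

// One code point, but two UTF-16 code units (a high and a low surrogate).
const codeUnits = [ch.charCodeAt(0), ch.charCodeAt(1)].map((n) => n.toString(16));
console.log(ch.length, codeUnits); // 2 [ 'd842', 'dfb7' ]

// Without the u flag the class is tested against each code unit separately,
// so both halves match and the whole character is removed.
const noFlag = /[\uD800-\uDFFF]/g;
const matches = ch.match(noFlag);
console.log(matches.length); // 2
console.log(ch.replace(noFlag, "").length); // 0
```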

To Reproduce

node -e '
const re = /[\uD800-\uDFFF]/g; // surrogate range from UNSAFE_CHAR_RE; the other ranges in the class do not affect this repro
console.log(JSON.stringify("CJK ext-B \u{20BB7} here".replace(re, "")));'
// prints: "CJK ext-B  here"   (the 𠮷 character is gone)

Same problem for 🎉 (U+1F389), 𝐀 (U+1D400, math bold A), and every other supplementary character.

Expected behavior

CJK Extension B 𠮷 and other non-BMP characters should pass through sanitizeText unchanged. The original intent of including [\uD800-\uDFFF] in the class is to strip LONE (malformed) surrogates, not to destroy well-formed supplementary characters.

Error Logs / Screenshots

Silent data corruption - there is no log, the characters just disappear from the offloaded JSONL entries and ref-md files.

Additional context

Affected callers (all in src/offload/):

  • sanitizeText: index.ts:429-430, 459-460 (every tool-call params/result)
  • sanitizeJsonLine: storage.ts:174 (every JSONL line written via safeStringifyEntry)
  • parseJsonlSafe: storage.ts:232 (second-pass strip on read)
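A rough sketch of why the JSONL write path fails silently: once the surrogate halves are stripped, the line is still valid JSON, so nothing throws on write or on read. `sanitizeJsonLineSketch` and the `entry` shape are hypothetical stand-ins; the real sanitizeJsonLine and safeStringifyEntry in storage.ts may differ.

```javascript
// Hypothetical stand-in for the write path; not the plugin's actual code.
const surrogateRe = /[\uD800-\uDFFF]/g; // no u flag, as described in this report

function sanitizeJsonLineSketch(line) {
  return line.replace(surrogateRe, "");
}

// Hypothetical entry shape, for illustration only.
const entry = { tool: "echo", result: "done \u{1F389}" };
const written = sanitizeJsonLineSketch(JSON.stringify(entry));

// The stripped line still parses, so the loss goes unnoticed.
const readBack = JSON.parse(written);
console.log(readBack.result === "done "); // true: the emoji is gone
console.log(readBack.result === entry.result); // false
```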

Introduced in commit db8f3e5 (v0.3.3 release).

Suggested fix: add the u flag. With u, [\uD800-\uDFFF] matches only lone surrogates because paired surrogates have already been combined into a single code point.
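A minimal sketch of the suggested fix, again using only the surrogate range. `loneSurrogateRe` and `stripLoneSurrogates` are illustrative names, not the plugin's actual identifiers.

```javascript
// Illustrative stand-ins, not the plugin's actual regex or function.
// With the u flag, a well-formed surrogate pair is read as one code point
// above U+FFFF, so only unpaired surrogates fall inside this range.
const loneSurrogateRe = /[\uD800-\uDFFF]/gu;

function stripLoneSurrogates(text) {
  return text.replace(loneSurrogateRe, "");
}

// Well-formed supplementary characters pass through unchanged.
console.log(stripLoneSurrogates("CJK ext-B \u{20BB7} here") === "CJK ext-B \u{20BB7} here"); // true
console.log(stripLoneSurrogates("party \u{1F389}") === "party \u{1F389}"); // true

// A lone high surrogate is still removed.
console.log(stripLoneSurrogates("bad \uD83C half") === "bad  half"); // true
```

One thing to check when applying this to the full UNSAFE_CHAR_RE: Unicode mode is stricter about escapes, so any unnecessary backslash escape elsewhere in the class would become a syntax error under u.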
