fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST) by ktat · Pull Request #886 · charmbracelet/x

ktat · 2026-06-04T11:06:44Z

Problem

The Anywhere transition block installs two entries in every state,
including the string states (OSC / DCS / SOS / PM / APC):

0x9C → ExecuteAction + GroundState (C1 ST as terminator)
0xC2..0xF4 → CollectAction + Utf8State (UTF-8 lead bytes)

Both are wrong for UTF-8 string payloads. 0x9C is a valid UTF-8
continuation byte — U+2733 (✳) is 0xE2 0x9C 0xB3, U+672B (末) is
0xE6 0x9C 0xAB — so treating it as ST splits the string mid-character
and leaks the tail as visible text. The 0xC2..0xF4 entries jump to
Utf8State which returns to GroundState, abandoning the string state
entirely.

Fix

Each string state's AddRange(0x20..0xFF, PutAction, …) already
overwrites the Anywhere entries — this PR ensures that order is
preserved and removes the explicit AddOne(0x9C, DispatchAction, …)
overrides in OSC, DCS, and SOS/PM/APC. For SOS/PM/APC the payload
range is widened from 0x20..0x7F to 0x20..0xFF.

The 8-bit C1 ST (0x9C) is intentionally dropped as a terminator.
ESC \ (7-bit ST) and BEL (OSC only) remain supported.

Tests

New ansi/parser/transition_table_test.go verifies directly against
the generated table that 0x9C and 0xC2..0xF4 are PutAction in
all five string states, and that 7-bit terminators still fire.

Existing tests updated to replace 0x9C terminators with ESC \.
Two test cases that relied on 8-bit ST termination are removed.

… C1 ST) The Anywhere block of GenerateTransitionTable installs two entries in every state, including OscStringState / DcsStringState / SosStringState / PmStringState / ApcStringState: - 0x9C -> ExecuteAction + GroundState (C1 ST as terminator) - 0xC2..0xF4 -> CollectAction + Utf8State (start UTF-8 lead bytes) For string-typed payloads carrying UTF-8 text (OSC 0 / OSC 2 window titles, OSC 8 hyperlink IDs, DCS data, etc.) both of these are wrong: - 0x9C is also a valid UTF-8 continuation byte — e.g. U+2733 (✳) encodes as 0xE2 0x9C 0xB3, and U+672B (末) as 0xE6 0x9C 0xAB. Treating the middle byte as ST splits the OSC / DCS string in half; the tail then leaks out as ordinary text (the visible symptom is a "session-name leak" into terminals that route OSC through this parser, while xterm.js handles the same bytes correctly). - 0xC2..0xF4 jumps into Utf8State, which completes the rune, performs PrintAction, and returns to GroundState — breaking out of the string mid-character. Fix: re-register both as PutAction inside every string state via the existing AddRange(0x20..0xFF, …, PutAction, …) (for SOS/PM/APC the payload range is widened from 0x20..0x7F to 0x20..0xFF for the same reason), and remove the AddOne(0x9C, …, DispatchAction, GroundState) override in OSC / DCS / SOS / PM / APC. The 8-bit C1 ST (0x9C) is intentionally dropped as a terminator. ESC \\ (7-bit ST) and BEL (for OSC) remain as the supported terminators — modern UTF-8 terminals rely on the 7-bit form anyway. parser_decode.go's StringState handler is updated to match (the `case ST:` branch is removed). ## Tests New (ansi/parser/transition_table_test.go) — exercises the fix directly against the generated table: - TestStringState_C1ST_NotDispatched: 0x9C is PutAction in all 5 string states (Osc/Dcs/Sos/Pm/Apc). - TestStringState_Utf8LeadBytes_StayInState: 0xC2 / 0xDF / 0xE0 / 0xEF / 0xF0 / 0xF4 stay in the same string state with PutAction. - TestStringState_7BitTerminators: ESC still dispatches to EscapeState (kicking off 7-bit ST), BEL still dispatches OSC. NB: the previous attempt to demonstrate the fix from vt via vt/osc_test.go was non-functional because vt's go.mod pins github.com/charmbracelet/x/ansi (the published release, not the local tree), so the test silently ran against the unfixed parser. That file is removed and the verification is moved into the ansi tree where the modified code is compiled directly. ## Compatibility-affecting test changes (8-bit C1 ST drop) Updates / removals in existing ansi tests, all caused by the C1 ST drop: - parser_osc_test.go::TestOscSequence/utf8: trailing 0x9C → 7-bit ST (ESC \\), expected gains Cmd('\\'). Also serves as a regression test for U+2733-style 0x9C-in-payload survival. - parser_osc_test.go::TestOscSequence/string_terminator: ditto (expected payload now includes the full UTF-8 byte sequence 0xE6 0x9C 0xAB). - parser_dcs_test.go::TestDcsSequence/reset and /parse: trailing 0x9C → 7-bit ST (ESC \\), expected gains Cmd('\\') to mirror the existing /max_params test shape. - parser_decode_test.go::TestDecodeSequence/ST_in_the_middle_of_DCS: obsolete (8-bit ST embedded in DCS payload); removed with a pointer to the new behaviour in a stub comment. - parser_decode_test.go::TestDecodeSequence/set_background_OSC_sequence_with_ST_8-bit_terminator: obsolete; removed with the same kind of stub. - width_test.go::cases osc and osc8eastasianlink: trailing 0x9C → 7-bit ST (ESC \\). The semantic remains the same. - truncate_test.go::cases osc8_8bit: same 0x9C → 7-bit ST update on input and both expected sequences. Callers of this parser that emitted 8-bit OSC / DCS termination need to switch to ESC \\ (or BEL for OSC). For UTF-8-only environments (the practical default) this is a strict improvement; for legacy clients still depending on 8-bit C1 ST the right shape is probably an opt-in mode wired through the parser — that is intentionally out of scope here and worth taking up with upstream. ## Known limitation (design pointer) The root cause is that UTF-8 multi-byte payloads inside string states are decoded as bare bytes before any UTF-8 reassembly. This patch plugs the leak by widening the PutAction range to 0x20..0xFF in every string state, but a more structurally correct fix would route the ≥0x80 payload bytes through Utf8State (or an equivalent decoder) *while remaining inside the string state* — xterm.js does this. That refactor is left for a follow-up.

ktat · 2026-06-04T11:18:26Z

Note: I became aware of #848 after opening this PR. The approach proposed there — tracking UTF-8 continuation state so that 0x9C is treated as ST only when it is not part of an active multi-byte sequence — is more correct than this PR, which simply drops 8-bit C1 ST as a terminator entirely.

If a fix along the lines of #848 lands, this PR should be closed in favor of that. Happy to close it now if the maintainers prefer to wait for the more complete fix.

ktat · 2026-06-04T11:26:20Z

Also, the "Known limitation" section in the commit message contains two inaccuracies:

"xterm.js does this" — xterm.js does not have this problem in the first place because it operates on decoded Unicode codepoints (JavaScript strings are UTF-16), not raw bytes. There is no UTF-8 reassembly to do.
"route the ≥0x80 payload bytes through Utf8State" — the approach proposed in ansi: OSC/DCS/SOS/PM/APC strings truncated by UTF-8 continuation byte 0x9C #848 is different: keep the state machine as-is and track a UTF-8 continuation byte counter outside it, so that 0x9C is treated as ST only when no multi-byte sequence is in progress.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST)#886

fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST)#886
ktat wants to merge 1 commit into
charmbracelet:mainfrom
ktat:fix-osc-utf8-c1-st-conflict

ktat commented Jun 4, 2026

Uh oh!

ktat commented Jun 4, 2026

Uh oh!

ktat commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ktat commented Jun 4, 2026

Problem

Fix

Tests

Uh oh!

ktat commented Jun 4, 2026

Uh oh!

ktat commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant