Skip to content

fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST)#886

Open
ktat wants to merge 1 commit into
charmbracelet:mainfrom
ktat:fix-osc-utf8-c1-st-conflict
Open

fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST)#886
ktat wants to merge 1 commit into
charmbracelet:mainfrom
ktat:fix-osc-utf8-c1-st-conflict

Conversation

@ktat

@ktat ktat commented Jun 4, 2026

Copy link
Copy Markdown

Problem

The Anywhere transition block installs two entries in every state,
including the string states (OSC / DCS / SOS / PM / APC):

  • 0x9CExecuteAction + GroundState (C1 ST as terminator)
  • 0xC2..0xF4CollectAction + Utf8State (UTF-8 lead bytes)

Both are wrong for UTF-8 string payloads. 0x9C is a valid UTF-8
continuation byte — U+2733 (✳) is 0xE2 0x9C 0xB3, U+672B (末) is
0xE6 0x9C 0xAB — so treating it as ST splits the string mid-character
and leaks the tail as visible text. The 0xC2..0xF4 entries jump to
Utf8State which returns to GroundState, abandoning the string state
entirely.

Fix

Each string state's AddRange(0x20..0xFF, PutAction, …) already
overwrites the Anywhere entries — this PR ensures that order is
preserved and removes the explicit AddOne(0x9C, DispatchAction, …)
overrides in OSC, DCS, and SOS/PM/APC. For SOS/PM/APC the payload
range is widened from 0x20..0x7F to 0x20..0xFF.

The 8-bit C1 ST (0x9C) is intentionally dropped as a terminator.
ESC \ (7-bit ST) and BEL (OSC only) remain supported.

Tests

New ansi/parser/transition_table_test.go verifies directly against
the generated table that 0x9C and 0xC2..0xF4 are PutAction in
all five string states, and that 7-bit terminators still fire.

Existing tests updated to replace 0x9C terminators with ESC \.
Two test cases that relied on 8-bit ST termination are removed.

… C1 ST)

The Anywhere block of GenerateTransitionTable installs two entries in
every state, including OscStringState / DcsStringState /
SosStringState / PmStringState / ApcStringState:

  - 0x9C       -> ExecuteAction + GroundState  (C1 ST as terminator)
  - 0xC2..0xF4 -> CollectAction + Utf8State    (start UTF-8 lead bytes)

For string-typed payloads carrying UTF-8 text (OSC 0 / OSC 2 window
titles, OSC 8 hyperlink IDs, DCS data, etc.) both of these are wrong:

  - 0x9C is also a valid UTF-8 continuation byte — e.g. U+2733 (✳)
    encodes as 0xE2 0x9C 0xB3, and U+672B (末) as 0xE6 0x9C 0xAB.
    Treating the middle byte as ST splits the OSC / DCS string in half;
    the tail then leaks out as ordinary text (the visible symptom is a
    "session-name leak" into terminals that route OSC through this
    parser, while xterm.js handles the same bytes correctly).
  - 0xC2..0xF4 jumps into Utf8State, which completes the rune, performs
    PrintAction, and returns to GroundState — breaking out of the
    string mid-character.

Fix: re-register both as PutAction inside every string state via the
existing AddRange(0x20..0xFF, …, PutAction, …) (for SOS/PM/APC the
payload range is widened from 0x20..0x7F to 0x20..0xFF for the same
reason), and remove the AddOne(0x9C, …, DispatchAction, GroundState)
override in OSC / DCS / SOS / PM / APC.

The 8-bit C1 ST (0x9C) is intentionally dropped as a terminator. ESC \\
(7-bit ST) and BEL (for OSC) remain as the supported terminators —
modern UTF-8 terminals rely on the 7-bit form anyway. parser_decode.go's
StringState handler is updated to match (the `case ST:` branch is
removed).

## Tests

New (ansi/parser/transition_table_test.go) — exercises the fix directly
against the generated table:

  - TestStringState_C1ST_NotDispatched: 0x9C is PutAction in all 5 string
    states (Osc/Dcs/Sos/Pm/Apc).
  - TestStringState_Utf8LeadBytes_StayInState: 0xC2 / 0xDF / 0xE0 / 0xEF
    / 0xF0 / 0xF4 stay in the same string state with PutAction.
  - TestStringState_7BitTerminators: ESC still dispatches to EscapeState
    (kicking off 7-bit ST), BEL still dispatches OSC.

NB: the previous attempt to demonstrate the fix from vt via
vt/osc_test.go was non-functional because vt's go.mod pins
github.com/charmbracelet/x/ansi (the published release, not the local
tree), so the test silently ran against the unfixed parser. That file
is removed and the verification is moved into the ansi tree where the
modified code is compiled directly.

## Compatibility-affecting test changes (8-bit C1 ST drop)

Updates / removals in existing ansi tests, all caused by the C1 ST
drop:

  - parser_osc_test.go::TestOscSequence/utf8: trailing 0x9C → 7-bit ST
    (ESC \\), expected gains Cmd('\\'). Also serves as a regression
    test for U+2733-style 0x9C-in-payload survival.
  - parser_osc_test.go::TestOscSequence/string_terminator: ditto
    (expected payload now includes the full UTF-8 byte sequence
    0xE6 0x9C 0xAB).
  - parser_dcs_test.go::TestDcsSequence/reset and /parse: trailing
    0x9C → 7-bit ST (ESC \\), expected gains Cmd('\\') to mirror the
    existing /max_params test shape.
  - parser_decode_test.go::TestDecodeSequence/ST_in_the_middle_of_DCS:
    obsolete (8-bit ST embedded in DCS payload); removed with a
    pointer to the new behaviour in a stub comment.
  - parser_decode_test.go::TestDecodeSequence/set_background_OSC_sequence_with_ST_8-bit_terminator:
    obsolete; removed with the same kind of stub.
  - width_test.go::cases osc and osc8eastasianlink: trailing 0x9C →
    7-bit ST (ESC \\). The semantic remains the same.
  - truncate_test.go::cases osc8_8bit: same 0x9C → 7-bit ST update on
    input and both expected sequences.

Callers of this parser that emitted 8-bit OSC / DCS termination need
to switch to ESC \\ (or BEL for OSC). For UTF-8-only environments
(the practical default) this is a strict improvement; for legacy
clients still depending on 8-bit C1 ST the right shape is probably an
opt-in mode wired through the parser — that is intentionally out of
scope here and worth taking up with upstream.

## Known limitation (design pointer)

The root cause is that UTF-8 multi-byte payloads inside string states
are decoded as bare bytes before any UTF-8 reassembly. This patch
plugs the leak by widening the PutAction range to 0x20..0xFF in every
string state, but a more structurally correct fix would route the
≥0x80 payload bytes through Utf8State (or an equivalent decoder)
*while remaining inside the string state* — xterm.js does this. That
refactor is left for a follow-up.
@ktat

ktat commented Jun 4, 2026

Copy link
Copy Markdown
Author

Note: I became aware of #848 after opening this PR. The approach proposed there — tracking UTF-8 continuation state so that 0x9C is treated as ST only when it is not part of an active multi-byte sequence — is more correct than this PR, which simply drops 8-bit C1 ST as a terminator entirely.

If a fix along the lines of #848 lands, this PR should be closed in favor of that. Happy to close it now if the maintainers prefer to wait for the more complete fix.

@ktat

ktat commented Jun 4, 2026

Copy link
Copy Markdown
Author

Also, the "Known limitation" section in the commit message contains two inaccuracies:

  • "xterm.js does this" — xterm.js does not have this problem in the first place because it operates on decoded Unicode codepoints (JavaScript strings are UTF-16), not raw bytes. There is no UTF-8 reassembly to do.
  • "route the ≥0x80 payload bytes through Utf8State" — the approach proposed in ansi: OSC/DCS/SOS/PM/APC strings truncated by UTF-8 continuation byte 0x9C #848 is different: keep the state machine as-is and track a UTF-8 continuation byte counter outside it, so that 0x9C is treated as ST only when no multi-byte sequence is in progress.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant