fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST)#886
Open
ktat wants to merge 1 commit into
Open
fix(ansi/parser): keep UTF-8 string-state payloads intact (drop 8-bit C1 ST)#886ktat wants to merge 1 commit into
ktat wants to merge 1 commit into
Conversation
… C1 ST)
The Anywhere block of GenerateTransitionTable installs two entries in
every state, including OscStringState / DcsStringState /
SosStringState / PmStringState / ApcStringState:
- 0x9C -> ExecuteAction + GroundState (C1 ST as terminator)
- 0xC2..0xF4 -> CollectAction + Utf8State (start UTF-8 lead bytes)
For string-typed payloads carrying UTF-8 text (OSC 0 / OSC 2 window
titles, OSC 8 hyperlink IDs, DCS data, etc.) both of these are wrong:
- 0x9C is also a valid UTF-8 continuation byte — e.g. U+2733 (✳)
encodes as 0xE2 0x9C 0xB3, and U+672B (末) as 0xE6 0x9C 0xAB.
Treating the middle byte as ST splits the OSC / DCS string in half;
the tail then leaks out as ordinary text (the visible symptom is a
"session-name leak" into terminals that route OSC through this
parser, while xterm.js handles the same bytes correctly).
- 0xC2..0xF4 jumps into Utf8State, which completes the rune, performs
PrintAction, and returns to GroundState — breaking out of the
string mid-character.
Fix: re-register both as PutAction inside every string state via the
existing AddRange(0x20..0xFF, …, PutAction, …) (for SOS/PM/APC the
payload range is widened from 0x20..0x7F to 0x20..0xFF for the same
reason), and remove the AddOne(0x9C, …, DispatchAction, GroundState)
override in OSC / DCS / SOS / PM / APC.
The 8-bit C1 ST (0x9C) is intentionally dropped as a terminator. ESC \\
(7-bit ST) and BEL (for OSC) remain as the supported terminators —
modern UTF-8 terminals rely on the 7-bit form anyway. parser_decode.go's
StringState handler is updated to match (the `case ST:` branch is
removed).
## Tests
New (ansi/parser/transition_table_test.go) — exercises the fix directly
against the generated table:
- TestStringState_C1ST_NotDispatched: 0x9C is PutAction in all 5 string
states (Osc/Dcs/Sos/Pm/Apc).
- TestStringState_Utf8LeadBytes_StayInState: 0xC2 / 0xDF / 0xE0 / 0xEF
/ 0xF0 / 0xF4 stay in the same string state with PutAction.
- TestStringState_7BitTerminators: ESC still dispatches to EscapeState
(kicking off 7-bit ST), BEL still dispatches OSC.
NB: the previous attempt to demonstrate the fix from vt via
vt/osc_test.go was non-functional because vt's go.mod pins
github.com/charmbracelet/x/ansi (the published release, not the local
tree), so the test silently ran against the unfixed parser. That file
is removed and the verification is moved into the ansi tree where the
modified code is compiled directly.
## Compatibility-affecting test changes (8-bit C1 ST drop)
Updates / removals in existing ansi tests, all caused by the C1 ST
drop:
- parser_osc_test.go::TestOscSequence/utf8: trailing 0x9C → 7-bit ST
(ESC \\), expected gains Cmd('\\'). Also serves as a regression
test for U+2733-style 0x9C-in-payload survival.
- parser_osc_test.go::TestOscSequence/string_terminator: ditto
(expected payload now includes the full UTF-8 byte sequence
0xE6 0x9C 0xAB).
- parser_dcs_test.go::TestDcsSequence/reset and /parse: trailing
0x9C → 7-bit ST (ESC \\), expected gains Cmd('\\') to mirror the
existing /max_params test shape.
- parser_decode_test.go::TestDecodeSequence/ST_in_the_middle_of_DCS:
obsolete (8-bit ST embedded in DCS payload); removed with a
pointer to the new behaviour in a stub comment.
- parser_decode_test.go::TestDecodeSequence/set_background_OSC_sequence_with_ST_8-bit_terminator:
obsolete; removed with the same kind of stub.
- width_test.go::cases osc and osc8eastasianlink: trailing 0x9C →
7-bit ST (ESC \\). The semantic remains the same.
- truncate_test.go::cases osc8_8bit: same 0x9C → 7-bit ST update on
input and both expected sequences.
Callers of this parser that emitted 8-bit OSC / DCS termination need
to switch to ESC \\ (or BEL for OSC). For UTF-8-only environments
(the practical default) this is a strict improvement; for legacy
clients still depending on 8-bit C1 ST the right shape is probably an
opt-in mode wired through the parser — that is intentionally out of
scope here and worth taking up with upstream.
## Known limitation (design pointer)
The root cause is that UTF-8 multi-byte payloads inside string states
are decoded as bare bytes before any UTF-8 reassembly. This patch
plugs the leak by widening the PutAction range to 0x20..0xFF in every
string state, but a more structurally correct fix would route the
≥0x80 payload bytes through Utf8State (or an equivalent decoder)
*while remaining inside the string state* — xterm.js does this. That
refactor is left for a follow-up.
Author
|
Note: I became aware of #848 after opening this PR. The approach proposed there — tracking UTF-8 continuation state so that 0x9C is treated as ST only when it is not part of an active multi-byte sequence — is more correct than this PR, which simply drops 8-bit C1 ST as a terminator entirely. If a fix along the lines of #848 lands, this PR should be closed in favor of that. Happy to close it now if the maintainers prefer to wait for the more complete fix. |
Author
|
Also, the "Known limitation" section in the commit message contains two inaccuracies:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
Anywheretransition block installs two entries in every state,including the string states (OSC / DCS / SOS / PM / APC):
0x9C→ExecuteAction + GroundState(C1 ST as terminator)0xC2..0xF4→CollectAction + Utf8State(UTF-8 lead bytes)Both are wrong for UTF-8 string payloads.
0x9Cis a valid UTF-8continuation byte — U+2733 (✳) is
0xE2 0x9C 0xB3, U+672B (末) is0xE6 0x9C 0xAB— so treating it as ST splits the string mid-characterand leaks the tail as visible text. The
0xC2..0xF4entries jump toUtf8Statewhich returns toGroundState, abandoning the string stateentirely.
Fix
Each string state's
AddRange(0x20..0xFF, PutAction, …)alreadyoverwrites the Anywhere entries — this PR ensures that order is
preserved and removes the explicit
AddOne(0x9C, DispatchAction, …)overrides in OSC, DCS, and SOS/PM/APC. For SOS/PM/APC the payload
range is widened from
0x20..0x7Fto0x20..0xFF.The 8-bit C1 ST (
0x9C) is intentionally dropped as a terminator.ESC
\(7-bit ST) and BEL (OSC only) remain supported.Tests
New
ansi/parser/transition_table_test.goverifies directly againstthe generated table that
0x9Cand0xC2..0xF4arePutActioninall five string states, and that 7-bit terminators still fire.
Existing tests updated to replace
0x9Cterminators withESC \.Two test cases that relied on 8-bit ST termination are removed.