fix(acceptance): CJK phrase boundaries and decode stop, so Japanese accepts advance clause by clause#669
Merged
Merged
Conversation
A Japanese phrase accept arrived as one giant chunk: phrase mode only knew the ASCII terminators (. ! ?), so the ideographic full stop never ended a phrase and the clause comma was no boundary at all, and the same ASCII-only assumption in SentenceBoundaryClassifier meant the decode stop policy never fired for CJK text, so generations always ran to the token budget. A flat Japanese tail also had a punctuation cliff: a chunk that starts with CJK punctuation skips ICU word segmentation (punctuation does not begin a space-less-script word) and swallowed everything up to the next whitespace in a single accept, in word mode too. Phrase boundaries now include the CJK sentence terminators and treat the ideographic and fullwidth commas as clause stops, so Tab advances clause by clause the way Japanese prose reads. Word chunking binds a trailing CJK punctuation run to the word it follows and peels a punctuation-led run as its own chunk, removing the cliff. The classifier recognizes the CJK terminators (which, unlike the ASCII period, are unambiguous) and the CJK closing brackets for its walk-back, so generation stops at the end of a Japanese sentence like it does for English. All added codepoints occur only in CJK text and ASCII "," stays a non-boundary, so space-delimited scripts are byte-for-byte unchanged.
…unctuation Auditing the CJK chunking surfaced one remaining cliff: a chunk starting at a CJK opening bracket neither begins a space-less-script word nor binds to the preceding one, so a flat quoted run like the tail of 彼は「分かった」と言った was still swallowed whole by the whitespace scan. The punctuation-led peel now takes opening brackets too (the trailing binding still stops before them, since an opener belongs to the next word), and the halfwidth kana forms 、 」 「 join their fullwidth counterparts in the clause, closer, and opener sets, with the halfwidth corner also added to the classifier's closer walk. ASCII brackets and quotes are untouched by the peel (the sets stay CJK-only), locked in by regression tests alongside the opener, mixed-run, and halfwidth cases. Full unit bundle: 993 tests, 0 failures.
The CJK terminator and closer codepoint lists were restated in the reconciler's phrase policy and again in SentenceBoundaryClassifier's closer walk, so adding a codepoint required parallel edits with no compiler enforcement. The two primitive sets are now one internal Character extension in the reconciler file, and both the phrase policy and the classifier compose from them.
Owner
Author
|
Addressed the review finding: the CJK terminator and closer sets are now declared once (internal Character extension) and composed by both the phrase policy and SentenceBoundaryClassifier, so a future codepoint is a single edit. Local validation on the rebased branch (top of merged #668): xcodegen no drift, swiftlint clean, build-for-testing succeeded, full unit bundle 1001 tests / 0 failures. |
472af39 to
2ab9bb3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A Japanese phrase accept could arrive as one giant chunk (reported tail:
理解し、その内容を自分の言葉で表現する。in a single Tab) because every layer only knew ASCII punctuation: phrase mode never stopped at。or、, a punctuation-led chunk skipped ICU segmentation and swallowed everything to the next whitespace, andSentenceBoundaryClassifiernever registered a CJK sentence end so the decode stop policy let generations run to the full token budget. Phrase boundaries now include the CJK terminators (。!?。) and clause commas (、,、), word chunks bind trailing CJK punctuation to their word (読み、is one Tab), punctuation-led runs peel as their own chunk (including opening brackets, so flat quoted runs like「分かった」と言ったno longer swallow whole), and generation stops at the end of a Japanese sentence like it does for English. All added codepoints occur only in CJK text and ASCII,/brackets stay non-boundaries, so space-delimited scripts are byte-for-byte unchanged.Validation
xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NOthentest-without-buildingfor the full unit bundle: ** TEST SUCCEEDED **, 993 tests, 0 failures (4 pre-existing skips). New cases cover the reported clause (理解し、stops at the comma),。/!/?stops, the終わり。」closer walk-back, the leading-punctuation cliff, the opening-bracket peel (「分かった」と言った→「), mixed closer-opener runs, halfwidth、 。 」parity, auto-accept-off re-peeling, and ASCII comma/bracket non-boundaries for English.swiftlint lint --quieton all changed files: exit 0, no findings.Support/logic locked by the unit tests above.Linked issues
None filed; follow-up to the Japanese IME report behind #668 (same user feedback thread, separate root cause).
Risk / rollout notes
、) and stops at。, instead of accepting an entire space-less sentence per press. Word granularity gains the punctuation binding (資料、is one chunk), loses the punctuation cliff, and peels opening brackets as their own chunk.cotabbyinferencerevision that may not be resolvable by CI until the package push lands.Greptile Summary
This PR fixes CJK (Japanese/Chinese) text acceptance in phrase mode by threading phrase-boundary awareness through three layers: the decode stop policy, the phrase accumulator, and the word-chunk extractor. Before this change, a Japanese phrase could arrive as one giant Tab because every layer only knew ASCII punctuation.
SentenceBoundaryClassifiergains CJK sentence terminators (。!?。) so the decode stop policy fires at the end of a Japanese sentence instead of running to the token budget, and the closing-punctuation walk-back now skips CJK brackets so終わり。」registers as a sentence end.SuggestionSessionReconcileraddsendOfCJKPunctuationRunto bind trailing、to the preceding word (読み、is one chunk), a punctuation-led peel so opening brackets like「don't swallow flat quoted runs, andendsAtPhraseBoundaryreplacesendsInSentenceTerminatorto stop the phrase accumulator at ideographic commas as well as terminators.Characterextension so a future codepoint addition updates all three policies in one edit.Confidence Score: 5/5
Safe to merge; all changes are additive, CJK-only string-parsing logic with no impact on ASCII/space-delimited text, validated by 993 passing unit tests.
All changes are additive CJK-only string-parsing logic, ASCII/space-delimited paths are byte-for-byte unchanged and covered by existing tests, and every new code path has dedicated unit tests including edge cases and policy-flag interactions.
No files require special attention beyond the minor doc-comment omission in SuggestionEngineModels.swift.
Important Files Changed
Reviews (3): Last reviewed commit: "Address Greptile review on #669: single-..." | Re-trigger Greptile