fix(acceptance): CJK phrase boundaries and decode stop, so Japanese accepts advance clause by clause by FuJacob · Pull Request #669 · FuJacob/cotabby

FuJacob · 2026-06-11T04:41:23Z

Summary

A Japanese phrase accept could arrive as one giant chunk (reported tail: 理解し、その内容を自分の言葉で表現する。 in a single Tab) because every layer only knew ASCII punctuation. Phrase mode never stopped at 。 or 、. A punctuation-led chunk skipped ICU segmentation and swallowed everything to the next whitespace. SentenceBoundaryClassifier never registered a CJK sentence end, so the decode stop policy let generations run to the full token budget. Phrase boundaries now include the CJK terminators (。！？｡) and clause commas (、，､). Word chunks bind trailing CJK punctuation to their word (読み、 is one Tab). Punctuation-led runs peel as their own chunk (including opening brackets, so flat quoted runs like 「分かった」と言った are no longer swallowed whole). Generation stops at the end of a Japanese sentence as it does for English. All added codepoints occur only in CJK text, and the ASCII comma and ASCII brackets stay non-boundaries, so space-delimited scripts are byte-for-byte unchanged.
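The phrase-boundary rule in the summary can be sketched as follows. This is a minimal illustration, not the PR's code: the set names, `isPhraseBoundarySketch`, and `firstPhraseChunk` are hypothetical, and treating every ASCII `.`, `!`, `?` as a boundary is a simplification of whatever the real phrase mode does for English.

```swift
// Hypothetical names; the PR's real declarations are not shown in this thread.
let sketchCJKTerminators: Set<Character> = ["。", "！", "？", "｡"]
let sketchCJKClauseCommas: Set<Character> = ["、", "，", "､"]
let sketchASCIITerminators: Set<Character> = [".", "!", "?"]  // simplified

/// True when `c` ends a phrase: an ASCII or CJK sentence terminator,
/// or a CJK clause comma. ASCII "," is deliberately absent.
func isPhraseBoundarySketch(_ c: Character) -> Bool {
    sketchASCIITerminators.contains(c)
        || sketchCJKTerminators.contains(c)
        || sketchCJKClauseCommas.contains(c)
}

/// Prefix of `tail` up to and including the first phrase boundary,
/// or the whole tail when it contains none.
func firstPhraseChunk(of tail: String) -> Substring {
    guard let i = tail.firstIndex(where: isPhraseBoundarySketch) else {
        return tail[...]
    }
    return tail[...i]
}
```

With the reported tail, the first Tab takes only the clause up to the ideographic comma, while an English tail still runs past its ASCII comma to the period.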

Validation

xcodebuild build-for-testing ... CODE_SIGNING_ALLOWED=NO then test-without-building for the full unit bundle: ** TEST SUCCEEDED **, 993 tests, 0 failures (4 pre-existing skips). New cases cover the reported clause (理解し、 stops at the comma), 。/！/？ stops, the 終わり。」 closer walk-back, the leading-punctuation cliff, the opening-bracket peel (「分かった」と言った → 「), mixed closer-opener runs, halfwidth ､｡｣ parity, auto-accept-off re-peeling, and ASCII comma/bracket non-boundaries for English.
swiftlint lint --quiet on all changed files: exit 0, no findings.
Not yet validated end-to-end on device with a live Japanese session; the chunking rules are pure Support/ logic locked by the unit tests above.

Linked issues

None filed; follow-up to the Japanese IME report behind #668 (same user feedback thread, separate root cause).

Risk / rollout notes

Behavior change for CJK phrase-granularity users: Tab now advances clause by clause (、) and stops at 。, instead of accepting an entire space-less sentence per press. Word granularity gains the punctuation binding (資料、 is one chunk), loses the punctuation cliff, and peels opening brackets as their own chunk.
Generation for CJK text now early-stops at sentence ends via the existing decode stop policy, so Japanese/Chinese suggestions get shorter and cheaper at the source; English generation is unchanged.
Same pre-existing CI caveat as #668 (fix(insertion): IME-safe accept so Japanese/CJK suggestions land on Tab): main requires a cotabbyinference revision that may not be resolvable by CI until the package push lands.
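The early-stop behavior noted above can be pictured as a loop that checks a sentence-end predicate after every appended token. This is only a sketch under stated assumptions: `generateSketch`, its parameters, and the token array are hypothetical stand-ins, and the real decode stop policy in the repository is not shown in this thread.

```swift
/// Appends tokens until the budget is spent or the accumulated text
/// satisfies `isSentenceEnd`. All names here are hypothetical.
func generateSketch(
    tokens: [String],
    budget: Int,
    isSentenceEnd: (String) -> Bool
) -> String {
    var output = ""
    for token in tokens.prefix(budget) {
        output += token
        // Before the fix, a predicate that only knew ASCII terminators
        // never fired for CJK text, so the loop ran to the budget.
        if isSentenceEnd(output) { break }
    }
    return output
}
```

Passing a predicate that recognizes 。 stops the loop at the first Japanese sentence end; a predicate that never fires consumes the whole budget.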

Greptile Summary

This PR fixes CJK (Japanese/Chinese) text acceptance in phrase mode by threading phrase-boundary awareness through three layers: the decode stop policy, the phrase accumulator, and the word-chunk extractor. Before this change, a Japanese phrase could arrive as one giant Tab because every layer only knew ASCII punctuation.

SentenceBoundaryClassifier gains CJK sentence terminators (。！？｡) so the decode stop policy fires at the end of a Japanese sentence instead of running to the token budget, and the closing-punctuation walk-back now skips CJK brackets so 終わり。」 registers as a sentence end.
SuggestionSessionReconciler adds endOfCJKPunctuationRun to bind trailing 、 to the preceding word (読み、 is one chunk), a punctuation-led peel so opening brackets like 「 don't swallow flat quoted runs, and endsAtPhraseBoundary replaces endsInSentenceTerminator to stop the phrase accumulator at ideographic commas as well as terminators.
CJK primitives are consolidated into a single Character extension so a future codepoint addition updates all three policies in one edit.
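The closer walk-back described in this summary can be sketched as a `Character` extension plus one function. The property and function names are hypothetical, the closer set lists only the two corner brackets this thread confirms (the real set may be larger), and ASCII handling is omitted entirely.

```swift
extension Character {
    /// CJK sentence terminators named in the PR: 。！？ and halfwidth ｡.
    var isCJKTerminatorSketch: Bool { "。！？｡".contains(self) }

    /// Closers skipped by the walk-back. Only 」 and halfwidth ｣ are
    /// confirmed by the thread; any further members are unknown here.
    var isCJKCloserSketch: Bool { "」｣".contains(self) }
}

/// Walks back over trailing CJK closers, then reports whether the
/// character before them is a CJK sentence terminator.
func endsAtCJKSentenceEnd(_ text: String) -> Bool {
    var rest = text[...]
    while let last = rest.last, last.isCJKCloserSketch {
        rest = rest.dropLast()
    }
    return rest.last?.isCJKTerminatorSketch ?? false
}
```

Under these assumptions 終わり。」 registers as a sentence end, while a tail ending in the ideographic comma does not.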

Confidence Score: 5/5

Safe to merge; all changes are additive, CJK-only string-parsing logic with no impact on ASCII/space-delimited text, validated by 993 passing unit tests.

All changes are additive CJK-only string-parsing logic, ASCII/space-delimited paths are byte-for-byte unchanged and covered by existing tests, and every new code path has dedicated unit tests including edge cases and policy-flag interactions.

No files require special attention beyond the minor doc-comment omission in SuggestionEngineModels.swift.

Important Files Changed

Filename	Overview
Cotabby/Support/SuggestionSessionReconciler.swift	Core chunking and phrase-accumulation logic updated; adds endOfCJKPunctuationRun, trailing-binding, punctuation-led peel, and endsAtPhraseBoundary. Interaction between CJK binding and autoAcceptTrailingPunctuation=false is intentional and now tested.
Cotabby/Support/SentenceBoundaryClassifier.swift	Adds CJK sentence terminators to the decode-stop switch and extends isSentenceClosingPunctuation to include CJK closers; both changes are additive and well-scoped, with full test coverage.
Cotabby/Models/SuggestionEngineModels.swift	Doc-comment update only; no logic change.
CotabbyTests/SentenceBoundaryClassifierTests.swift	Adds 5 new tests for CJK terminators, closer walk-back, halfwidth parity, and ideographic comma non-termination; all appear correct.
CotabbyTests/SuggestionSessionReconcilerTests.swift	Adds 17 new tests covering all new chunking rules, phrase-boundary stops, halfwidth parity, ASCII non-regression, and autoAcceptTrailingPunctuation interactions; coverage looks thorough.

Reviews (3): Last reviewed commit: "Address Greptile review on #669: single-..."

A Japanese phrase accept arrived as one giant chunk: phrase mode only knew the ASCII terminators (. ! ?), so the ideographic full stop never ended a phrase and the clause comma was no boundary at all, and the same ASCII-only assumption in SentenceBoundaryClassifier meant the decode stop policy never fired for CJK text, so generations always ran to the token budget. A flat Japanese tail also had a punctuation cliff: a chunk that starts with CJK punctuation skips ICU word segmentation (punctuation does not begin a space-less-script word) and swallowed everything up to the next whitespace in a single accept, in word mode too. Phrase boundaries now include the CJK sentence terminators and treat the ideographic and fullwidth commas as clause stops, so Tab advances clause by clause the way Japanese prose reads. Word chunking binds a trailing CJK punctuation run to the word it follows and peels a punctuation-led run as its own chunk, removing the cliff. The classifier recognizes the CJK terminators (which, unlike the ASCII period, are unambiguous) and the CJK closing brackets for its walk-back, so generation stops at the end of a Japanese sentence like it does for English. All added codepoints occur only in CJK text and ASCII "," stays a non-boundary, so space-delimited scripts are byte-for-byte unchanged.

…unctuation

Auditing the CJK chunking surfaced one remaining cliff: a chunk starting at a CJK opening bracket neither begins a space-less-script word nor binds to the preceding one, so a flat quoted run like the tail of 彼は「分かった」と言った was still swallowed whole by the whitespace scan. The punctuation-led peel now takes opening brackets too (the trailing binding still stops before them, since an opener belongs to the next word), and the halfwidth kana forms ､｣｢ join their fullwidth counterparts in the clause, closer, and opener sets, with the halfwidth corner also added to the classifier's closer walk. ASCII brackets and quotes are untouched by the peel (the sets stay CJK-only), locked in by regression tests alongside the opener, mixed-run, and halfwidth cases. Full unit bundle: 993 tests, 0 failures.
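The trailing binding and the punctuation-led peel can be sketched together. Everything named here is hypothetical: the `wordEnd` closure stands in for the ICU word segmentation, which this sketch does not compute, and the sets hold only codepoints mentioned in this thread. How the real code splits a mixed closer-opener run is not stated, so this sketch simply peels the whole leading run.

```swift
// Hypothetical sets built from the codepoints named in this thread.
let sketchTrailing: Set<Character> = ["。", "！", "？", "｡", "、", "，", "､", "」", "｣"]
let sketchOpeners: Set<Character> = ["「", "｢"]

/// Index just past the run of trailing CJK punctuation that starts at
/// `start`. Stops before an opener, which belongs to the next word.
func endOfTrailingRunSketch(in text: String, from start: String.Index) -> String.Index {
    var i = start
    while i < text.endIndex, sketchTrailing.contains(text[i]) {
        i = text.index(after: i)
    }
    return i
}

/// Next chunk for a word-granularity accept.
func nextWordChunkSketch(of tail: String, wordEnd: (String) -> String.Index) -> Substring {
    guard let first = tail.first else { return tail[...] }
    if sketchTrailing.contains(first) || sketchOpeners.contains(first) {
        // Punctuation-led: peel the leading run as its own chunk
        // instead of scanning to the next whitespace.
        var i = tail.startIndex
        while i < tail.endIndex,
              sketchTrailing.contains(tail[i]) || sketchOpeners.contains(tail[i]) {
            i = tail.index(after: i)
        }
        return tail[..<i]
    }
    // Word-led: bind the trailing punctuation run to the word.
    let end = endOfTrailingRunSketch(in: tail, from: wordEnd(tail))
    return tail[..<end]
}
```

With a two-character word end, 読み、 comes back as one chunk, an opener after the word is left for the next accept, and an ASCII comma is never bound.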

The CJK terminator and closer codepoint lists were restated in the reconciler's phrase policy and again in SentenceBoundaryClassifier's closer walk, so adding a codepoint required parallel edits with no compiler enforcement. The two primitive sets are now one internal Character extension in the reconciler file, and both the phrase policy and the classifier compose from them.

FuJacob · 2026-06-11T14:38:01Z

Addressed the review finding: the CJK terminator and closer sets are now declared once (internal Character extension) and composed by both the phrase policy and SentenceBoundaryClassifier, so a future codepoint is a single edit. Local validation on the rebased branch (top of merged #668): xcodegen no drift, swiftlint clean, build-for-testing succeeded, full unit bundle 1001 tests / 0 failures.

greptile-apps Bot reviewed Jun 11, 2026


Comment thread Cotabby/Support/SuggestionSessionReconciler.swift

FuJacob added 3 commits June 11, 2026 07:15

FuJacob force-pushed the fix/cjk-phrase-boundaries branch from 472af39 to 2ab9bb3 Compare June 11, 2026 14:38

FuJacob merged commit d36b29f into main Jun 11, 2026

FuJacob deleted the fix/cjk-phrase-boundaries branch June 11, 2026 14:38
