Open
Description
The current word-counting approach, which uses a regex like \w+, is not sufficient for non-English text. It fails on many scripts, e.g. CJK languages, other languages written without clear word boundaries, and scripts that rely on combining characters.
This is not necessarily a request to change the current behavior, but rather a note for future reference in case users report inaccuracies.
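To make the failure mode concrete, here is a minimal sketch of a regex-based counter. The helper name countWordsRegex is hypothetical and is not taken from the project's code. It only assumes the approach is roughly a match on \w+, which in JavaScript covers just ASCII letters, digits, and underscore.

```javascript
// Hypothetical helper mirroring a \w+ based word count (not the project's actual code).
function countWordsRegex(text) {
  // In JavaScript, \w matches only [A-Za-z0-9_].
  const matches = text.match(/\w+/g);
  return matches ? matches.length : 0;
}

console.log(countWordsRegex("hello world")); // 2
console.log(countWordsRegex("naïve café"));  // 3, although the phrase has two words
console.log(countWordsRegex("你好世界"));     // 0, no ASCII word characters at all
```

Accented Latin text is over-counted because the accented letters act as separators, while text with no ASCII characters is not counted at all.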
Context / Rationale
- \w+ is heavily biased toward ASCII/Latin scripts.
- Even Unicode-aware regex approaches still break down across languages.
- The closest modern standard approach in JavaScript is Intl.Segmenter, which performs locale-aware text segmentation.
Reference:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
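For future reference, a word count built on Intl.Segmenter might look roughly like the sketch below. The helper name countWordsSegmenter and the default locale "en" are assumptions for illustration. It counts only segments that the segmenter flags as word-like, and it requires a runtime that ships Intl.Segmenter.

```javascript
// Hypothetical helper: counts word-like segments using Intl.Segmenter.
// The default locale "en" is an assumption for illustration.
function countWordsSegmenter(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (segment.isWordLike) count += 1;
  }
  return count;
}

console.log(countWordsSegmenter("Hello, world!")); // 2
// For scripts without spaces, the result depends on the runtime's segmentation data.
console.log(countWordsSegmenter("你好世界", "zh"));
```

Filtering on isWordLike is one possible definition of "word"; whether it matches what users expect is a per-language judgment call.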
Notes / Caveats
- Intl.Segmenter is better, but still not universally “correct” in all linguistic contexts.
- Any word-counting solution will involve tradeoffs depending on language, locale, and definition of “word.”
Suggested Action
- No immediate change required.
- Keep this issue as documentation / justification if word-count accuracy for non-English text becomes a concern in the future.