Word counting with \w+ is inaccurate for non-English languages

**Description**
The current word-counting approach, a regex like `\w+`, is not sufficient for non-English text. It miscounts in many scripts, for example CJK and other languages written without clear word boundaries, and text that uses accented letters or combining characters.

This is **not necessarily a request to change the current behavior**, but rather a note for future reference in case users report inaccuracies.

**Context / Rationale**

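* `\w+` is heavily biased toward ASCII/Latin text. In JavaScript, `\w` matches only `[A-Za-z0-9_]`, so the pattern breaks words at accented letters and matches nothing at all in non-Latin scripts.
* Even Unicode-aware patterns (e.g., `[\p{L}\p{N}]+` with the `u` flag) still break down for languages written without spaces between words, such as Chinese, Japanese, and Thai.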
* The closest modern standard approach in JavaScript is `Intl.Segmenter`, which performs locale-aware text segmentation.

Reference:
[https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter)
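
For future reference, the difference between the two approaches can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names `countWordsRegex` and `countWordsSegmenter` and the sample strings are hypothetical, and the sketch assumes a runtime that ships `Intl.Segmenter` with full ICU data.

```javascript
// Hypothetical helper: regex-based count.
// In JavaScript, \w matches only [A-Za-z0-9_].
function countWordsRegex(text) {
  return (text.match(/\w+/g) ?? []).length;
}

// Hypothetical helper: Intl.Segmenter-based count.
// Only segments flagged as word-like are counted, which skips
// whitespace and punctuation segments.
function countWordsSegmenter(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    if (segment.isWordLike) count += 1;
  }
  return count;
}

// Accented Latin text: \w+ breaks at "ï" and drops "é".
console.log(countWordsRegex("naïve café"));     // 3 ("na", "ve", "caf")
console.log(countWordsSegmenter("naïve café")); // 2

// Japanese text: \w+ finds nothing; the segmenter's count depends on
// the engine's ICU dictionary data, so no exact number is assumed here.
console.log(countWordsRegex("今日は良い天気です")); // 0
console.log(countWordsSegmenter("今日は良い天気です", "ja"));
```

Filtering on `isWordLike` is one possible definition of "word"; a different definition (for example, counting numbers or emoji differently) would change the totals.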

**Notes / Caveats**

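* `Intl.Segmenter` is better, but it is still not “correct” in every linguistic context. For scripts without spaces it relies on dictionary-based heuristics, so results can vary between engines and ICU versions.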
* Any word-counting solution will involve tradeoffs depending on language, locale, and definition of “word.”

**Suggested Action**

* No immediate change required.
* Keep this issue as documentation / justification if word-count accuracy for non-English text becomes a concern in the future.

