experiment using unicode decomposition & regex char ranges by missinglink · Pull Request #45 · tyxla/remove-accents

missinglink · 2023-07-18T18:54:56Z

DRAFT: this is not really intended for merging but instead as a discussion point regarding how we might be able to identify accents without manually enumerating them all.

The tl;dr is:

string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC')

There is a description of the method in #44 (comment) and some more related discussion in #12 (comment). The ranges have been lifted from another project I worked on.
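To make the tl;dr concrete, here is a minimal self-contained sketch. The function name `stripCombiningMarks` is hypothetical, and the regex literal with its `g` flag is an assumption: it inlines the character class quoted later in this thread rather than building it with regenerate as the PR does.

```javascript
// Character class copied from the pattern quoted later in this thread.
// The `g` flag is an assumption so that every mark is removed, not just the first.
const COMBINING_MARKS = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// Hypothetical helper name, not part of the remove-accents API.
function stripCombiningMarks(input) {
  return input
    .normalize('NFKD')            // split base letters from their combining marks
    .replace(COMBINING_MARKS, '') // drop the marks
    .normalize('NFKC');           // recompose whatever is left
}

console.log(stripCombiningMarks('Cr\u00E8me Br\u00FBl\u00E9e')); // Creme Brulee
```

The final NFKC pass only matters for sequences that still have a composed form after the marks are gone (Hangul jamo, for instance); for plain Latin text it is a no-op.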

I'd like to open up a chat about this method, as I think it's quite interesting. All the tests pass except for the one which enumerates a long list of characters.

Interestingly, characters like Ø don't decompose with this method. What does that mean exactly? Does it mean that converting it to a Latin capital O is correct or incorrect?
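The difference can be seen by listing the code points that NFKD produces. This is only an illustrative sketch (the helper name `codePointsAfterNFKD` is hypothetical): é splits into a base letter plus a combining acute accent, whereas Ø (U+00D8) has no decomposition mapping in Unicode, so it comes back unchanged and there is no mark to strip.

```javascript
// Hypothetical helper: list the code points of a string after NFKD normalization.
function codePointsAfterNFKD(input) {
  return Array.from(input.normalize('NFKD'), (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

console.log(codePointsAfterNFKD('\u00E9')); // [ 'U+0065', 'U+0301' ]  (e + combining acute)
console.log(codePointsAfterNFKD('\u00D8')); // [ 'U+00D8' ]            (Ø stays as-is)
```

This shows only that Unicode treats the stroke as part of the letter rather than as a combining mark; it does not settle whether mapping Ø to O is the right behaviour for this library.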

# remove accents from string
not ok 1 should be equivalent
  ---
    operator: deepEqual
    expected: |-
      'AAAAAAAAAEAACCEEEEEEEEIIIIIDNOOOOOOOOOUUUUYaaaaaaaaaeaacceeeeeeeeiiiiinooooooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgGgHhHhIiIiIiIiIiIJijJjKkKkLlLlLlLlllMmNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuUuUuAaAEaeOodTHthPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeEeHhIiIiMmOoQqUuXxZzss'
    actual: |-
      'AAAAAAAAÆAACCEEEEEEEEIIIIIÐNOOOOOØOOOUUUUYaaaaaaaaæaacceeeeeeeeiiiiinoooooøooouuuuyyAaAaAaCcCcCcCcDdĐđEeEeEeEeEeGgGgGgGgGgHhĦħIiIiIiIiIıIJijJjKkKkLlLlLlL·l·ŁłMmNnNnNnʼnOoOoOoŒœRrRrRrSsSsSsSsTtTtŦŧUuUuUuUuUuUuWwWwYyYZzZzZzsƒOoUuAaIiOoUuUuUuUuUuUuUuAaÆæØøðÞþPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeƐɛHhIiƗɨMmOoQqUuXxZzß'

edit: sorry about the formatting, my editor seems to have automatically applied the Standard JS style. I can revert those changes if we decide to proceed with it.

missinglink · 2023-07-18T19:04:40Z

Note that the regenerate dependency can be removed in favour of the pattern it generates, namely:

[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]
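As a rough sketch of what using that pattern directly might look like (the constant name and the `g` flag are assumptions, not taken from the PR):

```javascript
// The character class above as a regex literal; no regenerate dependency needed.
const COMBINING_MARKS = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// U+0301 (combining acute accent) falls inside the first range.
console.log('e\u0301'.replace(COMBINING_MARKS, '')); // e

// A precomposed é is a single code point and is not matched until it is decomposed.
console.log('\u00E9'.replace(COMBINING_MARKS, '') === '\u00E9'); // true

// The class also covers U+200D (zero width joiner) and U+FE00-U+FE0F
// (variation selectors), so an emoji variation selector is stripped too.
console.log('\u2764\uFE0F'.replace(COMBINING_MARKS, '') === '\u2764'); // true
```

The last case is worth keeping in mind: because the class includes the zero width joiner and the variation selectors, it affects emoji sequences as well as accented letters.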

tyxla · 2023-07-20T11:59:48Z

Thanks for the PR, @missinglink 🙌

> Interestingly characters like Ø don't decompose with this method, what does that mean exactly? does it mean that converting it to a latin capital O is correct or incorrect?

Well, we intentionally added replacements for these characters, and it's a good example of why such a library is preferred to using String.prototype.normalize().

That being said, I'd welcome a simplification of the current approach that supports all current characters that we replace.
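One possible shape for such a simplification, sketched under assumptions: strip combining marks via normalization first, then fall back to a small explicit map for characters that do not decompose. The names `FALLBACKS` and `removeAccentsHybrid` are hypothetical, and the map below is an illustrative subset drawn from the failing test output above, not the library's full replacement table.

```javascript
const COMBINING_MARKS = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// Illustrative subset only: characters that normalization leaves untouched.
const FALLBACKS = new Map([
  ['\u00C6', 'AE'], ['\u00E6', 'ae'], // Æ æ
  ['\u00D8', 'O'],  ['\u00F8', 'o'],  // Ø ø
  ['\u00D0', 'D'],  ['\u00F0', 'd'],  // Ð ð
  ['\u00DE', 'TH'], ['\u00FE', 'th'], // Þ þ
  ['\u0152', 'OE'], ['\u0153', 'oe'], // Œ œ
  ['\u00DF', 'ss'],                   // ß
]);

// Hypothetical function: normalization first, explicit map second.
function removeAccentsHybrid(input) {
  const stripped = input
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC');
  // Array.from iterates by code point, so astral characters stay intact.
  return Array.from(stripped, (ch) =>
    FALLBACKS.has(ch) ? FALLBACKS.get(ch) : ch
  ).join('');
}

console.log(removeAccentsHybrid('\u00C6r\u00F8sk\u00F8bing')); // AEroskobing
console.log(removeAccentsHybrid('Stra\u00DFe'));               // Strasse
```

Running the map after normalization means characters such as Ǿ (which decomposes to Ø plus a combining acute) are covered by the single Ø entry, so the explicit table would only need the non-decomposing base characters.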

refactor: experiment using unicode decomposition+recomposition and th…

0a2af9a

…e regenerate lib for pattern matching

missinglink wants to merge 1 commit into tyxla:master from missinglink:regenerate