experiment using unicode decomposition & regex char ranges #45

Draft
missinglink wants to merge 1 commit into tyxla:master from missinglink:regenerate

missinglink commented Jul 18, 2023

DRAFT: this is not really intended for merging but instead as a discussion point regarding how we might be able to identify accents without manually enumerating them all.

The tl;dr is:

string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC')

There is a description of the method in #44 (comment) and some more related discussion in #12 (comment). The ranges have been lifted from another project I worked on.

I'd like to open up a chat about this method, which I think is quite interesting. All the tests pass except for the one that enumerates a long list of characters.

Interestingly, characters like Ø don't decompose with this method. What does that mean exactly? Does it mean that converting it to a Latin capital O is correct or incorrect?
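
To make the difference concrete, here is a small sketch that prints the code points a character turns into under NFKD. The helper name `codePoints` is made up for illustration and is not part of this PR. An accented letter such as é splits into a base letter plus a combining mark, while Ø comes back unchanged because Unicode defines no decomposition mapping for it.

```javascript
// Hypothetical helper (not part of the PR): list the code points of a
// string after compatibility decomposition (NFKD), as 4-digit hex.
const codePoints = (s) =>
  Array.from(s.normalize('NFKD'), (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );

// U+00E9 (é) decomposes into "e" followed by U+0301 COMBINING ACUTE ACCENT.
console.log(codePoints('\u00E9')); // [ '0065', '0301' ]

// U+00D8 (Ø) has no decomposition, so it is returned as-is.
console.log(codePoints('\u00D8')); // [ '00D8' ]
```

So the stroke in Ø is treated by Unicode as part of the letter rather than as a removable mark, which is why a regex over combining marks never sees it.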

# remove accents from string
not ok 1 should be equivalent
  ---
    operator: deepEqual
    expected: |-
      'AAAAAAAAAEAACCEEEEEEEEIIIIIDNOOOOOOOOOUUUUYaaaaaaaaaeaacceeeeeeeeiiiiinooooooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgGgHhHhIiIiIiIiIiIJijJjKkKkLlLlLlLlllMmNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuUuUuAaAEaeOodTHthPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeEeHhIiIiMmOoQqUuXxZzss'
    actual: |-
      'AAAAAAAAÆAACCEEEEEEEEIIIIIÐNOOOOOØOOOUUUUYaaaaaaaaæaacceeeeeeeeiiiiinoooooøooouuuuyyAaAaAaCcCcCcCcDdĐđEeEeEeEeEeGgGgGgGgGgHhĦħIiIiIiIiIıIJijJjKkKkLlLlLlL·l·ŁłMmNnNnNnʼnOoOoOoŒœRrRrRrSsSsSsSsTtTtŦŧUuUuUuUuUuUuWwWwYyYZzZzZzsƒOoUuAaIiOoUuUuUuUuUuUuUuAaÆæØøðÞþPpSsXxГгКкAaEeIiNnOoOoUuWwYyAaEeIiOoRrUuAAAEaaaeCcHhKkMmNnPpRrTtVvXxYyAEIOaeioRrUustSTBbFfGgHhJjKkMmPpQqSsVvWwXxYyAaBbDdEeƐɛHhIiƗɨMmOoQqUuXxZzß'

edit: sorry about the formatting, my editor seems to have automatically applied the Standard JS style. I can revert those changes if we decide to proceed with it.

@missinglink

Author

Note that the regenerate dependency can be removed in favour of the pattern it generates, namely:

[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]
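
For reference, a minimal sketch of how that pattern could be combined with the tl;dr snippet without the regenerate dependency. The function name `removeAccents` is an assumption made for illustration. `COMBINING_MARKS` is the name used in the snippet above, here defined as the literal character class with the global flag.

```javascript
// The character class quoted above, as a global regex so that every
// combining mark in the string is removed, not just the first one.
const COMBINING_MARKS =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// Illustrative wrapper (name assumed): decompose, drop the marks, recompose.
function removeAccents(string) {
  return string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC');
}

console.log(removeAccents('Cr\u00E8me Br\u00FBl\u00E9e')); // Creme Brulee
console.log(removeAccents('\u00D8') === '\u00D8');        // true: Ø is left untouched
```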

tyxla commented Jul 20, 2023

Owner

Thanks for the PR, @missinglink 🙌

Interestingly, characters like Ø don't decompose with this method. What does that mean exactly? Does it mean that converting it to a Latin capital O is correct or incorrect?

Well, we intentionally added replacements for these characters, and it's a good example of why such a library is preferred to using String.normalize().

That being said, I'd welcome a simplification of the current approach that supports all current characters that we replace.
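
One possible shape for such a simplification, sketched under assumptions: keep the normalize-and-strip step for everything that decomposes, and fall back to a small explicit table for characters that do not. The `FALLBACK` table and the function name `removeAccentsHybrid` are hypothetical. The entries are only a partial sample read off the failing test output above, not the library's actual mapping.

```javascript
const COMBINING_MARKS =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u200D\u20D0-\u20FF\u3099\u309A\uFE00-\uFE0F\uFE20-\uFE2F]/g;

// Hypothetical, partial fallback table for characters that have no
// Unicode decomposition. A real table would need every such character
// the library currently replaces.
const FALLBACK = new Map([
  ['\u00D8', 'O'],  // Ø
  ['\u00F8', 'o'],  // ø
  ['\u00C6', 'AE'], // Æ
  ['\u00E6', 'ae'], // æ
  ['\u00D0', 'D'],  // Ð
  ['\u00F0', 'd'],  // ð
  ['\u00DE', 'TH'], // Þ
  ['\u00FE', 'th'], // þ
  ['\u00DF', 'ss'], // ß
  ['\u0152', 'OE'], // Œ
  ['\u0153', 'oe'], // œ
]);

function removeAccentsHybrid(string) {
  // Step 1: strip everything that decomposes into base + combining marks.
  const stripped = string
    .normalize('NFKD')
    .replace(COMBINING_MARKS, '')
    .normalize('NFKC');

  // Step 2: map the remaining non-decomposing characters explicitly.
  let result = '';
  for (const ch of stripped) {
    result += FALLBACK.has(ch) ? FALLBACK.get(ch) : ch;
  }
  return result;
}

console.log(removeAccentsHybrid('S\u00F8ren'));  // Soren
console.log(removeAccentsHybrid('Stra\u00DFe')); // Strasse
```

With this split, the explicit list only has to cover the characters that fail in the test output, rather than every accented letter.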
