fix(deburr): handle ligatures and special Latin characters #199
Merged
The naive NFD normalization approach failed for characters that don't
decompose via NFD, such as æ, œ, ß, ø, ð, þ, ł, and others from
Latin-1 Supplement and Latin Extended-A blocks.
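To illustrate the failure mode, here is a minimal sketch of an NFD-only approach. The name `naiveDeburr` is hypothetical and the body is an assumption about what "naive NFD normalization" looks like, not the library's previous source. Accented letters decompose into a base letter plus a combining mark that can be stripped, while ß has no canonical decomposition and passes through unchanged.

```typescript
// Hypothetical sketch of an NFD-only deburr (assumed shape, not the old library code).
function naiveDeburr(input: string): string {
  // Decompose, then strip combining diacritical marks (U+0300 to U+036F).
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// U+00E9 (é) decomposes to "e" + U+0301, so the mark can be removed.
const accented = naiveDeburr("caf\u00e9");

// U+00DF (ß) has no canonical decomposition, so it is left untouched.
const ligature = naiveDeburr("Stra\u00dfe");

console.log(accented, ligature);
```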
Added a sparse array lookup (31 entries indexed by char code) for these
characters, combined with a single-pass loop that both skips combining
marks and replaces ligatures — eliminating the regex in the hot path.
This makes deburr both correct and ~10% faster than the old version,
and ~2.4x faster than lodash.
Examples that now work:
deburr('Hællæ, hva skjera?') // => 'Haellae, hva skjera?'
deburr('Straße') // => 'Strasse'
deburr('Œuvre') // => 'OEuvre'
Also fixes kebabCase, camelCase, snakeCase, pascalCase, and titleCase
which all use deburr internally.
Problem
deburr used naive NFD normalization, which only handles characters with decomposed forms (e.g. é → e). It completely failed for ligatures and special Latin letters that don't decompose via NFD, such as æ, œ, ß, and ø.
This also affected kebabCase, camelCase, snakeCase, pascalCase, and titleCase, which all use deburr internally.
Solution
Added a sparse array lookup (31 entries indexed by char code) covering every character in Latin-1 Supplement (U+00C0–U+00FF) and Latin Extended-A (U+0100–U+017F) that doesn't decompose via NFD.
Combined with a single-pass loop that both skips combining marks and replaces ligatures, this leaves no regex in the hot path.
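The approach can be sketched roughly as follows. This is an illustration under assumptions, not the code in deburr.ts: the names `LIGATURES` and `deburrSketch` are hypothetical, only a handful of the 31 entries are shown, and the replacements chosen for Ø, ø, Ł, and ł are assumed rather than taken from the PR.

```typescript
// Hypothetical sparse lookup indexed by char code (partial; the PR's table has 31 entries).
const LIGATURES: string[] = [];
LIGATURES[0x00c6] = "AE"; // Æ
LIGATURES[0x00e6] = "ae"; // æ
LIGATURES[0x00df] = "ss"; // ß
LIGATURES[0x0152] = "OE"; // Œ
LIGATURES[0x0153] = "oe"; // œ
LIGATURES[0x00d8] = "O"; // Ø (assumed mapping)
LIGATURES[0x00f8] = "o"; // ø (assumed mapping)
LIGATURES[0x0141] = "L"; // Ł (assumed mapping)
LIGATURES[0x0142] = "l"; // ł (assumed mapping)

function deburrSketch(input: string): string {
  const decomposed = input.normalize("NFD");
  let result = "";
  for (let i = 0; i < decomposed.length; i++) {
    const code = decomposed.charCodeAt(i);
    // Skip combining diacritical marks (U+0300 to U+036F) produced by NFD.
    if (code >= 0x0300 && code <= 0x036f) continue;
    // Holes in the sparse array read as undefined, so unmapped characters pass through.
    const replacement = LIGATURES[code];
    result += replacement !== undefined ? replacement : decomposed.charAt(i);
  }
  return result;
}

console.log(deburrSketch("Stra\u00dfe"));
```

Indexing an array by char code keeps the per-character work to one range check and one array read, which is presumably why no regex is needed inside the loop.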
Performance
Benchmarked with warmup (3 runs, consistent results).
The new implementation is both more correct and about 10% faster than the old version, and about 2.4x faster than lodash.
Changes
package/src/string/deburr.ts: Rewrote with sparse array ligature map + NFD single-pass loop
package/test/string/deburr.test.ts: Added tests for Latin-1 Supplement ligatures, Latin Extended-A, and mixed diacritics+ligatures
benchmark/string/deburr.bench.ts: Updated test charset to include ligatures