Skip to content

stella/fuzzy-search

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

stella

@stll/fuzzy-search

NAPI-RS approximate substring matching for Node.js and Bun. Finds near-matches within edit distance k with stable UTF-16 offsets, replace-safe match ranges, and optional diacritics normalization.

Built on Myers' bit-parallel algorithm (1999), implemented in Rust and exposed to JavaScript via NAPI-RS.

Install

npm install @stll/fuzzy-search
# or
bun add @stll/fuzzy-search

The companion @stll/fuzzy-search-wasm package is available for browser builds.

If you use the browser package with Vite, import the bundled plugin so the generated WASM loader is not pre-bundled into broken asset URLs:

import { defineConfig } from "vite";
import stllFuzzySearchWasm from "@stll/fuzzy-search-wasm/vite";

export default defineConfig({
  plugins: [stllFuzzySearchWasm()],
});

Prebuilts are available for:

Platform Architecture
macOS x64, arm64
Linux (glibc) x64, arm64
WASM browser

Usage

import { FuzzySearch } from "@stll/fuzzy-search";

const fs = new FuzzySearch(
  [
    { pattern: "Gaislerová", distance: 1 },
    { pattern: "Novák", distance: 1 },
    { pattern: "Příbram", distance: 2 },
  ],
  {
    normalizeDiacritics: true,
    wholeWords: true,
  },
);

fs.findIter("Smlouva s Gais1erová v Pribram");
// [
//   { pattern: 0, start: 10, end: 20,
//     text: "Gais1erová", distance: 1 },
//   { pattern: 2, start: 23, end: 30,
//     text: "Pribram", distance: 0 },
// ]

Patterns

Patterns can be strings (default distance 1) or objects with explicit distance and optional name:

const fs = new FuzzySearch([
  "simple", // distance 1
  { pattern: "named", name: "entity" }, // distance 1
  { pattern: "precise", distance: 2 }, // distance 2
]);

Distance must be less than pattern length.

Options

const fs = new FuzzySearch(patterns, {
  // Strip diacritics before matching (NFD + remove
  // combining marks). "Příbram" matches "Pribram"
  // at distance 0.
  normalizeDiacritics: true, // default: false

  // Only match whole words. Uses Unicode
  // is_alphanumeric() for boundary detection.
  // CJK characters always pass (no inter-word
  // spaces in CJK).
  wholeWords: true, // default: true

  // Case-insensitive matching (Unicode-aware).
  caseInsensitive: true, // default: false

  // Unicode word boundaries (reserved for future
  // UAX#29 segmentation support).
  unicodeBoundaries: true, // default: true

  // Drop matches whose score is below threshold.
  // Score = 1 - distance / pattern.length.
  // Inclusive (score >= minScore keeps the match).
  minScore: 0.7,

  // Return only the top k matches by score, across
  // all patterns. Tie-broken by start, then pattern.
  kBest: 5,
});

Scored output

Every match carries a normalized score in [0, 1], computed as 1 - distance / pattern.length and clamped at 0. Pair it with minScore and kBest for top-N ranking without a follow-up sort:

const fs = new FuzzySearch(
  [
    { pattern: "Novák", distance: 2 },
    { pattern: "Gaislerová", distance: 2 },
  ],
  { wholeWords: true, minScore: 0.7, kBest: 3 },
);

fs.findIter("Nowák a Gais1erova");
// [
//   { pattern: 0, text: "Nowák", distance: 1, score: 0.8, ... },
//   { pattern: 1, text: "Gais1erova", distance: 2, score: 0.8, ... },
// ]

replaceAll always replaces every distance-qualified match and ignores minScore / kBest, so the replacements-by-pattern contract stays deterministic.

Replace

fs.replaceAll("Smlouva s Gais1erová", [
  "[REDACTED]",
  "[REDACTED]",
  "[REDACTED]",
]);
// "Smlouva s [REDACTED]"

replacements[i] replaces pattern i.

Distance helper

import { distance } from "@stll/fuzzy-search";

distance("kitten", "sitting"); // 3
distance("abcd", "abdc", "damerau-levenshtein"); // 1

Benchmarks

The repository includes a checked-in benchmark harness for synthetic and corpus-based searches. The inputs are public and the scripts are reproducible from the repo. Run them locally:

bun run bench:install
bun run bench:download
bun run bench:speed
bun run bench:correctness

The speed harness compares practical JS ecosystem alternatives, but not every comparator implements the same exact semantics. @stll/fuzzy-search is solving approximate substring search with offsets and replacement-friendly match ranges; tools like fuse.js and fuzzball are included as reference points, not as exact drop-in equivalents. The headline comparisons in this repo are the substring-mode rows against sliding-window Levenshtein baselines.

Representative baseline from the checked-in public harness on this machine:

  • runtime: Bun 1.3.12
  • platform: macOS 26.4.1 (Darwin arm64)
Scenario @stll/fuzzy-search Sliding-window JS baseline Relative
Czech legal, 64 KB, 5 names 2.41 ms 80.78 ms 33.5x
Bible, 4.0 MB, 5 names 239.91 ms 3903.26 ms 16.3x
Czech news, 4.8 MB, 5 names 262.39 ms 4350.52 ms 16.6x
German news, 5.5 MB, 5 names 405.72 ms 6816.03 ms 16.8x

These rows are substring mode (wholeWords: false) with edit distance 1-2, which is the core workload this package is designed for.

Alternatives tested
  • fastest-levenshtein + sliding window — fastest JS Levenshtein distance
  • fuse.js — fuzzy search (scoring, not substring matching)
  • fuzzball — Python rapidfuzz port
  • naive JS — O(nm) Levenshtein per window position

Correctness

Correctness is covered by example-based tests and property tests. The property suite verifies distance bounds, oracle agreement, whole-word boundaries, UTF-16 offset stability, normalization behavior, and mixed option combinations over randomized inputs.

API

Method Returns Description
new FuzzySearch(patterns, options?) instance Build matcher
.findIter(haystack) FuzzyMatch[] Non-overlapping matches
.isMatch(haystack) boolean Any pattern matches?
.replaceAll(haystack, replacements) string Replace matched patterns
.patternCount number Number of patterns

Types

type PatternEntry =
  | string
  | { pattern: string; distance?: number; name?: string };

type Options = {
  normalizeDiacritics?: boolean; // default: false
  wholeWords?: boolean; // default: true
  caseInsensitive?: boolean; // default: false
  unicodeBoundaries?: boolean; // default: true
  minScore?: number; // drop matches below threshold
  kBest?: number; // top-k by score, ties by start
};

type FuzzyMatch = {
  pattern: number; // index into patterns array
  start: number; // UTF-16 code unit offset
  end: number; // exclusive
  text: string; // matched substring
  distance: number; // actual Levenshtein distance
  score: number; // 1 - distance/pattern.length
  name?: string; // pattern name (if provided)
};

Match offsets are UTF-16 code unit indices, compatible with String.prototype.slice().

Error handling

  • Constructor throws if a pattern is empty, longer than 64 characters, or has distance >= pattern length.
  • replaceAll throws if replacements.length does not equal patternCount.

How it works

  1. Myers' bit-parallel algorithm scans the text in O(n) per pattern for patterns up to 64 characters. No DFA construction, no state explosion at higher distances.

  2. Start position recovery via small-window Levenshtein: for each match end position from Myers, a window of [m-k, m+k] characters is evaluated to find the exact start and distance.

  3. Diacritics normalization: NFD decomposition + combining mark stripping (Unicode General Category M via unicode-normalization crate). Covers all scripts.

  4. UTF-16 offset translation: character-level matching with incremental char→UTF-16 mapping for JS string compatibility.

Limitations

  • Pattern length capped at 64 characters. Myers uses a single u64 bit-vector per pattern. Longer patterns would need multi-word vectors (not yet implemented).
  • No streaming API. The full haystack must be in memory. For chunked processing, use @stll/aho-corasick's StreamMatcher for exact prefiltering and fuzzy-search on flagged regions.
  • WASM requires SharedArrayBuffer. Browser builds need Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers.

Development

bun install
bun run build           # native module (requires Rust)
bun test                # 36 unit tests
bun run test:props      # 36 property tests × 1000 runs

bun run bench:install   # benchmark dependencies
bun run bench:download  # download corpora
bun run bench:speed     # speed comparison
bun run bench:correctness  # oracle verification

bun run lint            # oxlint
bun run format          # oxfmt + rustfmt

License

MIT