stella · jan-kubica · Jun 12, 2026 · chatgpt-codex-connector · Jun 12, 2026 · gemini-code-assist
diff --git a/README.md b/README.md
@@ -41,6 +41,17 @@ echo "Contact Jan Novák at jan.novak@example.com" | bunx @stll/anonymize-cli
 - NER, coreference handling, and confidence boosting
 - Native, browser, and Vite-compatible entrypoints
 
+## Benchmarks
+
+[`packages/bench`](packages/bench) holds reproducible throughput and
+quality benchmarks for the deterministic pipeline, plus comparison
+runs of Microsoft Presidio and compromise on the same legal-contract
+corpus scored by the same scorer. See
+[`packages/bench/results/RESULTS.md`](packages/bench/results/RESULTS.md)
+for current numbers and
+[`packages/bench/README.md`](packages/bench/README.md) for the
+methodology and its limits.
+
 ## Development
 
 ```bash
@@ -70,3 +81,4 @@ bun run hooks:install
 - [`packages/anonymize`](packages/anonymize)
 - [`packages/data`](packages/data)
 - [`packages/anonymize/wasm`](packages/anonymize/wasm)
+- [`packages/bench`](packages/bench)
diff --git a/bun.lock b/bun.lock
diff --git a/packages/bench/README.md b/packages/bench/README.md
@@ -97,6 +97,65 @@ bun run bench:quality -- --predictions path/to/predictions.json \
 bun run bench:render
 ```
 
+## Comparison runs
+
+Committed results include two external tools run on the same corpus
+and scored by the same scorer. Each run is restricted (via
+`--labels`) to the labels that tool claims to detect, so micro
+averages are not comparable across tools with different label
+filters; compare per label instead.
+
+### Microsoft Presidio
+
+`comparison/presidio/run.py` (pinned deps in `requirements.txt`)
+runs `presidio-analyzer` with its documented spaCy defaults
+(`en_core_web_lg`, `de_core_news_lg`) and writes the interchange
+format. Scored labels: person, organization, email address, phone
+number, date.
+
+Read the numbers with these caveats:
+
+- **Czech is skipped entirely.** Presidio has no Czech language
+  support, so 8 of 13 corpus documents cannot be processed at all.
+- **Organizations are enabled deliberately.** Presidio ignores
+  spaCy `ORG` spans by default because they are noisy; the run
+  enables them because organizations are unavoidable in legal
+  contracts. The resulting false-positive count shows why the
+  default exists.
+- **`DATE_TIME` is broader than the reference `date` label** (it
+  also matches durations and relative time), which depresses
+  Presidio's date precision; this is a label-mapping asymmetry, not
+  purely a detection failure.
+- Labels for which Presidio has no recognizers on this corpus
+  (registration numbers, tax identifiers, monetary amounts,
+  addresses as street-level spans) are excluded rather than scored
+  as zero.
+
+Reproduce:
+
+```sh
+python3 -m venv .venv && .venv/bin/pip install -r comparison/presidio/requirements.txt
+.venv/bin/python -m spacy download en_core_web_lg
+.venv/bin/python -m spacy download de_core_news_lg
+.venv/bin/python comparison/presidio/run.py
+bun src/run-quality.ts --predictions results/predictions.presidio.json \
+  --labels "person,organization,email address,phone number,date"
+bun run bench:render
+```
+
+### compromise
+
+`src/run-compromise.ts` runs the compromise NLP library (the
+closest JS-ecosystem baseline that reports spans) on the English
+documents only; scored labels: person, organization.
+
+```sh
+bun src/run-compromise.ts
+bun src/run-quality.ts --predictions results/predictions.compromise.json \
+  --labels "person,organization"
+bun run bench:render
+```
+
 ## Throughput methodology
 
 One-time costs (dictionary load, search automaton preparation) are

diff --git a/packages/bench/comparison/presidio/requirements.txt b/packages/bench/comparison/presidio/requirements.txt
@@ -0,0 +1,5 @@
+presidio-analyzer==2.2.359
+spacy==3.8.13
+# Models (installed via `python -m spacy download <name>`):
+#   en_core_web_lg
+#   de_core_news_lg
diff --git a/packages/bench/comparison/presidio/run.py b/packages/bench/comparison/presidio/run.py
@@ -0,0 +1,112 @@
+"""Runs Microsoft Presidio over the bench contract corpus and writes
+predictions in the bench interchange format (packages/bench/README.md).
+
+Czech fixtures are skipped: Presidio has no Czech language support
+(no spaCy model and no Czech recognizers); that absence is reported
+in the results rather than scored as zero.
+
+Offsets are converted from Python code-point indices to UTF-16 code
+units to match the reference annotations.
+
+Usage:
+  python run.py [--out ../../results/predictions.presidio.json]
+"""
+
+import argparse
+import json
+from pathlib import Path
+
+from presidio_analyzer import AnalyzerEngine
+from presidio_analyzer.nlp_engine import NlpEngineProvider
+
+LANGUAGE_MODELS = {"en": "en_core_web_lg", "de": "de_core_news_lg"}
+
+LABEL_MAP = {
+    "PERSON": "person",
+    "ORGANIZATION": "organization",
+    "EMAIL_ADDRESS": "email address",
+    "PHONE_NUMBER": "phone number",
+    "DATE_TIME": "date",
+}
+
+FIXTURES_DIR = (
+    Path(__file__).resolve().parents[3]
+    / "anonymize"
+    / "src"
+    / "__test__"
+    / "fixtures"
+    / "contracts"
+)
+DEFAULT_OUT = (
+    Path(__file__).resolve().parents[2] / "results" / "predictions.presidio.json"
+)
+
+
+def utf16_offsets(text: str) -> list[int]:
+    """Cumulative UTF-16 code-unit offset for each code-point index."""
+    offsets = [0] * (len(text) + 1)
+    for index, char in enumerate(text):
+        offsets[index + 1] = offsets[index] + (2 if ord(char) > 0xFFFF else 1)
+    return offsets
+
+
+def build_analyzer() -> AnalyzerEngine:
+    configuration = {
+        "nlp_engine_name": "spacy",
+        "models": [
+            {"lang_code": lang, "model_name": model}
+            for lang, model in LANGUAGE_MODELS.items()
+        ],
+        # Default Presidio config ignores ORG spans from spaCy; the
+        # comparison needs organizations, so keep only the truly
+        # non-PII tags ignored.
+        "ner_model_configuration": {
+            "labels_to_ignore": ["CARDINAL", "ORDINAL", "QUANTITY", "PERCENT"],
+        },
+    }
+    provider = NlpEngineProvider(nlp_configuration=configuration)
+    return AnalyzerEngine(
+        nlp_engine=provider.create_engine(),
+        supported_languages=list(LANGUAGE_MODELS),
+    )
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--out", type=Path, default=DEFAULT_OUT)
+    args = parser.parse_args()
+
+    analyzer = build_analyzer()
+    docs = []
+    for language_dir in sorted(FIXTURES_DIR.iterdir()):
+        language = language_dir.name
+        if language not in LANGUAGE_MODELS:
+            print(f"skipping {language}: no Presidio language support")
+            continue
-    for language_dir in sorted(FIXTURES_DIR.iterdir()):
-        language = language_dir.name
-        if language not in LANGUAGE_MODELS:
-            print(f"skipping {language}: no Presidio language support")
-            continue
+    for language_dir in sorted(FIXTURES_DIR.iterdir()):
+        if not language_dir.is_dir():
+            continue
+        language = language_dir.name
+        if language not in LANGUAGE_MODELS:
+            print(f"skipping {language}: no Presidio language support")
+            continue
+        for fixture in sorted(language_dir.glob("*.txt")):
+            text = fixture.read_text(encoding="utf-8").replace("\r\n", "\n")
+            offsets = utf16_offsets(text)
+            results = analyzer.analyze(
+                text=text, language=language, entities=list(LABEL_MAP)
+            )
+            entities = [
+                {
+                    "start": offsets[result.start],
+                    "end": offsets[result.end],
+                    "label": LABEL_MAP[result.entity_type],
+                }
+                for result in results
+            ]
+            docs.append({"id": f"{language}/{fixture.name}", "entities": entities})
+            print(f"{language}/{fixture.name}: {len(entities)} entities")
+
+    args.out.parent.mkdir(parents=True, exist_ok=True)
+    args.out.write_text(
+        json.dumps({"tool": "presidio", "docs": docs}, indent=2) + "\n",
+        encoding="utf-8",
+    )
+    print(f"written: {args.out}")
+
+
+if __name__ == "__main__":
+    main()
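
The UTF-16 conversion in `utf16_offsets` only changes anything when a fixture contains characters outside the Basic Multilingual Plane, which occupy one Python code point but two UTF-16 code units. A minimal sketch of the effect, reusing the function body from `run.py` above; the `to_utf16_span` helper and the sample string are hypothetical and not part of this PR:

```python
def utf16_offsets(text: str) -> list[int]:
    """Cumulative UTF-16 code-unit offset for each code-point index."""
    offsets = [0] * (len(text) + 1)
    for index, char in enumerate(text):
        offsets[index + 1] = offsets[index] + (2 if ord(char) > 0xFFFF else 1)
    return offsets


def to_utf16_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Hypothetical helper: map a code-point span to UTF-16 code units."""
    offsets = utf16_offsets(text)
    return offsets[start], offsets[end]


# Hypothetical sample: one astral character (U+1D54F) precedes the name.
sample = "\U0001D54F Jan Novák"
start = sample.index("Jan")
end = start + len("Jan Novák")
span = to_utf16_span(sample, start, end)

# Cross-check against Python's own UTF-16 encoder (2 bytes per code unit).
expected_start = len(sample[:start].encode("utf-16-le")) // 2
expected_end = len(sample[:end].encode("utf-16-le")) // 2
assert span == (expected_start, expected_end)
```

Every offset after the astral character shifts by one code unit, so skipping the conversion would turn exact matches into boundary errors against reference annotations that use UTF-16 offsets.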
diff --git a/packages/bench/package.json b/packages/bench/package.json
@@ -21,6 +21,7 @@
   "devDependencies": {
     "@types/node": "^25.9.2",
     "bun-types": "^1.3.14",
+    "compromise": "^14.15.1",
     "typescript": "^6.0.3"
   }
 }
diff --git a/packages/bench/results/RESULTS.md b/packages/bench/results/RESULTS.md
@@ -83,3 +83,71 @@ The reference annotations derive from reviewed pipeline output, so the anonymize
 | cs       |  207 |    100.0% | 100.0% | 100.0% |
 | de       |   24 |    100.0% | 100.0% | 100.0% |
 | en       |  101 |    100.0% | 100.0% | 100.0% |
+
+### compromise
+
+4 documents, 101 reference entities. Scored labels: person, organization.
+
+Skipped 9 corpus documents (no support for: cs, de).
+
+#### exact match
+
+| Label           | Gold | Precision | Recall |    F1 |
+| --------------- | ---: | --------: | -----: | ----: |
+| organization    |   19 |     12.5% |  15.8% | 14.0% |
+| person          |   19 |     40.0% |  63.2% | 49.0% |
+| **all (micro)** |   38 |     27.8% |  39.5% | 32.6% |
+
+| Language | Gold | Precision | Recall |    F1 |
+| -------- | ---: | --------: | -----: | ----: |
+| en       |   38 |     27.8% |  39.5% | 32.6% |
+
+#### overlap match
+
+| Label           | Gold | Precision | Recall |    F1 |
+| --------------- | ---: | --------: | -----: | ----: |
+| organization    |   19 |     58.3% |  73.7% | 65.1% |
+| person          |   19 |     53.3% |  84.2% | 65.3% |
+| **all (micro)** |   38 |     55.6% |  78.9% | 65.2% |
+
+| Language | Gold | Precision | Recall |    F1 |
+| -------- | ---: | --------: | -----: | ----: |
+| en       |   38 |     55.6% |  78.9% | 65.2% |
+
+### presidio
+
+5 documents, 125 reference entities. Scored labels: person, organization, email address, phone number, date.
+
+Skipped 8 corpus documents (no support for: cs).
+
+#### exact match
+
+| Label           | Gold | Precision | Recall |    F1 |
+| --------------- | ---: | --------: | -----: | ----: |
+| date            |   27 |     14.4% |  51.9% | 22.6% |
+| email address   |    1 |      0.0% |   0.0% |  0.0% |
+| organization    |   23 |      6.9% |  60.9% | 12.4% |
+| person          |   24 |     59.3% |  66.7% | 62.7% |
+| phone number    |    1 |      0.0% |   0.0% |  0.0% |
+| **all (micro)** |   76 |     13.4% |  57.9% | 21.8% |
+
+| Language | Gold | Precision | Recall |    F1 |
+| -------- | ---: | --------: | -----: | ----: |
+| de       |   12 |     30.0% |  25.0% | 27.3% |
+| en       |   64 |     12.9% |  64.1% | 21.5% |
+
+#### overlap match
+
+| Label           | Gold | Precision | Recall |    F1 |
+| --------------- | ---: | --------: | -----: | ----: |
+| date            |   27 |     23.7% |  85.2% | 37.1% |
+| email address   |    1 |      0.0% |   0.0% |  0.0% |
+| organization    |   23 |      9.4% |  82.6% | 16.9% |
+| person          |   24 |     81.5% |  91.7% | 86.3% |
+| phone number    |    1 |     50.0% | 100.0% | 66.7% |
+| **all (micro)** |   76 |     19.8% |  85.5% | 32.2% |
+
+| Language | Gold | Precision | Recall |    F1 |
+| -------- | ---: | --------: | -----: | ----: |
+| de       |   12 |     60.0% |  50.0% | 54.5% |
+| en       |   64 |     18.6% |  92.2% | 30.9% |
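
The scorer is not part of this diff, so the following is only a sketch of how exact-match and overlap-match precision, recall, and F1 are commonly computed, not the bench's implementation. It assumes that exact match requires identical start, end, and label, that overlap match requires the same label and at least one shared position, and that each gold entity is matched at most once. The `Span` type, the `score` function, and the sample spans are hypothetical:

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int
    end: int  # exclusive
    label: str


def matches(pred: Span, gold: Span, mode: str) -> bool:
    if pred.label != gold.label:
        return False
    if mode == "exact":
        return pred.start == gold.start and pred.end == gold.end
    # "overlap": the spans share at least one position.
    return pred.start < gold.end and gold.start < pred.end


def score(preds: list[Span], golds: list[Span], mode: str) -> tuple[float, float, float]:
    """Return (precision, recall, F1); each gold span is matched at most once."""
    unmatched = list(golds)
    true_positives = 0
    for pred in preds:
        for gold in unmatched:
            if matches(pred, gold, mode):
                unmatched.remove(gold)
                true_positives += 1
                break
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


golds = [Span(0, 9, "person"), Span(20, 30, "organization")]
preds = [
    Span(0, 9, "person"),          # exact hit
    Span(22, 30, "organization"),  # boundary error: counts under overlap only
    Span(40, 45, "organization"),  # false positive under both modes
]

exact = score(preds, golds, "exact")
overlap = score(preds, golds, "overlap")
```

Under these assumptions, the gap between the exact and overlap rows for a label measures boundary errors, while precision that stays low under overlap points to spurious detections.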