12 changes: 12 additions & 0 deletions README.md
@@ -41,6 +41,17 @@ echo "Contact Jan Novák at jan.novak@example.com" | bunx @stll/anonymize-cli
- NER, coreference handling, and confidence boosting
- Native, browser, and Vite-compatible entrypoints

## Benchmarks

[`packages/bench`](packages/bench) holds reproducible throughput and
quality benchmarks for the deterministic pipeline, plus comparison
runs of Microsoft Presidio and compromise on the same legal-contract
corpus scored by the same scorer. See
[`packages/bench/results/RESULTS.md`](packages/bench/results/RESULTS.md)
for current numbers and
[`packages/bench/README.md`](packages/bench/README.md) for the
methodology and its limits.

## Development

```bash
@@ -70,3 +81,4 @@ bun run hooks:install
- [`packages/anonymize`](packages/anonymize)
- [`packages/data`](packages/data)
- [`packages/anonymize/wasm`](packages/anonymize/wasm)
- [`packages/bench`](packages/bench)
9 changes: 9 additions & 0 deletions bun.lock

Generated file; diff not rendered by default.

59 changes: 59 additions & 0 deletions packages/bench/README.md
@@ -97,6 +97,65 @@ bun run bench:quality -- --predictions path/to/predictions.json \
bun run bench:render
```

## Comparison runs

Committed results include two external tools run on the same corpus
and scored by the same scorer. Each run is restricted (via
`--labels`) to the labels that tool claims to detect, so micro
averages are not comparable across tools with different label
filters; compare per label instead.
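
To see why micro averages shift with the label filter, the sketch below pools true-positive, predicted, and gold counts over the selected labels before computing precision, recall, and F1. The counts and the `micro_scores` helper are hypothetical and chosen only for illustration; they are not taken from the committed results or from the bench scorer.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical per-label counts: (true positives, predicted spans, gold spans).
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "person": (16, 20, 20),
    "organization": (5, 50, 20),
    "date": (10, 40, 20),
}


def micro_scores(
    counts: Dict[str, Tuple[int, int, int]], labels: Iterable[str]
) -> Tuple[float, float, float]:
    """Micro precision, recall, and F1 pooled over the given labels only."""
    selected = [counts[label] for label in labels]
    tp = sum(c[0] for c in selected)
    predicted = sum(c[1] for c in selected)
    gold = sum(c[2] for c in selected)
    precision = tp / predicted if predicted else 0.0
    recall = tp / gold if gold else 0.0
    f1 = 2 * tp / (predicted + gold) if predicted + gold else 0.0
    return precision, recall, f1


person_only = micro_scores(COUNTS, ["person"])
person_and_org = micro_scores(COUNTS, ["person", "organization"])
```

With these made-up counts, the `person`-only filter scores 0.8 on all three measures, while adding `organization` drops micro precision to 0.3 even though the person counts are unchanged. That is the effect the per-label comparison avoids.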

### Microsoft Presidio

`comparison/presidio/run.py` (pinned deps in `requirements.txt`)
runs `presidio-analyzer` with its documented spaCy defaults
(`en_core_web_lg`, `de_core_news_lg`) and writes the interchange
format. Scored labels: person, organization, email address, phone
number, date.

Read the numbers with these caveats:

- **Czech is skipped entirely.** Presidio has no Czech language
support, so 8 of 13 corpus documents cannot be processed at all.
- **Organizations are enabled deliberately.** Presidio ignores
spaCy `ORG` spans by default because they are noisy; the run
enables them because organizations are unavoidable in legal
contracts. The resulting false-positive count shows why the
default exists.
- **`DATE_TIME` is broader than the reference `date` label** (it
also matches durations and relative time), which depresses
Presidio's date precision; this is a label-mapping asymmetry, not
purely a detection failure.
- Labels for which Presidio has no recognizers on this corpus
(registration numbers, tax identifiers, monetary amounts,
street-level address spans) are excluded rather than scored
as zero.

Reproduce:

```sh
python3 -m venv .venv && .venv/bin/pip install -r comparison/presidio/requirements.txt
.venv/bin/python -m spacy download en_core_web_lg
.venv/bin/python -m spacy download de_core_news_lg
.venv/bin/python comparison/presidio/run.py
bun src/run-quality.ts --predictions results/predictions.presidio.json \
--labels "person,organization,email address,phone number,date"
bun run bench:render
```

### compromise

`src/run-compromise.ts` runs the compromise NLP library (the
closest JS-ecosystem baseline that reports spans) on the English
documents only; scored labels: person, organization.

```sh
bun src/run-compromise.ts
bun src/run-quality.ts --predictions results/predictions.compromise.json \
--labels "person,organization"
bun run bench:render
```

## Throughput methodology

One-time costs (dictionary load, search automaton preparation) are
5 changes: 5 additions & 0 deletions packages/bench/comparison/presidio/requirements.txt
@@ -0,0 +1,5 @@
presidio-analyzer==2.2.359
spacy==3.8.13
# Models (installed via `python -m spacy download <name>`):
# en_core_web_lg
# de_core_news_lg
Comment on lines +3 to +5

**P2: Pin the spaCy model package versions**

The Presidio comparison is meant to be reproducible, but these model installs are left as bare names, so `python -m spacy download en_core_web_lg` / `de_core_news_lg` will install the best compatible model available at rerun time rather than the exact model used for the committed numbers. If spaCy publishes a new compatible model, the same pinned `requirements.txt` can produce different entities and benchmark results; pin the model wheel versions or direct download names alongside the Python deps.
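
One complementary guard against silent model drift is to record the model versions used for the committed numbers and fail fast when the installed ones differ. The sketch below is hypothetical and not part of this PR: `EXPECTED_MODELS`, its placeholder version strings, and `find_pin_mismatches` are invented for illustration. It assumes each spaCy model is installed as a pip distribution under the model's own name, so that `importlib.metadata.version` can read its version.

```python
from importlib import metadata
from typing import Callable, Dict, List

# Placeholder pins (assumption): replace "X.Y.Z" with the versions that were
# actually installed when the committed predictions were produced.
EXPECTED_MODELS: Dict[str, str] = {
    "en_core_web_lg": "X.Y.Z",
    "de_core_news_lg": "X.Y.Z",
}


def installed_version(name: str) -> str:
    """Installed distribution version, or a marker string if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def find_pin_mismatches(
    expected: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> List[str]:
    """One message per package whose installed version differs from its pin."""
    problems = []
    for name, wanted in sorted(expected.items()):
        found = lookup(name)
        if found != wanted:
            problems.append(f"{name}: expected {wanted}, found {found}")
    return problems
```

A benchmark script could call `find_pin_mismatches(EXPECTED_MODELS)` before analysis and abort when the returned list is non-empty. The `lookup` parameter exists only so the check can be exercised without the models installed.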


112 changes: 112 additions & 0 deletions packages/bench/comparison/presidio/run.py
@@ -0,0 +1,112 @@
"""Runs Microsoft Presidio over the bench contract corpus and writes
predictions in the bench interchange format (packages/bench/README.md).

Czech fixtures are skipped: Presidio has no Czech language support
(no spaCy model and no Czech recognizers); that absence is reported
in the results rather than scored as zero.

Offsets are converted from Python code-point indices to UTF-16 code
units to match the reference annotations.

Usage:
    python run.py [--out ../../results/predictions.presidio.json]
"""

import argparse
import json
from pathlib import Path

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

LANGUAGE_MODELS = {"en": "en_core_web_lg", "de": "de_core_news_lg"}

LABEL_MAP = {
    "PERSON": "person",
    "ORGANIZATION": "organization",
    "EMAIL_ADDRESS": "email address",
    "PHONE_NUMBER": "phone number",
    "DATE_TIME": "date",
}

FIXTURES_DIR = (
    Path(__file__).resolve().parents[3]
    / "anonymize"
    / "src"
    / "__test__"
    / "fixtures"
    / "contracts"
)
DEFAULT_OUT = (
    Path(__file__).resolve().parents[2] / "results" / "predictions.presidio.json"
)


def utf16_offsets(text: str) -> list[int]:
    """Cumulative UTF-16 code-unit offset for each code-point index."""
    offsets = [0] * (len(text) + 1)
    for index, char in enumerate(text):
        offsets[index + 1] = offsets[index] + (2 if ord(char) > 0xFFFF else 1)
    return offsets


def build_analyzer() -> AnalyzerEngine:
    configuration = {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": lang, "model_name": model}
            for lang, model in LANGUAGE_MODELS.items()
        ],
        # Default Presidio config ignores ORG spans from spaCy; the
        # comparison needs organizations, so keep only the truly
        # non-PII tags ignored.
        "ner_model_configuration": {
            "labels_to_ignore": ["CARDINAL", "ORDINAL", "QUANTITY", "PERCENT"],
        },
    }
    provider = NlpEngineProvider(nlp_configuration=configuration)
    return AnalyzerEngine(
        nlp_engine=provider.create_engine(),
        supported_languages=list(LANGUAGE_MODELS),
    )


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--out", type=Path, default=DEFAULT_OUT)
    args = parser.parse_args()

    analyzer = build_analyzer()
    docs = []
    for language_dir in sorted(FIXTURES_DIR.iterdir()):
        language = language_dir.name
        if language not in LANGUAGE_MODELS:
            print(f"skipping {language}: no Presidio language support")
            continue
Comment on lines +81 to +85

medium

To prevent noisy console output or potential errors when encountering non-directory files (such as `.DS_Store` or `README.md`) in the fixtures directory, it is safer to explicitly verify that each item is a directory before processing.

Suggested change
-    for language_dir in sorted(FIXTURES_DIR.iterdir()):
-        language = language_dir.name
-        if language not in LANGUAGE_MODELS:
-            print(f"skipping {language}: no Presidio language support")
-            continue
+    for language_dir in sorted(FIXTURES_DIR.iterdir()):
+        if not language_dir.is_dir():
+            continue
+        language = language_dir.name
+        if language not in LANGUAGE_MODELS:
+            print(f"skipping {language}: no Presidio language support")
+            continue
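
The behaviour the suggestion targets can be checked in isolation. The sketch below builds a throwaway directory with three language folders plus two stray files and lists only the directories. The layout and the `language_dirs` helper are hypothetical stand-ins for the fixtures tree, not code from this PR.

```python
import tempfile
from pathlib import Path
from typing import List


def language_dirs(root: Path) -> List[str]:
    """Sorted names of the immediate subdirectories of root; files are skipped."""
    return [entry.name for entry in sorted(root.iterdir()) if entry.is_dir()]


def demo() -> List[str]:
    """Create a temporary fixtures-like tree and return its language folders."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("cs", "de", "en"):
            (root / name).mkdir()
        # Stray files that would otherwise be treated as language codes.
        (root / ".DS_Store").write_bytes(b"")
        (root / "README.md").write_text("notes\n", encoding="utf-8")
        return language_dirs(root)
```

Without the `is_dir()` filter, `.DS_Store` and `README.md` would also be iterated, and the loop under review would report each of them as an unsupported language.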

        for fixture in sorted(language_dir.glob("*.txt")):
            text = fixture.read_text(encoding="utf-8").replace("\r\n", "\n")
            offsets = utf16_offsets(text)
            results = analyzer.analyze(
                text=text, language=language, entities=list(LABEL_MAP)
            )
            entities = [
                {
                    "start": offsets[result.start],
                    "end": offsets[result.end],
                    "label": LABEL_MAP[result.entity_type],
                }
                for result in results
            ]
            docs.append({"id": f"{language}/{fixture.name}", "entities": entities})
            print(f"{language}/{fixture.name}: {len(entities)} entities")

    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(
        json.dumps({"tool": "presidio", "docs": docs}, indent=2) + "\n",
        encoding="utf-8",
    )
    print(f"written: {args.out}")


if __name__ == "__main__":
    main()
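
As a sanity check of the offset conversion, the sketch below repeats `utf16_offsets` from the script and applies it to a string containing a character outside the Basic Multilingual Plane. The sample text and the `utf16_length` helper are made up for illustration; encoding to `utf-16-le` is used as an independent way to count UTF-16 code units.

```python
def utf16_offsets(text: str) -> list[int]:
    """Cumulative UTF-16 code-unit offset for each code-point index."""
    offsets = [0] * (len(text) + 1)
    for index, char in enumerate(text):
        offsets[index + 1] = offsets[index] + (2 if ord(char) > 0xFFFF else 1)
    return offsets


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, counted via the encoded byte length."""
    return len(text.encode("utf-16-le")) // 2


# Made-up sample: one ASCII letter, one astral-plane emoji, then a name.
sample = "a\U0001F600Jan"
offsets = utf16_offsets(sample)

# "Jan" covers code points 2..5; in UTF-16 units the span shifts by one
# because the emoji occupies a surrogate pair.
start, end = offsets[2], offsets[5]
```

Here the code-point span `(2, 5)` becomes the UTF-16 span `(3, 6)`, which is what a JavaScript consumer would index with.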
1 change: 1 addition & 0 deletions packages/bench/package.json
@@ -21,6 +21,7 @@
  "devDependencies": {
    "@types/node": "^25.9.2",
    "bun-types": "^1.3.14",
    "compromise": "^14.15.1",
    "typescript": "^6.0.3"
  }
}
68 changes: 68 additions & 0 deletions packages/bench/results/RESULTS.md
@@ -83,3 +83,71 @@ The reference annotations derive from reviewed pipeline output, so the anonymize
| cs | 207 | 100.0% | 100.0% | 100.0% |
| de | 24 | 100.0% | 100.0% | 100.0% |
| en | 101 | 100.0% | 100.0% | 100.0% |

### compromise

4 documents, 101 reference entities. Scored labels: person, organization.

Skipped 9 corpus documents (no support for: cs, de).

#### exact match

| Label | Gold | Precision | Recall | F1 |
| --------------- | ---: | --------: | -----: | ----: |
| organization | 19 | 12.5% | 15.8% | 14.0% |
| person | 19 | 40.0% | 63.2% | 49.0% |
| **all (micro)** | 38 | 27.8% | 39.5% | 32.6% |

| Language | Gold | Precision | Recall | F1 |
| -------- | ---: | --------: | -----: | ----: |
| en | 38 | 27.8% | 39.5% | 32.6% |

#### overlap match

| Label | Gold | Precision | Recall | F1 |
| --------------- | ---: | --------: | -----: | ----: |
| organization | 19 | 58.3% | 73.7% | 65.1% |
| person | 19 | 53.3% | 84.2% | 65.3% |
| **all (micro)** | 38 | 55.6% | 78.9% | 65.2% |

| Language | Gold | Precision | Recall | F1 |
| -------- | ---: | --------: | -----: | ----: |
| en | 38 | 55.6% | 78.9% | 65.2% |

### presidio

5 documents, 125 reference entities. Scored labels: person, organization, email address, phone number, date.

Skipped 8 corpus documents (no support for: cs).

#### exact match

| Label | Gold | Precision | Recall | F1 |
| --------------- | ---: | --------: | -----: | ----: |
| date | 27 | 14.4% | 51.9% | 22.6% |
| email address | 1 | 0.0% | 0.0% | 0.0% |
| organization | 23 | 6.9% | 60.9% | 12.4% |
| person | 24 | 59.3% | 66.7% | 62.7% |
| phone number | 1 | 0.0% | 0.0% | 0.0% |
| **all (micro)** | 76 | 13.4% | 57.9% | 21.8% |

| Language | Gold | Precision | Recall | F1 |
| -------- | ---: | --------: | -----: | ----: |
| de | 12 | 30.0% | 25.0% | 27.3% |
| en | 64 | 12.9% | 64.1% | 21.5% |

#### overlap match

| Label | Gold | Precision | Recall | F1 |
| --------------- | ---: | --------: | -----: | ----: |
| date | 27 | 23.7% | 85.2% | 37.1% |
| email address | 1 | 0.0% | 0.0% | 0.0% |
| organization | 23 | 9.4% | 82.6% | 16.9% |
| person | 24 | 81.5% | 91.7% | 86.3% |
| phone number | 1 | 50.0% | 100.0% | 66.7% |
| **all (micro)** | 76 | 19.8% | 85.5% | 32.2% |

| Language | Gold | Precision | Recall | F1 |
| -------- | ---: | --------: | -----: | ----: |
| de | 12 | 60.0% | 50.0% | 54.5% |
| en | 64 | 18.6% | 92.2% | 30.9% |
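
The gap between the exact and overlap tables comes from how a predicted span is matched to a reference span. The sketch below shows one plausible reading, assumed here rather than taken from the bench scorer: an exact match requires identical start, end, and label, while an overlap match requires the same label and at least one shared character, with each reference span matched at most once. The spans and helper names are invented.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when both spans carry the same label and share a character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def count_matches(predicted: List[Span], gold: List[Span], exact: bool) -> int:
    """Greedy one-to-one matching of predictions against reference spans."""
    unused = list(gold)
    hits = 0
    for pred in predicted:
        for ref in unused:
            matched = (pred == ref) if exact else overlaps(pred, ref)
            if matched:
                unused.remove(ref)  # each reference span counts once
                hits += 1
                break
    return hits


# Invented example: one perfect span, one with a wrong left boundary,
# and one spurious prediction.
gold = [(0, 9, "person"), (20, 32, "organization")]
predicted = [(0, 9, "person"), (24, 32, "organization"), (40, 44, "organization")]

exact_hits = count_matches(predicted, gold, exact=True)
overlap_hits = count_matches(predicted, gold, exact=False)
```

In this example exact matching credits 1 of 3 predictions and overlap matching credits 2 of 3. A tool that locates entities but misplaces their boundaries therefore scores much higher in the overlap tables than in the exact ones.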