Document Anonymizer
Privacy-first German PII detection and document redaction
Quick Start · Architecture · Tech Stack · Development
Detect personally identifiable information in German text and PDF files using seven custom recognizers plus spaCy NER. Anonymize with five strategies (replace, fake, mask, hash, redact) and perform physical PDF redaction that removes text from the content stream — not cosmetic overlay. Built for environments where data protection matters: financial regulation (BaFin), healthcare, legal, and public sector.
git clone git@github.com:JuliusScheuerer/document-anonymizer.git
cd document-anonymizer
uv sync --dev
uv run uvicorn document_anonymizer.api.app:app --reload
# http://localhost:8000 → Web UI
# http://localhost:8000/docs → OpenAPI docsDocker:
docker compose up -d --build
# Health check: curl http://localhost:8000/healthThe Docker image downloads the spaCy model at build time (~500 MB), so the container runs fully offline.
The HTMX-powered frontend provides a complete workflow: paste text or upload a file (TXT/PDF), adjust the confidence threshold, review detected entities with highlighted annotations, choose an anonymization strategy, and download the result.
# Detect PII entities
curl -s -X POST http://localhost:8000/api/detect \
-H "Content-Type: application/json" \
-d '{"text": "Herr Max Mustermann, IBAN DE89 3704 0044 0532 0130 00"}' | jq
# Anonymize text
curl -s -X POST http://localhost:8000/api/anonymize \
-H "Content-Type: application/json" \
-d '{"text": "Max Mustermann, Steuer-ID 12345679811", "strategy": "replace"}' | jqFull API documentation at /docs (Swagger UI) and /redoc.
┌──────────────────────────────────────────────────────┐
│ Web UI (HTMX + Jinja2) REST API (/api) │
│ Upload → Detect → Review → Anonymize / Download │
├──────────────────────────────────────────────────────┤
│ Security Middleware │
│ CSP · Rate Limiter · File Validation · Audit Log │
├────────────────────────┬─────────────────────────────┤
│ Detection Engine │ Anonymization Engine │
│ Presidio + spaCy NER │ Replace · Mask · Hash │
│ 7 German recognizers │ Fake (de_DE) · Redact │
├────────────────────────┴─────────────────────────────┤
│ Document Layer │
│ Text Handler · PDF Handler (PyMuPDF physical redact) │
└──────────────────────────────────────────────────────┘
Key properties: Zero persistence (all in-memory), air-gappable (no runtime network calls), PII-free audit logging (entity counts only, GDPR Art. 5(1)(c)), defense-in-depth validation (magic bytes, PDF structure checks, size limits).
| Recognizer | Entity Type | Detection Method |
|---|---|---|
| IBAN | DE_IBAN |
Pattern match + ISO 7064 Mod 97-10 checksum validation |
| Tax ID | DE_TAX_ID |
Steuer-ID (11-digit) + Steuernummer (regional format) with digit distribution check |
| Phone | DE_PHONE |
International (+49), domestic (0xxx), and mobile formats with context boosting |
| ID Card | DE_ID_CARD |
Restricted alphanumeric charset + check digit (weights 7, 3, 1) |
| Handelsregister | DE_HANDELSREGISTER |
HRA/HRB + number with Registergericht context boosting |
| Address | DE_ADDRESS |
5-digit PLZ + street patterns (Straße, Weg, Platz, Allee) |
| Date | DE_DATE |
DD.MM.YYYY with birth context boosting (geboren, Geburtsdatum) |
| Strategy | Output | Example |
|---|---|---|
| Replace | Entity type label | Max Mustermann → [PERSON] |
| Fake | Realistic German synthetic data | Max Mustermann → Hans Jürgen Ladeck |
| Mask | Partial character masking | DE89 3704 0044... → **** **** ****... |
| Hash | SHA-256 pseudonymization | Max Mustermann → b8a0a89e... |
| Redact | Complete removal | Max Mustermann → |
| Layer | Implementation |
|---|---|
| Input validation | Magic bytes via libmagic, PDF structure check, 10 MB size limit |
| XSS prevention | Full HTML escaping before markup insertion, strict CSP |
| Security headers | CSP, X-Frame-Options DENY, HSTS, no-referrer, Permissions-Policy |
| Rate limiting | Sliding window per IP, X-Forwarded-For spoofing protection |
| Audit trail | Structured JSON logging with request ID correlation, PII-free |
| PDF redaction | Physical text removal from content stream + metadata scrubbing |
| Static analysis | Bandit, ruff security rules (S prefix), Semgrep in CI |
See docs/security.md for the full security architecture documentation.
| Component | Technology |
|---|---|
| PII Detection | Microsoft Presidio + spaCy de_core_news_lg |
| PDF Redaction | PyMuPDF (physical redaction, not cosmetic) |
| Web Framework | FastAPI |
| Frontend | HTMX + Jinja2 (vendored, no CDN) |
| Synthetic Data | Faker de_DE locale |
| File Validation | python-magic (libmagic) |
| Logging | structlog (structured JSON) |
| Quality | ruff, mypy (strict), bandit, Hypothesis, 92% test coverage |
make check # Lint + typecheck + unit tests
make test # Unit tests (90% coverage gate)
make test-integration # API round-trip tests
make test-e2e # PDF redaction verification
make test-property # Hypothesis fuzzing on recognizers
make security # Bandit security scan
make check-compliance # Full compliance suiteThe GitHub Actions CI runs lint, typecheck, and security scanning in parallel, followed by the test suite and compliance checks. All steps use pinned dependency versions (uv sync --frozen).