Cloak - NER Extraction and Anonymization Pipeline

Cloak is a named-entity detection, redaction, and anonymization pipeline built on the zero-shot GLiNER model. It delivers multilingual entity extraction, runtime-configurable labels, numbered redaction placeholders with re-identification mapping, and Faker-powered PII replacement with document-wide consistency.

Features

  • Zero-shot NER — works with any entity type without retraining, powered by GLiNER
  • Zero-config setup — auto-downloads the default model from HuggingFace and exports to ONNX on first run
  • Dual backend — ONNX-optimized (default) or PyTorch inference via a single use_onnx flag
  • Multi-pass extraction — runs the model multiple times with decreasing confidence thresholds (0.5 -> 0.3), masking already-found entities between passes to surface additional ones (sketched after this list)
  • Batched inference — large texts are chunked automatically and processed through GLiNER's native batched inference
  • Numbered redaction — consistent placeholders (#1_PERSON_REDACTED) with optional re-identification mapping
  • Synthetic replacement — Faker-powered realistic alternatives with pluggable strategies per entity type
  • Thread-safe — all shared state protected with locks; safe for concurrent use
  • Installable package — pip install support, a CLI entry point, and pyproject.toml metadata
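
The multi-pass loop is easiest to see in a short sketch. This is a conceptual illustration, not Cloak's internal code: extract_once stands in for a single GLiNER call and is an assumption made for demonstration.

# Conceptual sketch of multi-pass extraction with span masking.
# `extract_once` is a stand-in for one GLiNER pass, not a real Cloak API.
def extract_once(text: str, threshold: float) -> list[dict]:
    return []  # swap in a real model call returning entity dicts

def multi_pass_extract(text: str, thresholds=(0.5, 0.3)) -> list[dict]:
    found, masked = [], text
    for threshold in thresholds:
        for ent in extract_once(masked, threshold):
            found.append(ent)
            # Mask the matched span so later, lower-threshold passes
            # surface entities the earlier passes missed.
            masked = (masked[:ent["start"]]
                      + "#" * (ent["end"] - ent["start"])
                      + masked[ent["end"]:])
    return found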

Installation

# From source
git clone https://github.com/grohit1810/Cloak.git
cd Cloak
pip install -e ".[dev]"

# GPU support (optional)
pip install -e ".[gpu]"

Requirements

  • Python 3.10+
  • Dependencies are managed via pyproject.toml (numpy, onnxruntime, transformers, gliner, faker, torch)

Quick Start

Python API

import cloak

# Extract entities — no model_path needed, auto-downloads on first run
result = cloak.extract(
    "John works at Google Inc.",
    labels=["person", "company"],
)
print(result["entities"])
# [{"text": "John", "label": "person", "start": 0, "end": 4, "score": 0.95}, ...]

# Redact entities with numbered placeholders
result = cloak.redact(
    "Alice lives in Paris",
    labels=["person", "location"],
)
print(result["anonymized_text"])
# "#1_PERSON_REDACTED lives in #1_LOCATION_REDACTED"

# Replace with synthetic data (Faker-powered)
result = cloak.replace(
    "Bob Smith works at Microsoft",
    labels=["person", "company"],
)
print(result["anonymized_text"])
# "David Johnson works at TechCorp Inc"

# Replace with custom data
result = cloak.replace_with_data(
    "Jane works at Apple",
    labels=["person", "company"],
    user_replacements={"person": ["Anonymous User"], "company": "REDACTED_COMPANY"},
)

# Use a specific model or local path
result = cloak.extract(
    "text",
    labels=["person"],
    model_path="urchade/gliner_large-v2.1",  # HuggingFace ID or local path
    use_onnx=False,                           # Use PyTorch instead of ONNX
)

Command Line Interface

# Basic extraction — auto-downloads default model on first run
cloak --text "John works at Google" --labels person company

# Use a specific model
cloak --model urchade/gliner_large-v2.1 --text "John works at Google" --labels person company

# Redact with custom placeholder format
cloak --text "Alice and Bob work here" --redact --placeholder "#{id}_{label}_HIDDEN"

# Replace with synthetic data (PyTorch backend)
cloak --text-file input.txt --replace --no-onnx --labels person location date

# Chunked processing for large files
cloak --text-file large.txt --parallel --chunk-size 500

# Validation and overlap resolution
cloak --text "Text..." --overlap-strategy longest --verbose

Class-Based API (Full Control)

from cloak import CloakExtraction
from cloak.anonymization.redactor import EntityRedactor
from cloak.anonymization.replacer import EntityReplacer

# Create extraction pipeline with full configuration
pipeline = CloakExtraction(
    model_path="urchade/gliner_large-v2.1",
    use_onnx=True,           # ONNX backend (default) or PyTorch (False)
    use_caching=True,         # Cache extraction results
    min_confidence=0.3,       # Minimum entity confidence threshold
    overlap_strategy="highest_confidence",  # or "longest", "first"
)

# Extract
result = pipeline.extract_entities(
    "John Smith lives in Paris and works at Google.",
    labels=["person", "location", "organization"],
    max_passes=2,
    use_parallel=None,  # Auto-detect based on text length
)

# Then redact or replace independently
redactor = EntityRedactor()
redacted = redactor.redact(
    text="John Smith lives in Paris",
    entities=result["entities"],
    include_re_id_map=True,  # Opt-in: include reverse mapping
)
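
EntityReplacer is imported above but not exercised. A minimal sketch of how it might be driven, assuming it mirrors EntityRedactor's call shape and accepts the seed option listed under Anonymization Options below (verify against cloak/anonymization/replacer.py):

# Hedged sketch: the replace() signature and the `seed` kwarg are assumptions;
# confirm against replacer.py before relying on them.
replacer = EntityReplacer(seed=42)  # seeded for reproducible Faker output
replaced = replacer.replace(
    text="John Smith lives in Paris",
    entities=result["entities"],
)
print(replaced["anonymized_text"])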

Architecture

Input Text
    |
    v
+---------------------+
|   Text Chunking     |  <-- Word-based chunks (600 words default)
|   (if large text)   |      Preserves character offsets
+---------------------+
    |
    v
+---------------------+
|  Batched NER        |  <-- GLiNER model (ONNX or PyTorch)
|  Inference          |      All chunks in single batched forward pass
|                     |      Multi-pass: threshold 0.5 then 0.3
+---------------------+
    |
    v
+---------------------+
|  Entity Validation  |  <-- Confidence filtering
|  & Overlap          |      Position/text consistency checks
|  Resolution         |      Sweep-line overlap resolution (3 strategies)
+---------------------+
    |
    v
+---------------------+
|  Entity Merging     |  <-- Adjacent entities with same label
|                     |      Weighted average scoring
+---------------------+
    |
    v
+---------------------+
|  Anonymization      |  <-- Redaction: segment-join numbered placeholders
|  (Optional)         |      Replacement: Faker / custom strategies
+---------------------+
    |
    v
Results + Analytics
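
The overlap-resolution step deserves a closer look. Below is a minimal, self-contained sketch of sweep-line resolution under the "highest_confidence" strategy; it is illustrative only, not Cloak's internal implementation (the entity dict shape matches the extract() output shown in Quick Start).

# Illustrative sweep-line overlap resolution ("highest_confidence" strategy).
# Not Cloak's implementation, just the core idea.
def resolve_overlaps(entities):
    # Sort by start offset so overlapping spans become adjacent in the sweep.
    ordered = sorted(entities, key=lambda e: (e["start"], e["end"]))
    kept = []
    for ent in ordered:
        if kept and ent["start"] < kept[-1]["end"]:  # overlaps the previous span
            if ent["score"] > kept[-1]["score"]:     # keep the higher confidence
                kept[-1] = ent
        else:
            kept.append(ent)
    return kept

ents = [
    {"text": "New York", "label": "location", "start": 0, "end": 8, "score": 0.9},
    {"text": "York", "label": "location", "start": 4, "end": 8, "score": 0.6},
]
print(resolve_overlaps(ents))  # keeps only the higher-scoring "New York"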

Module Structure

cloak/
  __init__.py              # Public API re-exports
  api.py                   # Module-level functions (extract, redact, replace)
  constants.py             # Shared configuration constants
  extraction_pipeline.py   # CloakExtraction orchestrator
  cli.py                   # Command-line interface

  models/
    gliner_model.py        # GLiNER wrapper (ONNX + PyTorch, auto-download, thread-safe)

  extraction/
    extractor.py           # Multi-pass entity extraction with span-based masking
    parallel_processor.py  # Batched inference for chunked text
    chunker.py             # Word-based text chunking

  anonymization/
    redactor.py            # Numbered redaction with re-identification mapping
    replacer.py            # Synthetic replacement with strategy chain
    strategies/
      base.py              # ReplacementStrategy Protocol
      faker_strategy.py    # Faker-powered realistic data
      country_strategy.py  # Geographic data preservation
      date_strategy.py     # Date format-preserving replacement
      default_strategy.py  # Fallback character-pattern replacement

  utils/
    cache_manager.py       # LRU cache with analytics
    entity_validator.py    # Confidence, position, overlap validation
    merger.py              # Adjacent entity merging (thread-safe)

  data/
    countries.json         # Country replacement data
    replacements.json      # Custom replacement pools

Key Design Decisions

  • Dependency injection: GLiNERModel is created once and shared across extractors via batched inference.
  • Batched inference: Large texts are chunked and processed in a single batched model call, avoiding thread overhead.
  • Strategy pattern: Replacement strategies implement the ReplacementStrategy Protocol. New strategies can be added without modifying existing code (see the sketch after this list).
  • Thread safety: All shared state (model inference, singletons, merge stats) is protected with threading.Lock.
  • Auto-download: On first run, the default model is downloaded from HuggingFace, exported to ONNX, and cached at ~/.cache/cloak/. Configurable via CLOAK_CACHE_DIR env var.
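
For the strategy pattern, a custom strategy might look like the sketch below. The real Protocol lives in cloak/anonymization/strategies/base.py; the method name and signature here are assumptions made for illustration, so match them to the actual Protocol before use.

from typing import Protocol

# Assumed Protocol shape; align with the real one in strategies/base.py.
class ReplacementStrategy(Protocol):
    def replace(self, entity_text: str, label: str) -> str: ...

class TokenStrategy:
    """Toy strategy: map any entity to a fixed-width, label-tagged token."""
    def replace(self, entity_text: str, label: str) -> str:
        return f"<{label.upper()}:{abs(hash(entity_text)) % 10_000:04d}>"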

Configuration

Extraction Parameters

Parameter          Default                     Description
-----------------  --------------------------  ------------------------------------------------------------------
model_path         urchade/gliner_large-v2.1   HuggingFace model ID or local path (auto-downloaded on first run)
use_onnx           True                        Use the ONNX backend (faster); set False for PyTorch
max_passes         2                           Number of multi-pass extraction rounds
min_confidence     0.3                         Minimum entity confidence threshold
chunk_size         600                         Words per chunk for batched processing
overlap_strategy   "highest_confidence"        Overlap resolution: "highest_confidence", "longest", or "first"
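
chunk_size is the one parameter not shown in the class-based example above. Assuming CloakExtraction accepts it alongside the other constructor options (verify against extraction_pipeline.py):

from cloak import CloakExtraction

# Hedged sketch: `chunk_size` as a constructor kwarg is an assumption here.
pipeline = CloakExtraction(
    model_path="urchade/gliner_large-v2.1",
    chunk_size=400,  # smaller chunks for memory-constrained runs
)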

Anonymization Options

Parameter            Default                    Description
-------------------  -------------------------  ----------------------------------------------
numbered             True                       Use numbered placeholders (#1_PERSON_REDACTED)
placeholder_format   "#{id}_{label}_REDACTED"   Customizable placeholder template
ensure_consistency   True                       Same entity text gets the same replacement
include_re_id_map    False                      Include reverse mapping (opt-in for security)
seed                 None                       Faker seed for reproducible replacements
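
Assuming the module-level helpers accept these options as keyword arguments (confirm against cloak/api.py), usage might look like:

import cloak

# Hedged sketch: the kwargs below mirror the table and are assumed to be
# accepted by the module-level helpers; verify against api.py.
result = cloak.redact(
    "Alice met Bob in Berlin",
    labels=["person", "location"],
    placeholder_format="#{id}_{label}_HIDDEN",  # custom template
    include_re_id_map=True,                     # opt-in reverse mapping
)

result = cloak.replace(
    "Alice met Bob in Berlin",
    labels=["person", "location"],
    seed=1234,  # reproducible Faker replacements
)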

Environment Variables

Variable          Default          Description
----------------  ---------------  ----------------------------------------
CLOAK_CACHE_DIR   ~/.cache/cloak   Directory for cached ONNX model exports

Development

# Setup
git clone https://github.com/grohit1810/Cloak.git
cd Cloak
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run tests with coverage
pytest tests/ --cov=cloak --cov-report=term-missing

# Lint and format
ruff check cloak/ tests/
ruff format cloak/ tests/

Acknowledgments

  • Built on the GLiNER architecture for zero-shot NER
  • Uses Faker for realistic synthetic data generation
