Cloak - NER Extraction and Anonymization Pipeline

Cloak is a named-entity detection, redaction, and anonymization pipeline built on the zero-shot GLiNER model. It delivers multilingual entity extraction, runtime-configurable labels, numbered redaction placeholders with re-identification mapping, and Faker-powered PII replacement with document-wide consistency.

Features

  • Zero-shot NER — works with any entity type without retraining, powered by GLiNER
  • Zero-config setup — auto-downloads the default model from HuggingFace and exports to ONNX on first run
  • Dual backend — ONNX-optimized (default) or PyTorch inference via a single use_onnx flag
  • Multi-pass extraction — runs the model multiple times with decreasing confidence thresholds (0.5 -> 0.3), masking already-found entities between passes to surface additional ones (sketched after this list)
  • Batched inference — large texts are chunked automatically and processed through GLiNER's native batched inference
  • Numbered redaction — consistent placeholders (#1_PERSON_REDACTED) with optional re-identification mapping
  • Synthetic replacement — Faker-powered realistic alternatives with pluggable strategies per entity type
  • Thread-safe — all shared state protected with locks; safe for concurrent use
  • Installable package — pip install support, a CLI entry point, and pyproject.toml metadata
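
The multi-pass loop is easiest to see in a short sketch. This is a conceptual illustration, not Cloak's internal code: extract_once stands in for a single GLiNER call and is an assumption made for demonstration.

# Conceptual sketch of multi-pass extraction with span masking.
# `extract_once` is a stand-in for one GLiNER pass, not a real Cloak API.
def extract_once(text: str, threshold: float) -> list[dict]:
    return []  # swap in a real model call returning entity dicts

def multi_pass_extract(text: str, thresholds=(0.5, 0.3)) -> list[dict]:
    found, masked = [], text
    for threshold in thresholds:
        for ent in extract_once(masked, threshold):
            found.append(ent)
            # Mask the matched span so later, lower-threshold passes
            # surface entities the earlier passes missed.
            masked = (masked[:ent["start"]]
                      + "#" * (ent["end"] - ent["start"])
                      + masked[ent["end"]:])
    return found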

Installation

# From source
git clone https://github.com/grohit1810/Cloak.git
cd Cloak
pip install -e ".[dev]"

# GPU support (optional)
pip install -e ".[gpu]"

Requirements

  • Python 3.10+
  • Dependencies are managed via pyproject.toml (numpy, onnxruntime, transformers, gliner, faker, torch)

Quick Start

Python API

import cloak

# Extract entities — no model_path needed, auto-downloads on first run
result = cloak.extract(
    "John works at Google Inc.",
    labels=["person", "company"],
)
print(result["entities"])
# [{"text": "John", "label": "person", "start": 0, "end": 4, "score": 0.95}, ...]

# Redact entities with numbered placeholders
result = cloak.redact(
    "Alice lives in Paris",
    labels=["person", "location"],
)
print(result["anonymized_text"])
# "#1_PERSON_REDACTED lives in #1_LOCATION_REDACTED"

# Replace with synthetic data (Faker-powered)
result = cloak.replace(
    "Bob Smith works at Microsoft",
    labels=["person", "company"],
)
print(result["anonymized_text"])
# "David Johnson works at TechCorp Inc"

# Replace with custom data
result = cloak.replace_with_data(
    "Jane works at Apple",
    labels=["person", "company"],
    user_replacements={"person": ["Anonymous User"], "company": "REDACTED_COMPANY"},
)

# Use a specific model or local path
result = cloak.extract(
    "text",
    labels=["person"],
    model_path="urchade/gliner_large-v2.1",  # HuggingFace ID or local path
    use_onnx=False,                           # Use PyTorch instead of ONNX
)

Command Line Interface

# Basic extraction — auto-downloads default model on first run
cloak --text "John works at Google" --labels person company

# Use a specific model
cloak --model urchade/gliner_large-v2.1 --text "John works at Google" --labels person company

# Redact with custom placeholder format
cloak --text "Alice and Bob work here" --redact --placeholder "#{id}_{label}_HIDDEN"

# Replace with synthetic data (PyTorch backend)
cloak --text-file input.txt --replace --no-onnx --labels person location date

# Chunked processing for large files
cloak --text-file large.txt --parallel --chunk-size 500

# Validation and overlap resolution
cloak --text "Text..." --overlap-strategy longest --verbose

Class-Based API (Full Control)

from cloak import CloakExtraction
from cloak.anonymization.redactor import EntityRedactor
from cloak.anonymization.replacer import EntityReplacer

# Create extraction pipeline with full configuration
pipeline = CloakExtraction(
    model_path="urchade/gliner_large-v2.1",
    use_onnx=True,           # ONNX backend (default) or PyTorch (False)
    use_caching=True,         # Cache extraction results
    min_confidence=0.3,       # Minimum entity confidence threshold
    overlap_strategy="highest_confidence",  # or "longest", "first"
)

# Extract
result = pipeline.extract_entities(
    "John Smith lives in Paris and works at Google.",
    labels=["person", "location", "organization"],
    max_passes=2,
    use_parallel=None,  # Auto-detect based on text length
)

# Then redact or replace independently
redactor = EntityRedactor()
redacted = redactor.redact(
    text="John Smith lives in Paris",
    entities=result["entities"],
    include_re_id_map=True,  # Opt-in: include reverse mapping
)
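
EntityReplacer is imported above but not exercised. A minimal sketch of how it might be driven, assuming it mirrors EntityRedactor's call shape and accepts the seed option listed under Anonymization Options below (verify against cloak/anonymization/replacer.py):

# Hedged sketch: the replace() signature and the `seed` kwarg are assumptions;
# confirm against replacer.py before relying on them.
replacer = EntityReplacer(seed=42)  # seeded for reproducible Faker output
replaced = replacer.replace(
    text="John Smith lives in Paris",
    entities=result["entities"],
)
print(replaced["anonymized_text"])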

Architecture

Input Text
    |
    v
+---------------------+
|   Text Chunking     |  <-- Word-based chunks (600 words default)
|   (if large text)   |      Preserves character offsets
+---------------------+
    |
    v
+---------------------+
|  Batched NER        |  <-- GLiNER model (ONNX or PyTorch)
|  Inference          |      All chunks in single batched forward pass
|                     |      Multi-pass: threshold 0.5 then 0.3
+---------------------+
    |
    v
+---------------------+
|  Entity Validation  |  <-- Confidence filtering
|  & Overlap          |      Position/text consistency checks
|  Resolution         |      Sweep-line overlap resolution (3 strategies)
+---------------------+
    |
    v
+---------------------+
|  Entity Merging     |  <-- Adjacent entities with same label
|                     |      Weighted average scoring
+---------------------+
    |
    v
+---------------------+
|  Anonymization      |  <-- Redaction: segment-join numbered placeholders
|  (Optional)         |      Replacement: Faker / custom strategies
+---------------------+
    |
    v
Results + Analytics
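
The overlap-resolution step deserves a closer look. Below is a minimal, self-contained sketch of sweep-line resolution under the "highest_confidence" strategy; it is illustrative only, not Cloak's internal implementation (the entity dict shape matches the extract() output shown in Quick Start).

# Illustrative sweep-line overlap resolution ("highest_confidence" strategy).
# Not Cloak's implementation, just the core idea.
def resolve_overlaps(entities):
    # Sort by start offset so overlapping spans become adjacent in the sweep.
    ordered = sorted(entities, key=lambda e: (e["start"], e["end"]))
    kept = []
    for ent in ordered:
        if kept and ent["start"] < kept[-1]["end"]:  # overlaps the previous span
            if ent["score"] > kept[-1]["score"]:     # keep the higher confidence
                kept[-1] = ent
        else:
            kept.append(ent)
    return kept

ents = [
    {"text": "New York", "label": "location", "start": 0, "end": 8, "score": 0.9},
    {"text": "York", "label": "location", "start": 4, "end": 8, "score": 0.6},
]
print(resolve_overlaps(ents))  # keeps only the higher-scoring "New York"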

Module Structure

cloak/
  __init__.py              # Public API re-exports
  api.py                   # Module-level functions (extract, redact, replace)
  constants.py             # Shared configuration constants
  extraction_pipeline.py   # CloakExtraction orchestrator
  cli.py                   # Command-line interface

  models/
    gliner_model.py        # GLiNER wrapper (ONNX + PyTorch, auto-download, thread-safe)

  extraction/
    extractor.py           # Multi-pass entity extraction with span-based masking
    parallel_processor.py  # Batched inference for chunked text
    chunker.py             # Word-based text chunking

  anonymization/
    redactor.py            # Numbered redaction with re-identification mapping
    replacer.py            # Synthetic replacement with strategy chain
    strategies/
      base.py              # ReplacementStrategy Protocol
      faker_strategy.py    # Faker-powered realistic data
      country_strategy.py  # Geographic data preservation
      date_strategy.py     # Date format-preserving replacement
      default_strategy.py  # Fallback character-pattern replacement

  utils/
    cache_manager.py       # LRU cache with analytics
    entity_validator.py    # Confidence, position, overlap validation
    merger.py              # Adjacent entity merging (thread-safe)

  data/
    countries.json         # Country replacement data
    replacements.json      # Custom replacement pools

Key Design Decisions

  • Dependency injection: GLiNERModel is created once and shared across extractors via batched inference.
  • Batched inference: Large texts are chunked and processed in a single batched model call, avoiding thread overhead.
  • Strategy pattern: Replacement strategies implement the ReplacementStrategy Protocol. New strategies can be added without modifying existing code (see the sketch after this list).
  • Thread safety: All shared state (model inference, singletons, merge stats) is protected with threading.Lock.
  • Auto-download: On first run, the default model is downloaded from HuggingFace, exported to ONNX, and cached at ~/.cache/cloak/. Configurable via CLOAK_CACHE_DIR env var.
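
For the strategy pattern, a custom strategy might look like the sketch below. The real Protocol lives in cloak/anonymization/strategies/base.py; the method name and signature here are assumptions made for illustration, so match them to the actual Protocol before use.

from typing import Protocol

# Assumed Protocol shape; align with the real one in strategies/base.py.
class ReplacementStrategy(Protocol):
    def replace(self, entity_text: str, label: str) -> str: ...

class TokenStrategy:
    """Toy strategy: map any entity to a fixed-width, label-tagged token."""
    def replace(self, entity_text: str, label: str) -> str:
        return f"<{label.upper()}:{abs(hash(entity_text)) % 10_000:04d}>"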

Configuration

Extraction Parameters

Parameter          Default                     Description
-----------------  --------------------------  ------------------------------------------------------------------
model_path         urchade/gliner_large-v2.1   HuggingFace model ID or local path (auto-downloaded on first run)
use_onnx           True                        Use the ONNX backend (faster); set False for PyTorch
max_passes         2                           Number of multi-pass extraction rounds
min_confidence     0.3                         Minimum entity confidence threshold
chunk_size         600                         Words per chunk for batched processing
overlap_strategy   "highest_confidence"        Overlap resolution: "highest_confidence", "longest", or "first"
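
chunk_size is the one parameter not shown in the class-based example above. Assuming CloakExtraction accepts it alongside the other constructor options (verify against extraction_pipeline.py):

from cloak import CloakExtraction

# Hedged sketch: `chunk_size` as a constructor kwarg is an assumption here.
pipeline = CloakExtraction(
    model_path="urchade/gliner_large-v2.1",
    chunk_size=400,  # smaller chunks for memory-constrained runs
)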

Anonymization Options

Parameter            Default                    Description
-------------------  -------------------------  ----------------------------------------------
numbered             True                       Use numbered placeholders (#1_PERSON_REDACTED)
placeholder_format   "#{id}_{label}_REDACTED"   Customizable placeholder template
ensure_consistency   True                       Same entity text gets the same replacement
include_re_id_map    False                      Include reverse mapping (opt-in for security)
seed                 None                       Faker seed for reproducible replacements
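
Assuming the module-level helpers accept these options as keyword arguments (confirm against cloak/api.py), usage might look like:

import cloak

# Hedged sketch: the kwargs below mirror the table and are assumed to be
# accepted by the module-level helpers; verify against api.py.
result = cloak.redact(
    "Alice met Bob in Berlin",
    labels=["person", "location"],
    placeholder_format="#{id}_{label}_HIDDEN",  # custom template
    include_re_id_map=True,                     # opt-in reverse mapping
)

result = cloak.replace(
    "Alice met Bob in Berlin",
    labels=["person", "location"],
    seed=1234,  # reproducible Faker replacements
)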

Environment Variables

Variable          Default          Description
----------------  ---------------  ----------------------------------------
CLOAK_CACHE_DIR   ~/.cache/cloak   Directory for cached ONNX model exports

Development

# Setup
git clone https://github.com/grohit1810/Cloak.git
cd Cloak
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run tests with coverage
pytest tests/ --cov=cloak --cov-report=term-missing

# Lint and format
ruff check cloak/ tests/
ruff format cloak/ tests/

Acknowledgments

  • Built on the GLiNER architecture for zero-shot NER
  • Uses Faker for realistic synthetic data generation
