End-to-end document fraud detection pipeline.
Validates identity documents (passports) using computer vision, OCR, and deterministic rule engines.
┌─────────────────────────────────────────────────────────┐
│ API Layer │
│ FastAPI endpoints · Dark-mode Web UI · Pydantic DTOs │
├─────────────────────────────────────────────────────────┤
│ Core Layer │
│ Interfaces (Ports) · Use Cases · Domain Entities │
├─────────────────────────────────────────────────────────┤
│ Infrastructure Layer │
│ OpenCV Quality Gate · PaddleOCR Engine · Rules Engines │
│ PostgreSQL + pgvector · MinIO Storage · COCO Loader │
└─────────────────────────────────────────────────────────┘
Each document flows through 4 stages:
| Stage | Component | Description |
|---|---|---|
| 1. Quality Gate | OpenCVQualityGate | Blur, brightness, resolution, framing checks (~5ms) |
| 2. OCR | EasyOCR + PaddleOCR v5 | Dual-engine: EasyOCR primary (~1s/field), Paddle fallback (~3s/field) |
| 3. Rules Engine | PassportRulesEngine / BrazilianDocRulesEngine | Deterministic validation with severity scoring |
| 4. Decision | AnalyzeDocumentUseCase | Multi-signal aggregation → APPROVED / REVIEW / SUSPICIOUS / REJECTED |
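
The decision stage can be pictured as a small pure function over the signals from the earlier stages. The following is only a sketch, not the logic in AnalyzeDocumentUseCase: the `decide` function, the `RuleViolation` type, and the mapping from signals to outcomes are all assumptions made for illustration. Only the four outcome names and the two severity levels come from this README.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "HIGH"
    CRITICAL = "CRITICAL"


class Decision(Enum):
    APPROVED = "APPROVED"
    REVIEW = "REVIEW"
    SUSPICIOUS = "SUSPICIOUS"
    REJECTED = "REJECTED"


@dataclass(frozen=True)
class RuleViolation:
    """Hypothetical record of one failed rule."""
    rule: str
    severity: Severity


def decide(quality_ok: bool, ocr_ok: bool,
           violations: list[RuleViolation]) -> Decision:
    """Aggregate stage signals into one outcome (assumed policy)."""
    # Assumption: an image that fails the quality gate is unusable.
    if not quality_ok:
        return Decision.REJECTED
    # Assumption: if no OCR engine produced usable text, a human looks at it.
    if not ocr_ok:
        return Decision.REVIEW
    severities = {v.severity for v in violations}
    # Assumption: any CRITICAL violation flags the document as suspicious.
    if Severity.CRITICAL in severities:
        return Decision.SUSPICIOUS
    # Assumption: HIGH violations alone only warrant manual review.
    if Severity.HIGH in severities:
        return Decision.REVIEW
    return Decision.APPROVED
```

Keeping the policy in one function with no I/O makes each branch easy to unit test and to change as thresholds evolve.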
10 rules for Machine Readable Travel Documents:
| # | Rule | Severity |
|---|---|---|
| 1 | MRZ format validation (TD3, 2×44 chars) | CRITICAL |
| 2 | Document number check digit | CRITICAL |
| 3 | Date of birth check digit | CRITICAL |
| 4 | Date of expiry check digit | CRITICAL |
| 5 | Personal number check digit | HIGH |
| 6 | Composite check digit | CRITICAL |
| 7 | Country code (ISO 3166-1 alpha-3) | HIGH |
| 8 | Date plausibility (DOB past, DOE reasonable) | CRITICAL/HIGH |
| 9 | Required fields presence | HIGH |
| 10 | VIZ ↔ MRZ cross-check (tampering detection) | CRITICAL |
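
Rules 2 to 6 all rest on the ICAO 9303 check digit: each character is mapped to a number (digits as themselves, A to Z as 10 to 35, the filler `<` as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. A minimal sketch of that calculation for the second line of a TD3 MRZ follows. The function names are illustrative and are not taken from PassportRulesEngine; the field positions follow the TD3 layout.

```python
from __future__ import annotations

WEIGHTS = (7, 3, 1)
DIGITS = "0123456789"


def char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value."""
    if ch in DIGITS:
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def check_digit(field: str) -> int:
    """Weighted sum with weights 7, 3, 1, taken modulo 10."""
    return sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


def _matches(field: str, digit: str, allow_filler: bool = False) -> bool:
    # The personal number check digit may be '<' when the field is unused.
    if allow_filler and digit == "<" and set(field) == {"<"}:
        return True
    return digit in DIGITS and check_digit(field) == int(digit)


def validate_td3_line2(line: str) -> dict[str, bool]:
    """Check the five check digits on the second TD3 MRZ line (44 chars)."""
    if len(line) != 44:
        raise ValueError("TD3 MRZ line must be 44 characters")
    return {
        "document_number": _matches(line[0:9], line[9]),
        "date_of_birth": _matches(line[13:19], line[19]),
        "date_of_expiry": _matches(line[21:27], line[27]),
        "personal_number": _matches(line[28:42], line[42], allow_filler=True),
        # Composite covers positions 1-10, 14-20 and 22-43 (1-indexed).
        "composite": _matches(line[0:10] + line[13:20] + line[21:43], line[43]),
    }
```

With the specimen line from ICAO Doc 9303, `validate_td3_line2("L898902C36UTO7408122F1204159ZE184226B<<<<<10")` reports every check as passing. Changing a single digit of the date of birth makes both the date-of-birth check and the composite check fail, which is why an OCR misread and a tampered field look the same at this stage.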
Using MIDV-2020 passport subset in COCO format:
| Split | Images | Countries |
|---|---|---|
| Train | 320 | AZE, GRC, LVA, SRB |
| Valid | 80 | AZE, GRC, LVA, SRB |
| Test | 81 | Mixed |
| Total | 481 | 4 nationalities |
34 annotated field categories including MRZ lines, dates, document numbers, face, signature.
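
For readers unfamiliar with the COCO layout, the sketch below shows how field annotations can be grouped per image using only the standard `images`, `annotations`, and `categories` keys. It is not the project's COCO loader; the helper names and the sample category names (`mrz_line_1`, `face`) are hypothetical, and the sample data is invented for illustration.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def count_by_category(coco: dict) -> Counter:
    """Count annotations per category name."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])


def fields_by_image(coco: dict) -> dict[str, list[tuple[str, list[float]]]]:
    """Group (category name, [x, y, width, height]) pairs by image file name."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped: dict[str, list[tuple[str, list[float]]]] = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[files[ann["image_id"]]].append(
            (names[ann["category_id"]], ann["bbox"])
        )
    return dict(grouped)


# Invented sample in COCO layout; real files would be read with json.load().
sample = {
    "images": [{"id": 1, "file_name": "passport_001.jpg"}],
    "categories": [{"id": 10, "name": "mrz_line_1"}, {"id": 11, "name": "face"}],
    "annotations": [
        {"id": 100, "image_id": 1, "category_id": 10, "bbox": [20, 400, 600, 30]},
        {"id": 101, "image_id": 1, "category_id": 11, "bbox": [30, 80, 150, 200]},
    ],
}
```

Each bounding box gives the crop region that a per-field OCR step would read.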
Tested on MIDV-2020 passports (4 countries):
| Engine | MRZ Checksum Accuracy | Avg Confidence | Speed/field |
|---|---|---|---|
| EasyOCR (raw + allowlist) | 9/10 (90%) | 0.50-1.00 | ~1.5s |
| PaddleOCR v5 (enable_mkldnn=False) | TBD (higher conf) | 0.94-1.00 | ~3-10s |
Strategy: EasyOCR primary (fast) → PaddleOCR fallback (accurate) → Manual review if both fail.
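
The fallback strategy above can be sketched as a loop over engines in priority order, returning the first result that passes validation and signalling manual review otherwise. This is an illustration under stated assumptions, not the project's PassportOCREngine: `read_field`, `OCRResult`, the `accept` callback, and the 0.5 confidence threshold are hypothetical, and engines are modelled as plain callables so that no real EasyOCR or PaddleOCR API is implied.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# An engine takes image bytes and returns (text, confidence), or None.
OCRFn = Callable[[bytes], Optional[tuple[str, float]]]


@dataclass(frozen=True)
class OCRResult:
    text: str
    confidence: float
    engine: str


def read_field(
    image: bytes,
    engines: list[tuple[str, OCRFn]],
    accept: Callable[[str], bool],
    min_confidence: float = 0.5,  # hypothetical threshold
) -> Optional[OCRResult]:
    """Try each engine in order; return None to signal manual review."""
    for name, run in engines:
        try:
            output = run(image)
        except Exception:
            # An engine crash is treated like a miss, so the next one runs.
            continue
        if output is None:
            continue
        text, confidence = output
        if confidence >= min_confidence and accept(text):
            return OCRResult(text=text, confidence=confidence, engine=name)
    return None
```

In practice `accept` could be an MRZ check digit validation, so that a fast but wrong primary read automatically triggers the slower fallback engine.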
# Setup
python -m venv .venv
.venv\Scripts\activate # Windows
pip install -e .
pip install easyocr # Primary OCR engine
# Run OCR test on single passport
python scripts/test_ocr_single.py
# Run dual-engine comparison
python scripts/test_dual_ocr.py
# Run accuracy analysis (MRZ checksum validation)
python scripts/analyze_ocr_accuracy.py
# Run batch pipeline (no OCR, fast)
python scripts/process_dataset.py --split train --no-ocr
# Start API
uvicorn src.api.main:app --reload
src/
├── core/ # Domain layer (zero dependencies)
│ ├── interfaces/ # Ports: IQualityGate, IOCREngine, IRulesEngine...
│ ├── entities/ # Document, AnalysisResult
│ └── use_cases/ # AnalyzeDocumentUseCase
├── infrastructure/ # Adapters
│ ├── quality/ # OpenCV quality gate
│ ├── ocr/ # EasyOCR (primary) + PaddleOCR v5 (fallback)
│ ├── rules/ # Brazilian doc rules + Passport ICAO rules
│ ├── data/ # COCO dataset loader (MIDV-2020)
│ ├── db/ # SQLAlchemy + pgvector
│ ├── embeddings/ # PgVector similarity search
│ └── storage/ # MinIO object storage
├── api/ # FastAPI application
│ ├── routes/ # /analyze, /cases, /feedback
│ └── schemas/ # Pydantic request/response models
└── config/ # Settings (pydantic-settings)
scripts/
├── process_dataset.py # Batch pipeline processor
├── test_ocr_single.py # EasyOCR single passport test
├── test_dual_ocr.py # PaddleOCR vs EasyOCR comparison
└── analyze_ocr_accuracy.py # MRZ checksum accuracy analysis
data/
├── raw/ # MIDV-2020 dataset (train/valid/test)
└── results/ # Pipeline output JSONs
static/
└── index.html # Dark-mode Web UI
| Component | Code | Tested | Status |
|---|---|---|---|
| COCO DataLoader | ✅ | ✅ | ✅ Production-ready |
| Quality Gate | ✅ | ✅ | ✅ Production-ready |
| Passport Rules | ✅ | ✅ | ✅ Checksums validated via OCR |
| EasyOCR Engine | ✅ | ✅ | ✅ 90% MRZ accuracy |
| PaddleOCR v5 Fallback | ✅ | ✅ | ✅ Higher confidence, slower |
| Batch Processor | ✅ | ✅ | ✅ With OCR integration |
| API + Web UI | ✅ | ✅ | ✅ BR docs only |
- Clean Architecture skeleton
- Quality Gate (OpenCV)
- COCO DataLoader (MIDV-2020)
- Passport Rules Engine (ICAO 9303)
- Batch Pipeline (no OCR)
- EasyOCR integration + MRZ validation (9/10 checksums OK)
- PaddleOCR v5 dual-engine comparison
- Integrate dual-OCR into PassportOCREngine
- Run batch pipeline WITH OCR
- Validate rules engine with real OCR output
- Fraud Simulation — synthetic tampering on MIDV-2020
- Fraud Classifier — binary model
- LLM Integration — semantic anomaly analysis
- Docker Compose deployment