Skip to content

dariusmark-tech/Developing-Multi-Input-Models-For-OCR

Repository files navigation

DigiNsure OCR — Multi-Modal Insurance Document Classifier

A PyTorch OCR model that classifies scanned insurance IDs as primary or secondary using both image data and insurance type — built for DigiNsure Inc.'s document digitisation initiative.


Table of Contents


Overview

DigiNsure Inc. is digitising historical insurance claim documents. This project trains a multi-modal OCR classifier that:

  • Takes a 64×64 greyscale scan of an insurance document as input
  • Also takes the insurance type (home, life, auto, health, or other) as a second input
  • Outputs whether the scanned ID is a primary_id or secondary_id

Multi-modal learning lets the model capture nuances that neither the image nor the type label could provide alone — for example, certain ID patterns may be exclusive to life or auto policies.


What I Learned

This was my first hands-on project combining computer vision and text features in a single PyTorch model. Here are the main takeaways:

Multi-Modal Fusion

Using two different data types (image + categorical text) as inputs to one model. Each modality gets its own processing branch, and the features are concatenated before the classifier — this is called late fusion.

Convolutional Layers

Conv2d slides learnable filters over an image to detect local patterns (edges, curves, character strokes). Key parameters I worked with:

Parameter Value Why
in_channels 1 Greyscale image (single channel)
out_channels 16 Learn 16 different feature detectors
kernel_size 3 Look at 3×3 pixel patches
padding 1 Preserve 64×64 spatial size after conv

Tensor Shape Tracking

One of the most important debugging habits — tracing shapes through every layer:

Input        →  (B, 1, 64, 64)
Conv2d       →  (B, 16, 64, 64)   # padding=1 preserves size
MaxPool2d(2) →  (B, 16, 32, 32)   # halves spatial dims
Flatten      →  (B, 16384)         # 16 × 32 × 32

Late Fusion with torch.cat

combined = torch.cat([img_feat, type_feat], dim=1)  # (B, 16400)

dim=1 concatenates along the feature dimension, not the batch — B stays the same.

Safe File Handling

# ✅ Always use context manager
with open('dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)

# ❌ Avoid — leaves file handle open
dataset = pickle.load(open('dataset.pkl', 'rb'))

Dataset

Property Value
Total samples 100
Image size 64 × 64 pixels, greyscale
Insurance types home, life, auto, health, other
Type encoding One-hot vector of shape (5,)
Labels primary_id (0) / secondary_id (1)
Format .pkl — serialised ProjectDataset object

Each dataset item is a tuple (img_tensor, label) where:

  • img_tensor[0] — image channel, shape (1, 64, 64), normalised to [0, 1]
  • img_tensor[1] — one-hot type vector, shape (5,)

Model Architecture

┌─────────────────────────────┐     ┌────────────────────┐
│       IMAGE BRANCH          │     │    TYPE BRANCH     │
│                             │     │                    │
│  Input: (B, 1, 64, 64)      │     │  Input: (B, 5)     │
│         ↓                   │     │        ↓           │
│  Conv2d(1→16, k=3, p=1)     │     │  Linear(5→16)      │
│  ReLU                       │     │  ReLU              │
│  MaxPool2d(2)               │     │                    │
│  Flatten → (B, 16384)       │     │  → (B, 16)         │
└──────────────┬──────────────┘     └────────┬───────────┘
               │                             │
               └──────────┬──────────────────┘
                          ↓
              torch.cat([...], dim=1)
                    (B, 16400)
                          ↓
              Linear(16400 → 256)
                    ReLU
                  Dropout(0.3)
              Linear(256 → 2)
                          ↓
              Logits: (B, 2)
         [primary_id | secondary_id]

Why this design?

  • Separate branches — image and type are very different data types; processing them separately first gives each branch freedom to learn modality-specific features
  • Late fusion — concatenating after processing (not at the input) is standard for heterogeneous modalities
  • Dropout(0.3) — with only 100 training samples, regularisation is critical to prevent overfitting

Project Structure

📁 digiNsure-ocr/
├── 📓 notebook.ipynb              # Main Jupyter notebook
├── 🐍 project_utils.py            # ProjectDataset class
├── 📦 ocr_insurance_dataset.pkl   # Serialised dataset
└── 🖼️  digitizing_team.png        # Team asset

Setup & Usage

Requirements

pip install torch torchvision numpy matplotlib

Run the notebook

# Cell 1 — Load and visualise data
import pickle
from project_utils import ProjectDataset

with open('ocr_insurance_dataset.pkl', 'rb') as f:
    dataset = pickle.load(f)

show_dataset_images(dataset, num_images=5)
# Cell 2 — Define and verify the model
model = OCRModel()

dummy_img  = torch.zeros(4, 1, 64, 64)
dummy_type = torch.zeros(4, 5)
output     = model(dummy_img, dummy_type)

print(output.shape)   # torch.Size([4, 2]) ✅
# Cell 3 — Train
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, types, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(images, types)
        loss    = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

Key Concepts

Term Description
Multi-modal Model that accepts inputs from more than one data type
One-hot encoding Vector with a single 1 representing a category; all others 0
Conv2d 2D convolutional layer — slides learnable filters over an image
padding=1 Zero-border that preserves spatial size when kernel_size=3
MaxPool2d Downsampling via max value in each window — halves spatial dims
ReLU Activation: max(0, x) — introduces non-linearity
Flatten Reshapes (B, C, H, W)(B, C×H×W) for linear layers
Late fusion Combine modalities after separate processing, before classifier
Dropout Randomly zeroes neurons during training to reduce overfitting
Logits Raw model outputs before softmax — two values, one per class

Results

Training for 10 epochs using:

  • Optimiser: Adam (lr=1e-3)
  • Loss: CrossEntropyLoss
  • Batch size: configurable via DataLoader

💡 Tips & Gotchas

  • Always run a dummy forward pass after defining the model — catches shape errors before training
  • img_tensor[0] = image, img_tensor[1] = one-hot type — they are packed together per sample
  • torch.cat(dim=1) = join features, not batches
  • Dropout only activates during model.train() — remember to call model.eval() at inference
  • The image sequential must be named image_layer — naming matters for checkpointing

Built as part of the DigiNsure document digitisation initiative.

About

Optical Character Recognition (OCR) unlocks numerous possibilities in data management. By leveraging multi-modal modeling, paired with robust training and optimization, you'll deploy deep learning techniques to classify insurance codes from scanned insurance documents and develop an advanced OCR model.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors