This repository contains a complete, modular PyTorch computer vision pipeline designed to classify scanned documents and images into 16 distinct categories (e.g., invoices, emails, resumes).
The project demonstrates a full deep learning workflow, starting from a custom Convolutional Neural Network (CNN) baseline and scaling up to state-of-the-art Transfer Learning architectures using Two-Phase Fine-Tuning.
This pipeline was used to benchmark three different network architectures against the same dataset:
- Custom CNN Baseline: A 3-block convolutional network built from scratch with Max Pooling and Dropout for baseline metric establishment.
- ResNet-34: Transfer Learning via a replaced ImageNet classification head, with the pre-trained backbone frozen while the new head trains (Phase 1), followed by full-network fine-tuning at a very small learning rate (Phase 2).
- EfficientNet-B3: Scaling up to a highly optimized, NAS-designed architecture to maximize validation accuracy using the same two-phase fine-tuning strategy.
    ├── data/
    │   └── docs-sm/        # Dataset: 16 subfolders representing classes (not included)
    ├── models/             # Saved .pth weights for best-performing models
    ├── src/
    │   ├── config.py       # Centralized hyperparameters and paths
    │   ├── dataset.py      # DataLoaders, splits, and dynamic augmentation wrappers
    │   ├── custom_cnn.py   # Baseline CNN architecture
    │   ├── train.py        # Universal training loop (train + validate)
    │   ├── evaluate.py     # Test-set scoring and heatmap generation
    │   └── inference.py    # Single-image inference script for production
    ├── notebooks/          # Jupyter notebooks for sandbox experimentation
    ├── README.md
    └── requirements.txt
Because documents rely on layout and structure rather than color, the data pipeline was heavily optimized to prevent the models from memorizing specific training images:
- Normalization: Standard image normalization is applied across the dataset.
- Subset-Specific Augmentation: a custom `ApplyTransform` wrapper applies Random Rotation and Resized Cropping exclusively to the training set to prevent overfitting, leaving the Validation and Test sets completely clean and static.
All models were evaluated against the exact same holdout Test set.
| Architecture | Parameters | Final Test Accuracy | Notes |
|---|---|---|---|
| ResNet-34 | ~21 Million | 64.67% | Peak performance utilizing Two-Phase Fine-Tuning. |
| EfficientNet-B3 | ~12 Million | 60.13% | Highly capable, but would likely benefit from heavier dropout or further hyperparameter tuning to curb overfitting on this dataset. |
| Custom CNN | ~51.4 Million | 58.93% | Baseline. A massive, parameter-heavy architecture trained from scratch for only 15 epochs. |
Analysis: The performance gap between the custom baseline and the pre-trained architectures (ResNet/EfficientNet) is remarkably narrow, even though the Custom CNN carries far more parameters (~51.4M vs. ~21M). This is a classic case of Domain Shift: ImageNet-trained models are optimized for natural color photography, so their pre-trained weights require heavy adaptation before they can read the structural, largely grayscale layouts of documents. The Custom CNN, while brute-forcing the problem with a massive dense layer, learned domain-specific layout features from the start.
To break through the 65% accuracy ceiling, future iterations of this pipeline will focus on:
- Architectural Optimization: Replacing the massive `Flatten` layer in the Custom CNN with `AdaptiveAvgPool2d` (Global Average Pooling) to drastically reduce parameter count and prevent overfitting.
- Vision Transformers (ViTs): Swapping standard CNNs for architectures like `ViT` or `Swin Transformer` to capture global document context rather than local pixel textures.
- Multimodal Document AI (Hugging Face): Moving beyond purely visual models to state-of-the-art Document Understanding models like LayoutLM or Donut, which utilize OCR to read the actual text on the page alongside the spatial layout.
- Hyperparameter Optimization: Implementing Learning Rate Schedulers (e.g., Cosine Annealing) and systematic Grid Search to find the mathematically optimal dropout rates and weight decay for Phase 2 fine-tuning.
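The first bullet can be illustrated with a quick parameter count. The feature-map shape here (128 maps at 28×28) is an assumption chosen only to show the scale of the savings, not the Custom CNN's actual dimensions:

```python
import torch
import torch.nn as nn

# Head built on a Flatten layer: every spatial position feeds the dense layer.
flatten_head = nn.Sequential(
    nn.Flatten(),                     # 128 * 28 * 28 = 100,352 features
    nn.Linear(128 * 28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 16),
)

# Head built on Global Average Pooling: each map collapses to one value.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # (N, 128, 28, 28) -> (N, 128, 1, 1)
    nn.Flatten(),                     # 128 features, independent of spatial size
    nn.Linear(128, 16),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(2, 128, 28, 28)
assert flatten_head(x).shape == gap_head(x).shape == (2, 16)
print(n_params(flatten_head), n_params(gap_head))  # ~51.4M vs ~2K parameters
```

Both heads produce identical output shapes, but the pooled head is smaller by roughly four orders of magnitude, which is the overfitting lever the bullet refers to.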
To run the training loop (which automatically saves the best .pth weights based on Validation Accuracy):
    python src/train.py
(Note: To train ResNet or EfficientNet, you can swap the model initialization inside the script or run the two-phase pipeline found in the experimental Jupyter notebooks).
To score a trained model against the unseen Test split and generate a Per-Class Accuracy breakdown alongside a Seaborn Confusion Matrix Heatmap:
    python src/evaluate.py --model models/best_doc_model.pth
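The core of that evaluation output can be sketched with scikit-learn and Seaborn. This is a minimal stand-in using made-up labels, not the repo's actual `evaluate.py`:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering (no display needed)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for real test-set predictions.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

cm = confusion_matrix(y_true, y_pred)
# Per-class accuracy = correct predictions on the diagonal / row totals.
per_class_acc = cm.diagonal() / cm.sum(axis=1)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.savefig("confusion_matrix.png")
```

With 16 document classes the same code applies unchanged; only the label lists grow.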
To predict the class of a brand-new, unseen document image:
    python src/inference.py data/sample_invoice.jpg --model models/best_doc_model.pth
Output Example:
    ========================================
    File: docs-sm/file_folder/10073626.jpg
    Prediction: file_folder
    Confidence: 52.16%
    ========================================
torch==2.10.0
torchvision==0.25.0
numpy==1.26.4
matplotlib==3.9.2
seaborn==0.13.2
scikit-learn==1.5.1
pillow==12.1.1
Dataset source: https://www.kaggle.com/datasets/shaz13/real-world-documents-collections/data