This repository contains a complete, modular PyTorch computer vision pipeline designed to classify scanned documents and images into 16 distinct categories (e.g., invoices, emails, resumes).
The project demonstrates a full deep learning workflow, starting from a custom Convolutional Neural Network (CNN) baseline and scaling up to state-of-the-art Transfer Learning architectures using Two-Phase Fine-Tuning.
This pipeline was used to benchmark three different network architectures against the same dataset:
- Custom CNN Baseline: A 3-block convolutional network built from scratch with Max Pooling and Dropout for baseline metric establishment.
- ResNet-34: Transfer Learning via a replaced ImageNet classification head, with the pre-trained backbone frozen while the new head trains (Phase 1), followed by full-network fine-tuning at a very small learning rate (Phase 2).
- EfficientNet-B3: Scaling up to a highly optimized, NAS-designed architecture to maximize validation accuracy using the same two-phase fine-tuning strategy.
    ├── data/
    │   └── docs-sm/        # Dataset: 16 subfolders representing classes (not included)
    ├── models/             # Saved .pth weights for best-performing models
    ├── src/
    │   ├── config.py       # Centralized hyperparameters and paths
    │   ├── dataset.py      # DataLoaders, splits, and dynamic augmentation wrappers
    │   ├── custom_cnn.py   # Baseline CNN architecture
    │   ├── train.py        # Universal training loop (train + validate)
    │   ├── evaluate.py     # Test-set scoring and heatmap generation
    │   └── inference.py    # Single-image inference script for production
    ├── notebooks/          # Jupyter notebooks for sandbox experimentation
    ├── README.md
    └── requirements.txt
Because documents rely on layout and structure rather than color, the data pipeline was heavily optimized to prevent the models from memorizing specific training images:
- Normalization: Standard image normalization is applied across the dataset.
- Subset-Specific Augmentation: a custom `ApplyTransform` wrapper applies Random Rotation and Resized Cropping exclusively to the training set to prevent overfitting, leaving the Validation and Test sets completely clean and static.
All models were evaluated against the exact same holdout Test set.
| Architecture | Parameters | Final Test Accuracy | Notes |
|---|---|---|---|
| ResNet-34 | ~21 Million | 64.67% | Peak performance utilizing Two-Phase Fine-Tuning. |
| EfficientNet-B3 | ~12 Million | 60.13% | Highly capable, but would likely benefit from heavier dropout or further hyperparameter tuning to curb overfitting on this dataset. |
| Custom CNN | ~51.4 Million | 58.93% | Baseline. A massive, parameter-heavy architecture trained from scratch for only 15 epochs. |
Analysis: The performance gap between the custom baseline and the pre-trained architectures (ResNet/EfficientNet) is remarkably narrow, even though the Custom CNN carries far more parameters (~51.4M vs. ~21M). This is a classic case of Domain Shift: ImageNet-trained models are optimized for natural color photography, so their pre-trained weights require heavy adaptation before they can read the structural, largely grayscale layouts of documents. The Custom CNN, while brute-forcing the problem with a massive dense layer, learned domain-specific layout features from the start.
To break through the 65% accuracy ceiling, future iterations of this pipeline will focus on:
- Architectural Optimization: Replacing the massive `Flatten` layer in the Custom CNN with `AdaptiveAvgPool2d` (Global Average Pooling) to drastically reduce parameter count and prevent overfitting.
- Vision Transformers (ViTs): Swapping standard CNNs for architectures like `ViT` or `Swin Transformer` to capture global document context rather than local pixel textures.
- Multimodal Document AI (Hugging Face): Moving beyond purely visual models to state-of-the-art Document Understanding models like LayoutLM or Donut, which utilize OCR to read the actual text on the page alongside the spatial layout.
- Hyperparameter Optimization: Implementing Learning Rate Schedulers (e.g., Cosine Annealing) and systematic Grid Search to find the mathematically optimal dropout rates and weight decay for Phase 2 fine-tuning.
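The first bullet can be illustrated with a quick parameter count. The feature-map shape here (128 maps at 28×28) is an assumption chosen only to show the scale of the savings, not the Custom CNN's actual dimensions:

```python
import torch
import torch.nn as nn

# Head built on a Flatten layer: every spatial position feeds the dense layer.
flatten_head = nn.Sequential(
    nn.Flatten(),                     # 128 * 28 * 28 = 100,352 features
    nn.Linear(128 * 28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 16),
)

# Head built on Global Average Pooling: each map collapses to one value.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # (N, 128, 28, 28) -> (N, 128, 1, 1)
    nn.Flatten(),                     # 128 features, independent of spatial size
    nn.Linear(128, 16),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(2, 128, 28, 28)
assert flatten_head(x).shape == gap_head(x).shape == (2, 16)
print(n_params(flatten_head), n_params(gap_head))  # ~51.4M vs ~2K parameters
```

Both heads produce identical output shapes, but the pooled head is smaller by roughly four orders of magnitude, which is the overfitting lever the bullet refers to.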
To run the training loop (which automatically saves the best .pth weights based on Validation Accuracy):
    python src/train.py
(Note: To train ResNet or EfficientNet, you can swap the model initialization inside the script or run the two-phase pipeline found in the experimental Jupyter notebooks).
To score a trained model against the unseen Test split and generate a Per-Class Accuracy breakdown alongside a Seaborn Confusion Matrix Heatmap:
    python src/evaluate.py --model models/best_doc_model.pth
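The core of that evaluation output can be sketched with scikit-learn and Seaborn. This is a minimal stand-in using made-up labels, not the repo's actual `evaluate.py`:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering (no display needed)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for real test-set predictions.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

cm = confusion_matrix(y_true, y_pred)
# Per-class accuracy = correct predictions on the diagonal / row totals.
per_class_acc = cm.diagonal() / cm.sum(axis=1)

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.savefig("confusion_matrix.png")
```

With 16 document classes the same code applies unchanged; only the label lists grow.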
To predict the class of a brand-new, unseen document image:
    python src/inference.py data/sample_invoice.jpg --model models/best_doc_model.pth
Output Example:
    ========================================
    File: docs-sm/file_folder/10073626.jpg
    Prediction: file_folder
    Confidence: 52.16%
    ========================================
torch==2.10.0
torchvision==0.25.0
numpy==1.26.4
matplotlib==3.9.2
seaborn==0.13.2
scikit-learn==1.5.1
pillow==12.1.1
Dataset source: https://www.kaggle.com/datasets/shaz13/real-world-documents-collections/data