An end-to-end deep learning and computer vision pipeline that classifies 15 types of vegetables in real-time.
This project leverages an optimized Transfer Learning architecture to achieve 99.90% test accuracy, deployed via a custom OpenCV interface with temporal smoothing for highly stable live-webcam inference.
Dataset : https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset
The core of the classifier is a highly optimized Transfer Learning architecture. Rather than training from scratch, this pipeline leverages EfficientNetV2B0 as an advanced feature extractor and capped with a heavily regularized, custom-built classification head.
I've specifically chosen EfficientNetV2 for this project because it's a pretrained model where the base model was instantiated with weights pre-trained on the ImageNet dataset. During Phase 1 of training, these core layers were strictly frozen in order to protect model's foundational understanding of complex shapes, edges, and visual textures. Also, its parameter efficiency and real-time edge-device inference capabilities are off the roof & by utilizing Fused-MBConv blocks, EfficientNetV2 maximizes spatial feature extraction while minimizing computational latency, achieving high Frames Per Second (FPS) in the live webcam feed.
Data Pipeline: 21,000 images across 15 classes, processed using TensorFlow's high-performance tf.data.AUTOTUNE pipeline for bottleneck-free loading.
The original 1000-class ImageNet top was removed (include_top=False), and a custom regularized head was attached specifically for the 15 vegetable classes
- GlobalAveragePooling2D: Replaces the traditional Flatten layer to heavily reduce the number of trainable parameters, making the model lighter and naturally resistant to overfitting.
- Dense Layer 1: Exactly 256 neurons, designed to find complex non-linear combinations of the features extracted by EfficientNet.
- Dense Layer 2: Exactly 128 neurons, continuing to distill the feature maps into highly specific vegetable profiles.
- Regularization Matrix: To prevent the model from memorizing the training data, Batch Normalization (standardizing activations to prevent internal covariate shift) and Dropout Layers (stochastically severing neural connections) are interspersed between the Dense layers.
- Terminal Output Layer: Exactly 15 neurons utilizing the SoftMax activation function to produce a normalized probability distribution across the 15 vegetable classes.
Training a transfer learning model on 21,000 images required strict data management and phased training approach to prevent destroying the highly valuable pre-trained weights.
-
Dataset Split & Ingestion The model was trained on a strictly balanced, 21,000-image dataset (exactly 1,400 images per class), divided into three isolated partitions:
-
Training (70%): 15,000 images dedicated to gradient calculation and weight updates.
-
Validation (15%): 3,000 images utilized for active hyperparameter tuning, callback monitoring, and checkpointing.
-
Testing (15%): 3,000 images held in absolute isolation, used exclusively for generating final real-world performance metrics.
-
Optimization Directives: The compilation and training loop is mainly followed by strict mathematical directives to ensure convergence without overshooting the local minimums:
-
Optimizer: Adam (Adaptive Moment Estimation), combining the best properties of AdaGrad and RMSProp.
-
Loss Function: Categorical Cross-Entropy, paired perfectly with the one-hot encoded dataset.
-
Batch Size: 32 (Calculated as per local hardware).
-
Early Stopping: Monitored validation loss with a set patience threshold to halt training the moment the model began to overfit.
-
ReduceLROnPlateau: Dynamically lowered the learning rate by factors of 0.2, allowing the optimizer to settle into microscopic local minimums.
-
ModelCheckpoint: Silently monitored the epochs and permanently saved only the iteration with the highest validation accuracy, overwriting inferior versions.
- High-Accuracy Transfer Learning: Powered by a fine-tuned
EfficientNetV2B0architecture, utilizing pre-trained ImageNet weights for highly efficient feature extraction. - Smart UI & Temporal Smoothing: The OpenCV interface uses a 45-frame memory buffer (
collections.deque) to "vote" on predictions, completely eliminating UI flickering caused by shadows or poor lighting. - Apple Silicon Optimized: Bypassed standard prediction memory leaks to achieve ultra-smooth 30+ FPS on M-Series Mac hardware.
- Classes: Bean, Bitter Gourd, Bottle Gourd, Brinjal, Broccoli, Cabbage, Capsicum, Carrot, Cauliflower, Cucumber, Papaya, Potato, Pumpkin, Radish, Tomato.
- The "Sim-to-Real" Gap: Realized that high dataset accuracy doesn't translate perfectly to webcams due to background noise and harsh lighting. Solved by implementing strict confidence thresholds (>85%) and a dynamic temporal voting system.
- Dataset Bias: Identified regional variance issues (e.g., Indian cucumbers vs. Western cucumbers) causing misclassifications with bottle gourds, requiring careful environmental testing & proper lighting conditions to detect it correctly
-
Model Performance: The EfficientNetV2B0 architecture achieved near-perfect classification metrics on the isolated test set of 3,000 images (representing 15% of the total dataset). The high test accuracy confirms that the heavy regularization strategy (Global Average Pooling + 40% Dropout) successfully prevented the model from overfitting to the training distribution.
-
Final Test Accuracy: 99.90%
-
Validation Loss: Stabilized near 0.01 after Phase 2 micro-fine-tuning.
-
Real-Time Model's Performance: The true test of the model was its deployment to a live webcam feed running on my Apple Silicon (M2 Pro) environment. By converting batch-dependent model.predict() calls into direct tensor execution, the pipeline achieved a rock-solid, crash-free 30+ Frames Per Second (FPS). Also, the system ran continuously without triggering the memory leak issues common to TensorFlow's Metal backend during continuous video loops.
-
The "Simulation to Real" Gap: During live testing, the model was subjected to challenging real-world environmental factors not present in the clean training dataset. Deep shadows initially caused false positives (e.g., classifying a shadowed tomato as a capsicum). So, to fix these issued I've implemented a 45-frame Temporal Smoothing Voting System, which required an 80% supermajority over 1.5 seconds to confirm a target, ignoring split-second lighting anomalies. By implementing a strict >85% confidence threshold, the model successfully ignored complex backgrounds (living rooms, human faces, blank walls) without "hallucinating" false vegetable classifications.
-
Regional Dataset Bias: Live testing identified a domain shift regarding physical vegetable variations (e.g., smooth Indian Kheera cucumbers vs. bumpy Western cucumbers). The temporal UI successfully stabilized predictions, but it highlighted an area for future dataset augmentation.