Vegetable Detector: Real-Time Transfer Learning Detector

An end-to-end deep learning and computer vision pipeline that classifies 15 types of vegetables in real time.

This project leverages an optimized Transfer Learning architecture to achieve 99.90% test accuracy, deployed via a custom OpenCV interface with temporal smoothing for highly stable live-webcam inference.

Dataset: https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset

Model Architecture

The core of the classifier is a Transfer Learning architecture. Rather than training from scratch, the pipeline uses EfficientNetV2B0 as a feature extractor, capped with a heavily regularized, custom-built classification head.

I chose EfficientNetV2 for this project for two reasons. First, the base model is instantiated with weights pre-trained on the ImageNet dataset. During Phase 1 of training, these base layers were frozen to protect the model's foundational understanding of complex shapes, edges, and visual textures. Second, it is parameter-efficient and well suited to real-time inference on edge devices: its Fused-MBConv blocks keep feature extraction fast, which helps sustain a high frame rate (FPS) in the live webcam feed.

Data Pipeline: 21,000 images across 15 classes, loaded through a tf.data input pipeline tuned with tf.data.AUTOTUNE so that data loading does not become a bottleneck.

The original 1000-class ImageNet top was removed (include_top=False), and a custom regularized head was attached for the 15 vegetable classes:

GlobalAveragePooling2D: Replaces the traditional Flatten layer, sharply reducing the number of trainable parameters in the head and making the model lighter and less prone to overfitting.
Dense Layer 1: 256 neurons, designed to find non-linear combinations of the features extracted by EfficientNet.
Dense Layer 2: 128 neurons, further distilling those features into vegetable-specific representations.
Regularization: To keep the model from memorizing the training data, Batch Normalization (standardizing activations between layers) and Dropout layers (randomly zeroing activations during training) are interspersed between the Dense layers.
Output Layer: 15 neurons with the SoftMax activation function, producing a normalized probability distribution across the 15 vegetable classes.
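
The head described above could be assembled roughly as follows. This is a minimal sketch, not the project's actual code: the ReLU activations, the ordering of Dense, BatchNormalization, and Dropout, the 224x224 input size, and the `build_model` name are assumptions. The 40% dropout rate is the one quoted in the Results section.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 15  # vegetable classes listed in this README


def build_model(input_shape=(224, 224, 3), weights="imagenet", dropout_rate=0.4):
    """EfficientNetV2B0 backbone with a small regularized classification head."""
    # include_top=False drops the original 1000-class ImageNet classifier.
    base = tf.keras.applications.EfficientNetV2B0(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base.trainable = False  # Phase 1: keep the pre-trained weights frozen

    inputs = tf.keras.Input(shape=input_shape)
    # training=False keeps the backbone's BatchNorm layers in inference mode.
    x = base(inputs, training=False)
    x = layers.GlobalAveragePooling2D()(x)

    # Assumed ordering: Dense, then BatchNormalization, then Dropout.
    x = layers.Dense(256, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout_rate)(x)

    x = layers.Dense(128, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout_rate)(x)

    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

Passing `weights=None` builds the same graph without downloading the ImageNet weights, which is handy for quick shape checks.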

Training Methodology

Training a transfer learning model on 21,000 images required strict data management and a phased training approach to avoid destroying the valuable pre-trained weights.

Dataset Split & Ingestion: The model was trained on a strictly balanced, 21,000-image dataset (exactly 1,400 images per class), divided into three isolated partitions:
Training (about 71%): 15,000 images dedicated to gradient calculation and weight updates.
Validation (about 14%): 3,000 images used for hyperparameter tuning, callback monitoring, and checkpointing.
Testing (about 14%): 3,000 images held in absolute isolation, used exclusively for generating final performance metrics.
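
One way to ingest the three partitions is sketched below. It assumes each split lives in its own folder with one sub-folder per class; the `load_split` helper, the folder paths in the usage comment, and the 224x224 image size are illustrative assumptions rather than details taken from this project.

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)  # assumed input resolution
BATCH_SIZE = 32          # batch size listed in this README


def load_split(directory, shuffle):
    """Load one split from a folder that has one sub-folder per class."""
    ds = tf.keras.utils.image_dataset_from_directory(
        directory,
        label_mode="categorical",  # one-hot labels, matching categorical cross-entropy
        image_size=IMAGE_SIZE,
        batch_size=BATCH_SIZE,
        shuffle=shuffle,
    )
    # Let tf.data choose the prefetch buffer size so loading overlaps with training.
    return ds.prefetch(tf.data.AUTOTUNE)


# Hypothetical usage (paths are placeholders):
# train_ds = load_split("data/train", shuffle=True)
# val_ds = load_split("data/validation", shuffle=False)
# test_ds = load_split("data/test", shuffle=False)
```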
Optimization Directives: The compilation and training loop follows a fixed set of settings chosen to reach convergence without overshooting:
Optimizer: Adam (Adaptive Moment Estimation), which combines ideas from AdaGrad and RMSProp.
Loss Function: Categorical Cross-Entropy, matching the one-hot encoded labels.
Batch Size: 32 (chosen to fit local hardware).
Early Stopping: Monitored validation loss with a set patience threshold, halting training once the model stopped improving and began to overfit.
ReduceLROnPlateau: Dynamically lowered the learning rate by a factor of 0.2 (new rate = old rate × 0.2) when progress stalled, letting the optimizer settle more precisely into a minimum.
ModelCheckpoint: Monitored each epoch and saved only the iteration with the highest validation accuracy, overwriting inferior versions.
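
The settings above map onto standard Keras calls, as sketched below. The patience values, the minimum learning rate, the initial learning rate, the checkpoint filename, and the helper names are assumptions for illustration; only the optimizer, loss, monitored quantities, and the 0.2 factor come from this README.

```python
import tensorflow as tf


def compile_model(model, learning_rate=1e-3):
    """Compile with Adam and categorical cross-entropy (learning rate is assumed)."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def make_callbacks(checkpoint_path="best_model.keras"):
    """Early stopping, LR reduction on plateau, and best-only checkpointing."""
    return [
        # Stop when validation loss has not improved for `patience` epochs.
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        ),
        # Multiply the learning rate by 0.2 when validation loss plateaus.
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.2, patience=2, min_lr=1e-6
        ),
        # Keep only the epoch with the highest validation accuracy.
        tf.keras.callbacks.ModelCheckpoint(
            checkpoint_path, monitor="val_accuracy", mode="max", save_best_only=True
        ),
    ]


# Hypothetical usage:
# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=make_callbacks())
```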

Key Features

High-Accuracy Transfer Learning: Powered by a fine-tuned EfficientNetV2B0 architecture, using pre-trained ImageNet weights for efficient feature extraction.
Smart UI & Temporal Smoothing: The OpenCV interface uses a 45-frame memory buffer (collections.deque) to "vote" on predictions, suppressing the UI flickering caused by shadows or poor lighting.
Apple Silicon Optimized: Avoids the memory leaks seen with repeated model.predict() calls in a video loop, sustaining a smooth 30+ FPS on M-Series Mac hardware.
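
The per-frame inference path behind the Apple Silicon item can be sketched as follows, assuming it refers to the change described in the Results section: calling the model directly on a tensor instead of going through `model.predict()`. The `predict_frame` helper is a hypothetical name, and the frame is assumed to be already resized and converted to RGB.

```python
import numpy as np
import tensorflow as tf


def predict_frame(model, frame_rgb):
    """Classify one preprocessed RGB frame with a direct model call."""
    # Add a batch dimension: (H, W, 3) -> (1, H, W, 3).
    batch = tf.convert_to_tensor(frame_rgb[np.newaxis, ...], dtype=tf.float32)
    # Calling the model directly skips the batching machinery of model.predict(),
    # which is built for large inputs rather than one frame at a time.
    probs = model(batch, training=False)
    return probs[0].numpy()


# Hypothetical usage inside a webcam loop:
# probs = predict_frame(model, frame)
# class_index, confidence = int(probs.argmax()), float(probs.max())
```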

Supported Classes

Classes: Bean, Bitter Gourd, Bottle Gourd, Brinjal, Broccoli, Cabbage, Capsicum, Carrot, Cauliflower, Cucumber, Papaya, Potato, Pumpkin, Radish, Tomato.

Engineering Challenges Overcome

The "Sim-to-Real" Gap: High dataset accuracy does not translate perfectly to webcams because of background noise and harsh lighting. Solved by implementing strict confidence thresholds (>85%) and a dynamic temporal voting system.
Dataset Bias: Identified regional variance issues (e.g., Indian cucumbers vs. Western cucumbers) that caused cucumbers to be misclassified as bottle gourds. Detecting them correctly required careful environmental testing and proper lighting conditions.

Results & Evaluation

Model Performance: The EfficientNetV2B0 model achieved near-perfect classification metrics on the isolated test set of 3,000 images (about 14% of the total dataset). The high test accuracy indicates that the heavy regularization strategy (Global Average Pooling + 40% Dropout) kept the model from overfitting to the training distribution.
Final Test Accuracy: 99.90%
Validation Loss: Stabilized near 0.01 after Phase 2 micro-fine-tuning.
Real-Time Performance: The true test of the model was its deployment to a live webcam feed on my Apple Silicon (M2 Pro) machine. Replacing per-frame model.predict() calls, which are designed for batch workloads, with direct tensor execution let the pipeline sustain a crash-free 30+ Frames Per Second (FPS). The system also ran continuously without triggering the memory leak issues common to TensorFlow's Metal backend during continuous video loops.
The "Simulation to Real" Gap: During live testing, the model was subjected to challenging real-world environmental factors not present in the clean training dataset. Deep shadows initially caused false positives (e.g., classifying a shadowed tomato as a capsicum). To fix this, I implemented a 45-frame Temporal Smoothing Voting System that requires an 80% supermajority over 1.5 seconds (45 frames at 30 FPS) to confirm a target, so split-second lighting anomalies are ignored. Combined with a strict >85% confidence threshold, this let the model ignore complex backgrounds (living rooms, human faces, blank walls) without "hallucinating" false vegetable classifications.
Regional Dataset Bias: Live testing identified a domain shift in physical vegetable variations (e.g., smooth Indian Kheera cucumbers vs. bumpy Western cucumbers). The temporal UI stabilized predictions, but the issue highlighted an area for future dataset augmentation.
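
The voting scheme described above can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the project's code: the `TemporalVoter` class is a hypothetical name, and treating low-confidence frames as abstentions that still occupy a slot in the buffer is one possible reading of how the confidence threshold and the vote interact.

```python
from collections import Counter, deque


class TemporalVoter:
    """Confirm a label only when it wins a supermajority of recent frames."""

    def __init__(self, window=45, min_confidence=0.85, supermajority=0.80):
        self.window = window
        self.min_confidence = min_confidence
        self.supermajority = supermajority
        self.history = deque(maxlen=window)  # oldest frames drop off automatically

    def update(self, label, confidence):
        """Record one frame's top prediction; return the confirmed label or None."""
        # Frames at or below the confidence threshold are stored as abstentions.
        self.history.append(label if confidence > self.min_confidence else None)
        if len(self.history) < self.window:
            return None  # not enough evidence yet
        votes = Counter(v for v in self.history if v is not None)
        if not votes:
            return None
        winner, count = votes.most_common(1)[0]
        # Abstentions count against the winner because the divisor is the full window.
        if count / len(self.history) >= self.supermajority:
            return winner
        return None
```

With the default settings, a label is confirmed only after it has been the confident top prediction in at least 36 of the last 45 frames, so a brief shadow or a passing background object cannot flip the display.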
