An AI-powered forest monitoring system that detects illegal activities (chainsaw sounds, gunshots, and heavy machinery) in real time using deep learning audio classification and Gemini-powered visual analysis.
- Problem Statement
- System Architecture
- Dataset
- Classes Predicted
- Features Used
- Algorithms & Model Evolution
- CNN Architecture (Final)
- Training Strategy
- Results & Accuracy
- Project Structure
- How to Run
- Streamlit Web App
- Deployment
- Tech Stack
Illegal deforestation, poaching, and unlawful forest activity cause irreversible environmental damage. Manual forest patrols are:
- Expensive: they require large teams across vast areas
- Reactive: threats are detected too late
- Impractical at scale: forests can span thousands of kilometres
Forest Guardian AI solves this by providing an automated, AI-powered monitoring system that:
- Listens to forest audio and flags chainsaw sounds, gunshots, and heavy machinery
- Analyzes camera trap images for visual signs of illegal activity
- Works in real time with a confidence threshold to reduce false alarms
┌─────────────────────────────────────────────────┐
│               Forest Guardian AI                │
│                 (Streamlit App)                 │
└──────────────────────┬──────────────────────────┘
                       │
         ┌─────────────┴─────────────┐
         │                           │
         ▼                           ▼
┌─────────────────┐      ┌────────────────────────┐
│  AUDIO MODULE   │      │      IMAGE MODULE      │
│  audio_model/   │      │      image_model/      │
│                 │      │                        │
│ 1. Upload .wav  │      │ 1. Upload image        │
│ 2. Extract Mel  │      │ 2. Send to Gemini API  │
│    Spectrogram  │      │ 3. Parse JSON threat   │
│ 3. Normalize    │      │    assessment          │
│ 4. CNN Predict  │      │ 4. Display result      │
│ 5. Show result  │      └────────────────────────┘
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    CNN MODEL    │  Input: (64 × 130 × 1) Mel Spectrogram
│  4 Conv Blocks  │  Output: 4-class softmax
│  ~4.5M params   │  Threshold: ≥ 65% confidence
└─────────────────┘
Data Flow (Audio):
WAV File → Librosa Load → Mel Spectrogram (64×130) → Normalize → CNN → Softmax Probabilities → Prediction
Data Flow (Image):
Image Upload → PIL → Gemini 2.0 Flash API → JSON Response → Threat / Safe UI
| Property | Detail |
|---|---|
| Format | WAV, mono, 22,050 Hz, 3 seconds |
| Total classes | 4 core + 4 extended (future) |
| Balancing | Augmentation to equalise minority classes |
Sources explored:
- ESC-50 Environmental Sound Dataset
- UrbanSound8K
- Custom-curated and recorded `.wav` files
- Augmented copies of minority classes
Data Cleaning Rules:
- Format: WAV only (MP3 converted via `ffmpeg`/`librosa`)
- Duration: exactly 3 seconds (padded or trimmed)
- Sample rate: 22,050 Hz
- Channels: Mono (stereo mixed down)
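The duration and channel rules above can be sketched with NumPy alone. This is a minimal illustration, not the project's code: the helper names `to_mono` and `fix_length` are hypothetical, and zero-padding at the end of short clips is an assumption (the source only says "padded or trimmed"). Resampling to 22,050 Hz is not shown; `librosa.load(path, sr=22050, mono=True)` handles that and the mixdown in one call.

```python
import numpy as np

SAMPLE_RATE = 22050                   # Hz, from the cleaning rules
DURATION = 3                          # seconds
TARGET_LEN = SAMPLE_RATE * DURATION   # 66,150 samples per clip


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Mix a (channels, samples) array down to mono; pass 1-D input through."""
    return audio if audio.ndim == 1 else audio.mean(axis=0)


def fix_length(audio: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Trim long clips and zero-pad short ones to exactly target_len samples."""
    if len(audio) >= target_len:
        return audio[:target_len]
    # Assumption: silence (zeros) is appended at the end of short clips
    return np.pad(audio, (0, target_len - len(audio)))
```

Applying `fix_length(to_mono(clip))` to every file gives the classifier inputs of identical length, which is what makes a fixed spectrogram width possible.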
| Class | Description | Threat Level |
|---|---|---|
| `chainsaw` | Electric / petrol chainsaw sounds | 🔴 HIGH |
| `gunshot` | Single or burst gunfire | 🔴 HIGH |
| `heavy_machine` | Bulldozers, excavators, trucks | 🔴 HIGH |
| `normal` | Ambient forest: birds, wind, rain | 🟢 SAFE |
Predictions below 65% confidence are flagged as `OTHER / AMBIENT` to prevent false alarms.
| Class | Description |
|---|---|
| `dangerous_animals` | Predator / dangerous wildlife calls |
| `wildlife_large` | Elephant, rhino, large animal movement |
| `mistake_mimicry` | Sounds that mimic threats but are not |
| `human_activity` | Footsteps, voices (non-illegal) |
Raw audio waveforms are converted into Mel Spectrograms: 2D image representations of sound that mimic how the human ear perceives frequency (logarithmic scale). This allows the CNN to treat audio classification as a visual pattern recognition problem.
SAMPLE_RATE = 22050 # Hz
DURATION = 3 # seconds
N_MELS = 64 # Mel filter banks → image height
MAX_LEN = 130 # time-axis columns → image width
Each audio clip becomes a (64 × 130) grayscale image fed into the CNN.
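The constants above fit together arithmetically: a 3-second clip at 22,050 Hz has 66,150 samples, and with a hop of 512 samples (librosa's default, assumed here) a centered STFT yields 1 + 66150 // 512 = 130 frames. The sketch below shows that calculation plus one plausible extraction routine. `extract_mel_sketch` is a hypothetical name and is not the `extract_mel()` in `classifier.py`; converting to decibels is also an assumption.

```python
import numpy as np

SAMPLE_RATE = 22050
DURATION = 3
N_MELS = 64
MAX_LEN = 130
HOP_LENGTH = 512  # assumption: librosa's default hop length


def expected_frames(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count of a centered STFT: one frame per hop, plus one."""
    return 1 + n_samples // hop_length


def extract_mel_sketch(path: str) -> np.ndarray:
    """Load a clip and return a (N_MELS, MAX_LEN) log-Mel spectrogram in dB."""
    import librosa  # imported here so the arithmetic above runs without it

    y, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True, duration=DURATION)
    y = librosa.util.fix_length(y, size=SAMPLE_RATE * DURATION)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SAMPLE_RATE, n_mels=N_MELS, hop_length=HOP_LENGTH
    )
    mel_db = librosa.power_to_db(mel, ref=np.max)  # assumption: dB scaling
    return mel_db[:, :MAX_LEN]


print(expected_frames(SAMPLE_RATE * DURATION))  # 130
```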
| Feature | MFCC | Mel Spectrogram |
|---|---|---|
| Information retained | Compressed | Full spectral detail |
| CNN suitability | Moderate | ✅ Excellent (image-like) |
| Noise robustness | Good | Very good |
| Used in final model | Baseline only | ✅ Yes |
| Technique | Details |
|---|---|
| Time stretching | ±10% speed change |
| Pitch shifting | ±2 semitones |
| Gaussian noise | Small random noise injection |
| Time shift | Circular roll of waveform ±20% |
| SpecAugment | Random frequency + time band masking |
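Three of the techniques in the table need nothing beyond NumPy, so they can be sketched briefly. This is an illustration under stated assumptions, not the project's augmentation code: the function names, the noise level, the mask widths, and the choice to fill masked cells with the spectrogram minimum are all hypothetical. Time stretching and pitch shifting are omitted; librosa provides `librosa.effects.time_stretch` and `librosa.effects.pitch_shift` for those.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def add_gaussian_noise(wave, noise_std=0.005):
    """Add zero-mean Gaussian noise; noise_std is an illustrative value."""
    return wave + rng.normal(0.0, noise_std, size=wave.shape)


def time_shift(wave, max_fraction=0.2):
    """Circularly roll the waveform by up to ±max_fraction of its length."""
    limit = int(len(wave) * max_fraction)
    return np.roll(wave, rng.integers(-limit, limit + 1))


def spec_augment(mel, max_freq_width=8, max_time_width=16):
    """Mask one random frequency band and one random time band."""
    out = mel.copy()
    fill = out.min()  # assumption: masked cells take the spectrogram minimum
    n_mels, n_frames = out.shape
    f_width = rng.integers(0, max_freq_width + 1)
    f_start = rng.integers(0, n_mels - f_width + 1)
    out[f_start:f_start + f_width, :] = fill
    t_width = rng.integers(0, max_time_width + 1)
    t_start = rng.integers(0, n_frames - t_width + 1)
    out[:, t_start:t_start + t_width] = fill
    return out
```

The first two functions act on the raw waveform, while `spec_augment` acts on the (64 × 130) spectrogram, so they would sit at different points in a training pipeline.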
The model went through 4 major iterations:
| Version | Algorithm | Feature | Accuracy |
|---|---|---|---|
| Baseline | Logistic Regression | Raw MFCC | ~62% |
| v1 | Decision Tree | MFCC | ~72% |
| v1 | Random Forest | MFCC | ~78% |
| v2 (CNN v1) | 3-block CNN + Flatten | Mel Spectrogram | ~84.9% |
| v2 (Final) | 4-block CNN + GAP + AdamW + SpecAugment | Mel Spectrogram | 94.4% |
| Change | Impact |
|---|---|
| `Flatten` → `GlobalAveragePooling2D` | Reduced overfitting significantly |
| Added 4th Conv block (256 filters) | Deeper feature learning |
| SpecAugment during training | Better generalisation to real-world noise |
| AdamW optimizer (weight decay) | Prevented weight explosion |
| Label smoothing (0.1) | Prevented overconfident predictions |
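To make the label smoothing row concrete: with smoothing ε and K classes, each one-hot target y becomes y·(1 − ε) + ε/K, which is the formula Keras applies for `label_smoothing`. The sketch below works this through for ε = 0.1 and the four classes; the function name is hypothetical and the choice of index 0 as the true class is only an example.

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Blend a one-hot target with the uniform distribution over classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - smoothing) + smoothing / num_classes for y in one_hot]


# Four classes, with the true class at index 0 (an arbitrary example)
target = smooth_labels([1.0, 0.0, 0.0, 0.0])
print([round(t, 3) for t in target])  # [0.925, 0.025, 0.025, 0.025]
```

Because the target for the true class is 0.925 rather than 1.0, the loss stops rewarding the network for pushing its output all the way to 100%.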
Input: (64, 130, 1) ← Mel Spectrogram as grayscale image
│
├── Conv2D(32, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
│
├── Conv2D(64, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
│
├── Conv2D(128, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
│
├── Conv2D(256, 3×3, ReLU) → BatchNorm → GlobalAveragePooling2D
│
├── Dense(256, ReLU) → Dropout(0.5)
├── Dense(128, ReLU) → Dropout(0.3)
│
└── Dense(4, Softmax) → Output: probability per class
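It can help to trace how the tensor shape changes through the four convolutional blocks listed above. The sketch below does that in plain Python. It assumes `same`-padded convolutions (the padding mode is not stated in the source) and 2×2 max pooling that floors odd dimensions; `trace_shapes` is a hypothetical helper, not part of the repository.

```python
def trace_shapes(height=64, width=130, filters=(32, 64, 128, 256)):
    """Return the (H, W, C) shape after each conv block, then after GAP."""
    shapes = []
    for i, channels in enumerate(filters):
        # Assumption: 'same'-padded 3x3 convolutions keep H and W unchanged
        is_last = i == len(filters) - 1
        if not is_last:
            # 2x2 max pooling halves each spatial dimension (floor division)
            height, width = height // 2, width // 2
        shapes.append((height, width, channels))
    # Global average pooling collapses H and W, leaving one value per channel
    shapes.append((filters[-1],))
    return shapes


for shape in trace_shapes():
    print(shape)
# (32, 65, 32)
# (16, 32, 64)
# (8, 16, 128)
# (8, 16, 256)
# (256,)
```

Under the same assumption, flattening the final (8, 16, 256) map would pass 32,768 values into `Dense(256)`, whereas global average pooling passes 256, which is one way to see why the switch reduces the number of weights that can overfit.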
Model stats:
- Total parameters: ~4.5 million
- Model size: ~52 MB (`.h5` format)
- Input shape: `(64, 130, 1)`
- Output: 4-class softmax
optimizer = AdamW(learning_rate=1e-3, weight_decay=1e-4)
loss = CategoricalCrossentropy(label_smoothing=0.1)
epochs = 50 # with early stopping
batch_size = 32
val_split = 0.20 # 80% train / 20% validation
Callbacks:
| Callback | Config |
|---|---|
| `EarlyStopping` | patience=10, restore best weights |
| `ReduceLROnPlateau` | factor=0.5, patience=5, min_lr=1e-6 |
| `ModelCheckpoint` | saves best val_accuracy model |
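In Keras, the callback settings in the table might be written roughly as follows. This is a sketch, not the project's training script: the `monitor="val_loss"` choice for the first two callbacks is an assumption (the table does not say which metric they watch), and the checkpoint path simply reuses the model filename from the project structure.

```python
from tensorflow import keras

callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_loss",  # assumption: the monitored metric is not stated
        patience=10,
        restore_best_weights=True,
    ),
    keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",  # assumption, as above
        factor=0.5,
        patience=5,
        min_lr=1e-6,
    ),
    keras.callbacks.ModelCheckpoint(
        "forest_sound_model_v2.h5",
        monitor="val_accuracy",
        save_best_only=True,
    ),
]
# Passed to training as: model.fit(..., epochs=50, batch_size=32, callbacks=callbacks)
```

With `patience=5` on the learning-rate schedule and `patience=10` on early stopping, the learning rate gets halved at least once before training is abandoned.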
Evaluated on a held-out test set of original (non-augmented) forest audio files:
| Metric | Value |
|---|---|
| Overall Accuracy | 94.4% |
| Weighted F1-Score | 0.944 |
| Chainsaw Recall | 95% |
| Gunshot Recall | 95% |
| Heavy Machine Recall | 93% |
| Normal Forest Recall | 95% |
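For readers who want to reproduce metrics of this kind on their own test set, the sketch below derives accuracy, per-class recall, and weighted F1 from a confusion matrix. The counts are invented purely for illustration and are NOT the project's results; the class order is also an assumption.

```python
import numpy as np

CLASSES = ["chainsaw", "gunshot", "heavy_machine", "normal"]  # assumed order

# Illustrative counts only, NOT the project's results.
# Rows are true classes, columns are predicted classes.
confusion = np.array([
    [18, 0, 2, 0],
    [0, 19, 0, 1],
    [1, 0, 19, 0],
    [0, 1, 0, 19],
])

true_pos = np.diag(confusion)
support = confusion.sum(axis=1)      # samples per true class
predicted = confusion.sum(axis=0)    # samples per predicted class

recall = true_pos / support
precision = true_pos / predicted
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_pos.sum() / confusion.sum()
weighted_f1 = np.average(f1, weights=support)  # F1 weighted by class support
```

Recall is the figure to watch for the three threat classes, since a missed chainsaw or gunshot costs more than a false alarm that a ranger can dismiss.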
Predictions below 65% confidence are shown as OTHER / AMBIENT, so low-confidence guesses on unfamiliar sounds are not reported as threats.
| Confidence | Label shown | Colour |
|---|---|---|
| ≥ 65% + threat class | Class name (e.g. CHAINSAW) | 🔴 Red |
| ≥ 65% + normal | NORMAL | 🟢 Green |
| < 65% | OTHER / AMBIENT | ⚫ Grey |
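The decision table above maps directly onto a small function. The sketch below is one way to express it; `decide` is a hypothetical name, and the class order in `CLASSES` is an assumption that would have to match the order used when the model was trained.

```python
CLASSES = ["chainsaw", "gunshot", "heavy_machine", "normal"]  # assumed order
THRESHOLD = 0.65


def decide(probabilities, threshold=THRESHOLD):
    """Map softmax probabilities to the (label, colour) pair shown in the UI."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = probabilities[best]
    if confidence < threshold:
        return "OTHER / AMBIENT", "grey"
    if CLASSES[best] == "normal":
        return "NORMAL", "green"
    return CLASSES[best].upper(), "red"


print(decide([0.91, 0.03, 0.04, 0.02]))  # ('CHAINSAW', 'red')
print(decide([0.40, 0.30, 0.20, 0.10]))  # ('OTHER / AMBIENT', 'grey')
print(decide([0.05, 0.05, 0.10, 0.80]))  # ('NORMAL', 'green')
```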
Audio_Classification_Model/
│
├── audio_model/                    # CNN audio classifier
│   ├── __init__.py
│   ├── classifier.py               # extract_mel(), predict(), constants
│   ├── forest_sound_model_v2.h5    # trained model weights (~52 MB, Git LFS)
│   └── norm_stats.npy              # mean & std from training data
│
├── image_model/                    # Gemini visual analyzer
│   ├── __init__.py
│   └── analyzer.py                 # analyze_image() via Gemini API
│
├── app.py                          # Streamlit entry-point (UI only)
├── requirements.txt                # Python dependencies
├── packages.txt                    # Linux system libs (Streamlit Cloud)
├── .gitignore                      # excludes .env, __pycache__, etc.
├── .gitattributes                  # Git LFS config for model files
├── model_summary.md                # Full technical walkthrough
└── README.md
- Python 3.10+
- A Google Gemini API key (get one free at aistudio.google.com)
git clone https://github.com/ishwarc04/Audio_Classification_Model.git
cd Audio_Classification_Model
pip install -r requirements.txt
Create a `.env` file in the project root:
GOOGLE_API_KEY=your_gemini_api_key_here
streamlit run app.py
Open http://localhost:8501 in your browser.
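A missing or placeholder key is a common reason the image tab fails on first run, so it can be worth checking for it explicitly. The sketch below is a hypothetical helper, not code from `app.py`. It reads from `os.environ`, which is where python-dotenv's `load_dotenv()` places the entries of a `.env` file.

```python
import os


def get_api_key(env=None):
    """Return GOOGLE_API_KEY or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    # Reject both an empty value and the unedited placeholder from the README
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("Set GOOGLE_API_KEY in .env before starting the app")
    return key
```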
The app has two tabs:
- Upload a `.wav` file
- App extracts the Mel Spectrogram
- Spectrogram is normalized using saved `norm_stats.npy`
- CNN predicts class probabilities
- Results shown with colour-coded alert + full probability bars
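The normalization step in the audio flow above matters because the CNN must see inputs on the same scale it was trained on. A minimal sketch follows, assuming `norm_stats.npy` stores two scalars, a mean and a standard deviation. The file layout is not documented in the source, and both function names are hypothetical.

```python
import numpy as np


def normalize(mel, mean, std, eps=1e-8):
    """Standardize a spectrogram with training-set statistics."""
    return (mel - mean) / (std + eps)


def to_model_input(mel, mean, std):
    """Normalize and add batch and channel axes: (64, 130) -> (1, 64, 130, 1)."""
    scaled = normalize(mel, mean, std)
    return scaled[np.newaxis, ..., np.newaxis].astype("float32")


# Assumption: norm_stats.npy holds two scalars, [mean, std]
# mean, std = np.load("audio_model/norm_stats.npy")
```

Reusing the saved training statistics, rather than recomputing them per clip, keeps a quiet recording from being stretched to look as loud as a chainsaw.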
- Upload a forest image (`.jpg`, `.png`)
- Image is sent to Gemini 2.0 Flash
- AI returns a structured JSON threat assessment
- Results shown with confidence score + threat/safe status
- Adjustable confidence threshold slider
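For the image flow above, the reply from a language model is text, and it sometimes arrives wrapped in a Markdown code fence, so the JSON parsing step benefits from being defensive. The sketch below shows one approach. The field names `threat_detected` and `confidence` are hypothetical, since the source does not document the response schema, and `parse_assessment` is not the repository's `analyze_image()`.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built here to avoid a literal one


def parse_assessment(raw_text):
    """Parse the model's reply into a dict, tolerating Markdown code fences."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return {"threat_detected": None, "confidence": 0.0}
    return {
        "threat_detected": bool(data.get("threat_detected", False)),
        "confidence": float(data.get("confidence", 0.0)),
    }
```

Returning `None` for `threat_detected` when parsing fails lets the UI distinguish "no threat found" from "no usable answer".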
Deployed on Streamlit Community Cloud (free tier).
- Push repo to a public GitHub repository
- Go to share.streamlit.io and log in with GitHub
- Select `app.py` as the entry point
- Under App Settings → Secrets, add:
GOOGLE_API_KEY = "your_key_here"
- App is live at a public URL, with no server management needed
Key deployment files:
| File | Purpose |
|---|---|
| `app.py` | Streamlit entry point |
| `audio_model/forest_sound_model_v2.h5` | Trained model (Git LFS) |
| `audio_model/norm_stats.npy` | Normalization stats (Git LFS) |
| `requirements.txt` | Python packages |
| `packages.txt` | Linux libs (libsndfile1 for audio) |
| Tool | Purpose |
|---|---|
| Python 3.10+ | Core language |
| TensorFlow / Keras | CNN training & inference |
| Librosa | Audio loading & Mel Spectrogram extraction |
| NumPy / Scikit-learn | Data handling, metrics, class weights |
| Matplotlib | Spectrogram visualization |
| Streamlit | Web dashboard |
| Google Gemini 2.0 Flash | Image threat analysis (Vision AI) |
| Pillow | Image handling |
| python-dotenv | Local environment variable management |
| Git LFS | Large file storage for model weights |
Built for forest conservation and environmental protection. 🌿
If this project helped you, please ⭐ star the repository!