ishwarc04/Audio_Classification_Model

🌳 Forest Guardian AI

Real-time Forest Sound & Image Threat Detection


An AI-powered forest monitoring system that detects signs of illegal activity (chainsaw sounds, gunshots, and heavy machinery) in real time, using deep learning audio classification and Gemini-powered visual analysis.


📋 Table of Contents

  1. Problem Statement
  2. System Architecture
  3. Dataset
  4. Classes Predicted
  5. Features Used
  6. Algorithms & Model Evolution
  7. CNN Architecture (Final)
  8. Training Strategy
  9. Results & Accuracy
  10. Project Structure
  11. How to Run
  12. Streamlit Web App
  13. Deployment
  14. Tech Stack

🚨 Problem Statement

Illegal deforestation, poaching, and unlawful forest activity cause irreversible environmental damage. Manual forest patrols are:

  • Expensive: they require large teams across vast areas
  • Reactive: threats are detected too late
  • Impractical at scale: forests can span thousands of kilometres

Forest Guardian AI solves this by providing an automated, AI-powered monitoring system that:

  • πŸŽ™οΈ Listens to forest audio and flags chainsaw sounds, gunshots, and heavy machinery
  • πŸ–ΌοΈ Analyzes camera trap images for visual signs of illegal activity
  • ⚑ Works in real-time with a confidence threshold to reduce false alarms

πŸ— System Architecture

┌──────────────────────────────────────────────────────────┐
│                  Forest Guardian AI                      │
│                    (Streamlit App)                       │
└──────────────────────┬───────────────────────────────────┘
                       │
          ┌────────────┴────────────┐
          │                         │
          ▼                         ▼
┌─────────────────┐       ┌───────────────────────┐
│  AUDIO MODULE   │       │    IMAGE MODULE       │
│ audio_model/    │       │   image_model/        │
│                 │       │                       │
│ 1. Upload .wav  │       │ 1. Upload image       │
│ 2. Extract Mel  │       │ 2. Send to Gemini API │
│    Spectrogram  │       │ 3. Parse JSON threat  │
│ 3. Normalize    │       │    assessment         │
│ 4. CNN Predict  │       │ 4. Display result     │
│ 5. Show result  │       └───────────────────────┘
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  CNN MODEL      │    Input: (64 × 130 × 1) Mel Spectrogram
│  4 Conv Blocks  │    Output: 4-class softmax
│  ~4.5M params   │    Threshold: ≥ 65% confidence
└─────────────────┘

Data Flow (Audio):

WAV File → Librosa Load → Mel Spectrogram (64×130) → Normalize → CNN → Softmax Probabilities → Prediction

Data Flow (Image):

Image Upload → PIL → Gemini 2.0 Flash API → JSON Response → Threat / Safe UI

📦 Dataset

| Property | Detail |
| --- | --- |
| Format | WAV, mono, 22,050 Hz, 3 seconds |
| Total classes | 4 core + 4 extended (future) |
| Balancing | Augmentation to equalise minority classes |

Sources explored:

Data Cleaning Rules:

  • Format: WAV only (MP3 converted via ffmpeg / librosa)
  • Duration: exactly 3 seconds (padded or trimmed)
  • Sample rate: 22,050 Hz
  • Channels: Mono (stereo mixed down)
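
The cleaning rules above can be sketched in plain Python. This is a minimal illustration, not the project's own preprocessing code: the helper names `to_mono` and `fix_length` are hypothetical, and padding with zeros at the end of the clip is an assumption (the README only says clips are "padded or trimmed").

```python
SAMPLE_RATE = 22050  # Hz, as stated in the cleaning rules
DURATION = 3         # seconds


def to_mono(frames):
    """Average the channels of each frame, e.g. [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def fix_length(samples, sample_rate=SAMPLE_RATE, duration=DURATION):
    """Trim or zero-pad a mono signal to exactly sample_rate * duration samples."""
    target = sample_rate * duration
    if len(samples) >= target:
        return samples[:target]
    # Assumption: short clips are padded with silence at the end.
    return samples + [0.0] * (target - len(samples))
```

Every clip then has 22,050 × 3 = 66,150 samples, which is what lets all spectrograms share one fixed shape.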

🎯 Classes Predicted

Core Classes (trained)

| Class | Description | Threat Level |
| --- | --- | --- |
| 🪚 chainsaw | Electric / petrol chainsaw sounds | 🔴 HIGH |
| 🔫 gunshot | Single or burst gunfire | 🔴 HIGH |
| 🚜 heavy_machine | Bulldozers, excavators, trucks | 🔴 HIGH |
| 🍃 normal | Ambient forest: birds, wind, rain | 🟢 SAFE |

Predictions below 65% confidence are flagged as OTHER / AMBIENT to prevent false alarms.

Extended Classes (dataset collected, future training)

| Class | Description |
| --- | --- |
| dangerous_animals | Predator / dangerous wildlife calls |
| wildlife_large | Elephant, rhino, large animal movement |
| mistake_mimicry | Sounds that mimic threats but are not |
| human_activity | Footsteps, voices (non-illegal) |

🔬 Features Used

Primary: Mel Spectrogram

Raw audio waveforms are converted into Mel Spectrograms: 2D image-like representations of sound whose frequency axis follows the Mel scale, which approximates how the human ear perceives pitch (roughly logarithmic at higher frequencies). This lets the CNN treat audio classification as a visual pattern recognition problem.
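
To make the Mel scale concrete, here is a small sketch of the widely used HTK-style conversion formula. It is for illustration only: Librosa's default conversion uses the Slaney variant, which differs slightly, and the function names below are not taken from this project.

```python
import math


def hz_to_mel(frequency_hz):
    """HTK-style Mel conversion: near-linear at low frequencies, logarithmic above."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 1 kHz step covers far fewer Mels at high frequencies than at low ones.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(9000) - hz_to_mel(8000)
print(round(low_step), round(high_step))
```

Because equal steps in Mel correspond to wider and wider bands in Hz, the 64 Mel filters spend most of their resolution on the lower frequencies.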

SAMPLE_RATE = 22050   # Hz
DURATION    = 3       # seconds
N_MELS      = 64      # Mel filter banks  → image height
MAX_LEN     = 130     # time-axis columns → image width

Each audio clip becomes a (64 × 130) grayscale image fed into the CNN.
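
The width of 130 columns is consistent with a hop length of 512 samples, which is Librosa's default. The README does not state the hop length, so treat `HOP_LENGTH` below as an assumption; the arithmetic also assumes centred framing, where the frame count is `1 + n_samples // hop_length`.

```python
SAMPLE_RATE = 22050
DURATION = 3
HOP_LENGTH = 512  # assumption: Librosa's default, not confirmed by the README

n_samples = SAMPLE_RATE * DURATION       # 66150 samples per clip
n_frames = 1 + n_samples // HOP_LENGTH   # 1 + 129 = 130 time-axis columns
print(n_samples, n_frames)               # 66150 130
```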

Why Mel Spectrograms over raw MFCCs?

| Feature | MFCC | Mel Spectrogram |
| --- | --- | --- |
| Information retained | Compressed | Full spectral detail |
| CNN suitability | Moderate | ✅ Excellent (image-like) |
| Noise robustness | Good | Very good |
| Used in final model | Baseline only | ✅ Yes |

Data Augmentation (applied during training)

| Technique | Details |
| --- | --- |
| Time stretching | ±10% speed change |
| Pitch shifting | ±2 semitones |
| Gaussian noise | Small random noise injection |
| Time shift | Circular roll of waveform ±20% |
| SpecAugment | Random frequency + time band masking |
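
Two of these augmentations are simple enough to sketch without any audio library. The code below is an illustration under stated assumptions, not the training pipeline itself: the function names and the mask widths are hypothetical, and masked cells are filled with zero (a common choice for normalized spectrograms).

```python
import random


def circular_shift(samples, fraction):
    """Roll a waveform by a fraction of its length (positive = shift right)."""
    if not samples:
        return samples
    offset = int(len(samples) * fraction) % len(samples)
    return samples[-offset:] + samples[:-offset] if offset else list(samples)


def mask_bands(spec, max_freq_width, max_time_width, rng=random):
    """SpecAugment-style masking: zero one random band of rows and one of columns."""
    n_mels, n_frames = len(spec), len(spec[0])
    masked = [row[:] for row in spec]             # leave the input untouched
    f_width = rng.randint(0, max_freq_width)      # height of the frequency mask
    f_start = rng.randint(0, n_mels - f_width)
    t_width = rng.randint(0, max_time_width)      # width of the time mask
    t_start = rng.randint(0, n_frames - t_width)
    for i in range(n_mels):
        for j in range(n_frames):
            if f_start <= i < f_start + f_width or t_start <= j < t_start + t_width:
                masked[i][j] = 0.0
    return masked
```

For the ±20% time shift in the table, one would call something like `circular_shift(samples, random.uniform(-0.2, 0.2))` once per training clip.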

🤖 Algorithms & Model Evolution

The model went through 4 major iterations:

| Version | Algorithm | Feature | Accuracy |
| --- | --- | --- | --- |
| Baseline | Logistic Regression | Raw MFCC | ~62% |
| v1 | Decision Tree | MFCC | ~72% |
| v1 | Random Forest | MFCC | ~78% |
| v2 (CNN v1) | 3-block CNN + Flatten | Mel Spectrogram | ~84.9% |
| v2 (Final) | 4-block CNN + GAP + AdamW + SpecAugment | Mel Spectrogram | 94.4% |

Key Improvements in Final CNN

| Change | Impact |
| --- | --- |
| Flatten → GlobalAveragePooling2D | Reduced overfitting significantly |
| Added 4th Conv block (256 filters) | Deeper feature learning |
| SpecAugment during training | Better generalisation to real-world noise |
| AdamW optimizer (weight decay) | Prevented weight explosion |
| Label smoothing (0.1) | Prevented overconfident predictions |
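
As a worked example of label smoothing with a factor of 0.1 over 4 classes: each one-hot target is scaled by 0.9, and the remaining 0.1 is spread evenly across all classes. This mirrors the formula Keras applies for `label_smoothing`; the helper name is illustrative.

```python
def smooth_labels(one_hot, smoothing=0.1):
    """Blend a one-hot target with a uniform distribution over the classes."""
    n_classes = len(one_hot)
    return [y * (1.0 - smoothing) + smoothing / n_classes for y in one_hot]


# A clip labelled as class index 2 becomes roughly [0.025, 0.025, 0.925, 0.025].
print(smooth_labels([0, 0, 1, 0]))
```

The target for the true class never reaches 1.0, so the network is not rewarded for pushing its softmax output to extreme values.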

🧠 CNN Architecture (Final)

Input: (64, 130, 1)  ←  Mel Spectrogram as grayscale image
│
├── Conv2D(32, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
│
├── Conv2D(64, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
│
├── Conv2D(128, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
│
├── Conv2D(256, 3×3, ReLU) → BatchNorm → GlobalAveragePooling2D
│
├── Dense(256, ReLU) → Dropout(0.5)
├── Dense(128, ReLU) → Dropout(0.3)
│
└── Dense(4, Softmax)  ←  Output: probability per class

Model stats:

  • Total parameters: ~4.5 million
  • Model size: ~52 MB (.h5 format)
  • Input shape: (64, 130, 1)
  • Output: 4-class softmax

βš™οΈ Training Strategy

optimizer  = AdamW(learning_rate=1e-3, weight_decay=1e-4)
loss       = CategoricalCrossentropy(label_smoothing=0.1)
epochs     = 50          # with early stopping
batch_size = 32
val_split  = 0.20        # 80% train / 20% validation

Callbacks:

| Callback | Config |
| --- | --- |
| EarlyStopping | patience=10, restore best weights |
| ReduceLROnPlateau | factor=0.5, patience=5, min_lr=1e-6 |
| ModelCheckpoint | saves best val_accuracy model |
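
To show how the two patience values interact, here is a simplified simulation of the callbacks in plain Python. It is a sketch and not Keras itself: it assumes both callbacks monitor validation loss (the README does not say which metric they watch), and it ignores options such as `min_delta` and `cooldown`. The function name is hypothetical.

```python
def simulate_callbacks(val_losses, lr=1e-3, stop_patience=10,
                       lr_patience=5, factor=0.5, min_lr=1e-6):
    """Return (stop_epoch, best_epoch, final_lr) for a list of per-epoch val losses."""
    best, best_epoch = float("inf"), -1
    since_best = since_lr_change = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            since_best = since_lr_change = 0
        else:
            since_best += 1
            since_lr_change += 1
        if since_lr_change >= lr_patience:   # plateau: halve the learning rate
            lr = max(lr * factor, min_lr)
            since_lr_change = 0
        if since_best >= stop_patience:      # no improvement for too long: stop
            return epoch, best_epoch, lr
    return len(val_losses) - 1, best_epoch, lr


# Loss improves until epoch 2, then plateaus: two LR halvings, then early stop.
print(simulate_callbacks([1.0, 0.8, 0.7] + [0.75] * 12))  # (12, 2, 0.00025)
```

With these settings the learning rate is halved twice before training stops, and "restore best weights" would roll the model back to epoch 2 in this toy run.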

📊 Results & Accuracy

Evaluated on a held-out test set of original (non-augmented) forest audio files:

| Metric | Value |
| --- | --- |
| Overall Accuracy | 94.4% |
| Weighted F1-Score | 0.944 |
| Chainsaw Recall | 95% |
| Gunshot Recall | 95% |
| Heavy Machine Recall | 93% |
| Normal Forest Recall | 95% |
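
For readers unfamiliar with these metrics, the sketch below shows how per-class recall and weighted F1 are derived from a confusion matrix. The matrix in the example is made up purely for illustration and is not this project's data; the function names are hypothetical.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of clips of true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        support = sum(confusion[k])                         # clips truly in class k
        predicted = sum(confusion[i][k] for i in range(n))  # clips predicted as k
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((recall, f1, support))
    return metrics


def weighted_f1(confusion):
    """Average of per-class F1 scores, weighted by each class's number of true clips."""
    metrics = per_class_metrics(confusion)
    total = sum(support for _, _, support in metrics)
    return sum(f1 * support for _, f1, support in metrics) / total


toy = [[9, 1],   # made-up counts for illustration, NOT the project's results
       [2, 8]]
print(per_class_metrics(toy)[0][0], round(weighted_f1(toy), 3))
```

Recall is the metric to watch for the threat classes, since a missed chainsaw or gunshot costs more than a false alarm.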

Confidence Threshold

Predictions below 65% confidence are shown as OTHER / AMBIENT, which keeps the model from raising confident-looking alerts on sounds it was not trained on.

| Confidence | Label shown | Colour |
| --- | --- | --- |
| ≥ 65% + threat class | Class name (e.g. CHAINSAW) | 🔴 Red |
| ≥ 65% + normal | NORMAL | 🟢 Green |
| < 65% | OTHER / AMBIENT | ⚫ Grey |
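
The thresholding rule can be sketched as a small function. This is an illustration of the logic in the table, not the code in `classifier.py`: the function name is hypothetical, and the order of `CLASSES` is an assumption (it must match whatever order the trained model's outputs use).

```python
CLASSES = ["chainsaw", "gunshot", "heavy_machine", "normal"]  # assumed output order
THRESHOLD = 0.65


def label_prediction(probabilities, threshold=THRESHOLD):
    """Map softmax probabilities to a (label, colour) pair."""
    confidence = max(probabilities)
    if confidence < threshold:
        return "OTHER / AMBIENT", "grey"
    name = CLASSES[probabilities.index(confidence)]
    return name.upper(), ("green" if name == "normal" else "red")


print(label_prediction([0.90, 0.05, 0.03, 0.02]))  # ('CHAINSAW', 'red')
print(label_prediction([0.40, 0.30, 0.20, 0.10]))  # ('OTHER / AMBIENT', 'grey')
```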

πŸ“ Project Structure

Audio_Classification_Model/
β”‚
β”œβ”€β”€ audio_model/                  # CNN audio classifier
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ classifier.py             # extract_mel(), predict(), constants
β”‚   β”œβ”€β”€ forest_sound_model_v2.h5  # trained model weights (~52 MB, Git LFS)
β”‚   └── norm_stats.npy            # mean & std from training data
β”‚
β”œβ”€β”€ image_model/                  # Gemini visual analyzer
β”‚   β”œβ”€β”€ __init__.py
β”‚   └── analyzer.py               # analyze_image() via Gemini API
β”‚
β”œβ”€β”€ app.py                        # Streamlit entry-point (UI only)
β”œβ”€β”€ requirements.txt              # Python dependencies
β”œβ”€β”€ packages.txt                  # Linux system libs (Streamlit Cloud)
β”œβ”€β”€ .gitignore                    # excludes .env, __pycache__, etc.
β”œβ”€β”€ .gitattributes                # Git LFS config for model files
β”œβ”€β”€ model_summary.md              # Full technical walkthrough
└── README.md

🚀 How to Run

Prerequisites

1. Clone the repository

git clone https://github.com/ishwarc04/Audio_Classification_Model.git
cd Audio_Classification_Model

2. Install dependencies

pip install -r requirements.txt

3. Set your API key

Create a .env file in the project root:

GOOGLE_API_KEY=your_gemini_api_key_here
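
A minimal sketch of reading the key at startup is shown below. It assumes the variable has already been loaded into the environment (for example by calling `load_dotenv()` from python-dotenv, which is listed in the tech stack); the function name `get_api_key` is hypothetical and not taken from `analyzer.py`.

```python
import os


def get_api_key(env=os.environ):
    """Return the Gemini API key, failing early with a clear message if it is missing."""
    # With python-dotenv installed, call load_dotenv() before this so .env is read.
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing early like this gives a clearer error than letting the first image request fail with an authentication problem.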

4. Run the app

streamlit run app.py

Open http://localhost:8501 in your browser.


🖥️ Streamlit Web App

The app has two tabs:

πŸŽ™οΈ Audio Monitoring Tab

  1. Upload a .wav file
  2. App extracts the Mel Spectrogram
  3. Spectrogram is normalized using saved norm_stats.npy
  4. CNN predicts class probabilities
  5. Results shown with colour-coded alert + full probability bars
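
The normalization in step 3 can be sketched as follows, assuming `norm_stats.npy` stores a single global mean and standard deviation and that the app applies a standard z-score. Both points are assumptions: the README only says the file holds "mean & std from training data". Plain lists are used here so the example runs without NumPy, and the function names are illustrative.

```python
import math


def compute_stats(spectrograms):
    """Global mean and standard deviation over every cell of the training spectrograms."""
    values = [v for spec in spectrograms for row in spec for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std


def normalize(spec, mean, std, eps=1e-8):
    """Z-score a spectrogram using statistics saved at training time."""
    return [[(v - mean) / (std + eps) for v in row] for row in spec]
```

Reusing the training-time statistics at inference matters: recomputing them per uploaded clip would feed the CNN inputs on a different scale from the one it was trained on.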

πŸ–ΌοΈ Image Monitoring Tab

  1. Upload a forest image (.jpg, .png)
  2. Image is sent to Gemini 2.0 Flash
  3. AI returns a structured JSON threat assessment
  4. Results shown with confidence score + threat/safe status
  5. Adjustable confidence threshold slider
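
Steps 3 to 5 might look roughly like the sketch below. The README does not document the JSON schema, so the field names `threat` and `confidence`, the function name, and the default threshold are all hypothetical. The parser tolerates extra text around the JSON object, since language models sometimes wrap their output.

```python
import json


def parse_assessment(response_text, threshold=0.5):
    """Extract the first JSON object from a model reply and apply the slider threshold."""
    start, end = response_text.find("{"), response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(response_text[start:end + 1])
    confidence = float(data.get("confidence", 0.0))        # hypothetical field name
    is_threat = bool(data.get("threat", False)) and confidence >= threshold
    return {"status": "THREAT" if is_threat else "SAFE", "confidence": confidence}


reply = 'Assessment: {"threat": true, "confidence": 0.9}'
print(parse_assessment(reply))                   # {'status': 'THREAT', 'confidence': 0.9}
print(parse_assessment(reply, threshold=0.95))   # {'status': 'SAFE', 'confidence': 0.9}
```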

☁️ Deployment

Deployed on Streamlit Community Cloud (free tier).

Steps to deploy your own

  1. Push repo to a public GitHub repository
  2. Go to share.streamlit.io and log in with GitHub
  3. Select app.py as the entry point
  4. Under App Settings → Secrets, add:
    GOOGLE_API_KEY = "your_key_here"
  5. App is live at a public URL, with no server management needed

Key deployment files:

| File | Purpose |
| --- | --- |
| app.py | Streamlit entry point |
| audio_model/forest_sound_model_v2.h5 | Trained model (Git LFS) |
| audio_model/norm_stats.npy | Normalization stats (Git LFS) |
| requirements.txt | Python packages |
| packages.txt | Linux libs (libsndfile1 for audio) |

🛠 Tech Stack

| Tool | Purpose |
| --- | --- |
| Python 3.10+ | Core language |
| TensorFlow / Keras | CNN training & inference |
| Librosa | Audio loading & Mel Spectrogram extraction |
| NumPy / Scikit-learn | Data handling, metrics, class weights |
| Matplotlib | Spectrogram visualization |
| Streamlit | Web dashboard |
| Google Gemini 2.0 Flash | Image threat analysis (Vision AI) |
| Pillow | Image handling |
| python-dotenv | Local environment variable management |
| Git LFS | Large file storage for model weights |

Built for forest conservation and environmental protection. 🌿

If this project helped you, please ⭐ star the repository!
