A complete, step-by-step walkthrough of how the audio classification model was built from scratch.
- Problem Definition
- Tech Stack
- Dataset Collection & Classes
- Data Cleaning & Preparation
- Feature Extraction — Mel Spectrograms
- Model Evolution
- CNN Architecture (Final — v2)
- Training Strategy
- Normalization & Saving Stats
- Evaluation & Results
- Streamlit Web App
- Deployment
Goal: Build an AI system that listens to forest sounds and automatically detects illegal human activity (chainsaw, gunshot, heavy machinery) vs. normal forest ambience.
Why AI? Manual forest monitoring is expensive and impractical at scale. An AI model can process audio in real-time and trigger alerts automatically.
| Tool | Purpose |
|---|---|
| Python 3.10+ | Core language |
| TensorFlow / Keras | Deep Learning framework |
| Librosa | Audio loading & Mel-Spectrogram extraction |
| NumPy / Scikit-learn | Data handling, normalization, metrics |
| Matplotlib | Visualization |
| Streamlit | Web dashboard / UI |
| tqdm | Progress bars during training |
Install all dependencies:
pip install -r requirements.txtThe model was trained on 4 core sound classes:
| Class | Description | Threat Level |
|---|---|---|
chainsaw |
Electric/petrol chainsaw sounds | 🔴 High |
gunshot |
Single or burst gunfire sounds | 🔴 High |
heavy_machine |
Bulldozers, excavators, trucks | 🔴 High |
normal |
Ambient forest sounds (birds, wind, rain) | 🟢 Safe |
Dataset Sources explored:
- ESC-50 Environmental Sound Dataset
- UrbanSound8K
- Custom-recorded / curated
.wavfiles - Augmented copies of minority classes to balance the dataset
Dataset expansion considered (for future versions):
Dangerous_Animals— predator callsWildlife_Large— elephant/rhino movementMistake_Mimicry— sounds that mimic threats but aren'tHuman_Activity— footsteps, voices (non-illegal)
All raw audio files were cleaned and standardized using a preprocessing script before training.
Standardization rules applied:
- Format: WAV only (MP3 / other formats converted using
librosaorffmpeg) - Duration: Exactly 3 seconds
- Files shorter than 3s → zero-padded at the end
- Files longer than 3s → trimmed from the start
- Sample Rate: 22,050 Hz (standard for audio ML)
- Channels: Mono (stereo files mixed down)
Result: All files in cleaned_data/ are uniform 3-second .wav mono files, verified by a validation script.
Raw waveforms are not fed into the CNN directly. Instead, each audio clip is converted into a Mel Spectrogram — a 2D image representation of sound.
Why Mel Spectrograms?
- They mimic how the human ear perceives frequency (logarithmic scale)
- They turn audio into an image → CNN can treat it like a visual pattern recognition problem
Parameters used:
SAMPLE_RATE = 22050 # Hz
DURATION = 3 # seconds
N_MELS = 64 # number of Mel filter banks (height of image)
MAX_LEN = 130 # fixed time-axis width (columns)Extraction pipeline:
import librosa
import numpy as np
def extract_mel(file_path):
audio, sr = librosa.load(file_path, sr=22050, duration=3)
# Pad or trim to exact 3 seconds
target_len = 22050 * 3
if len(audio) < target_len:
audio = np.pad(audio, (0, target_len - len(audio)))
else:
audio = audio[:target_len]
# Compute Mel Spectrogram and convert to dB scale
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)
# Fix width to 130 columns
if mel_db.shape[1] < 130:
mel_db = np.pad(mel_db, ((0,0),(0, 130 - mel_db.shape[1])))
else:
mel_db = mel_db[:, :130]
return mel_db # Shape: (64, 130)Each sample becomes a (64 × 130) grayscale image fed into the CNN.
The model went through several iterations to improve accuracy:
| Version | Approach | Accuracy |
|---|---|---|
| Baseline | Logistic Regression on raw MFCC features | ~60–65% |
| v1 (Traditional ML) | Decision Tree + Random Forest comparison | ~75–80% |
| v2 (CNN v1) | 3-block CNN + Flatten + Dense layers | ~84.9% |
| v2 (CNN final) | 4-block CNN + GlobalAveragePooling2D + AdamW + SpecAugment | 97.0% |
Key improvements in the final version:
- Replaced
FlattenwithGlobalAveragePooling2D→ reduced overfitting - Added a 4th convolutional block for deeper feature learning
- Applied SpecAugment (masking random time/frequency bands during training)
- Expanded data augmentation to the
heavy_machineclass as well - Used AdamW optimizer (Adam with weight decay) instead of plain Adam
- Used label smoothing (factor 0.1) in the loss function to prevent overconfidence
Input: (64, 130, 1) ← Mel Spectrogram as grayscale image
Conv2D(32, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
Conv2D(64, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
Conv2D(128, 3×3, ReLU) → BatchNorm → MaxPool(2×2) → Dropout(0.25)
Conv2D(256, 3×3, ReLU) → BatchNorm → GlobalAveragePooling2D
Dense(256, ReLU) → Dropout(0.5)
Dense(128, ReLU) → Dropout(0.3)
Dense(4, Softmax) ← Output: probability for each of 4 classes
Key design choices:
BatchNormalizationafter every conv layer → stable and faster trainingGlobalAveragePooling2Dinstead of Flatten → reduces parameters, prevents overfitting- Progressive filter doubling (32 → 64 → 128 → 256) → learns from simple to complex patterns
Dropoutat both conv and dense layers → strong regularization
Data Augmentation (applied on-the-fly during training):
- Time stretching (±10%)
- Pitch shifting (±2 semitones)
- Adding Gaussian noise
- SpecAugment: random frequency masking + time masking
Training configuration:
optimizer = AdamW(learning_rate=1e-3, weight_decay=1e-4)
loss = CategoricalCrossentropy(label_smoothing=0.1)
epochs = 50
batch_size = 32
val_split = 0.2 # 80% train / 20% validationCallbacks used:
EarlyStopping(patience=10, restore_best_weights=True)— stops if val_loss stops improvingReduceLROnPlateau(factor=0.5, patience=5)— halves LR when val_loss plateausModelCheckpoint— saves the best model weights automatically
Dataset balance: Classes with fewer samples were upsampled via augmentation to match the largest class (avoiding bias toward the majority class).
To ensure consistent input to the model at inference time, the training data mean and standard deviation are computed once and saved to norm_stats.npy.
# During training
mean = X_train.mean()
std = X_train.std()
np.save('norm_stats.npy', {'mean': mean, 'std': std})
# Normalization applied to train, val, and inference
X_normalized = (X - mean) / stdAt inference time (app.py), the same stats are loaded and applied before prediction — this is critical for model correctness.
The final model was evaluated on a held-out test set of original (non-augmented) forest audio files.
| Metric | Value |
|---|---|
| Overall Accuracy | 94.4.0% |
| Weighted F1-Score | 0.94.4 |
| Chainsaw Recall | 95% |
| Gunshot Recall | 95% |
| Heavy Machine Recall | 93% |
| Normal Forest Recall | 95% |
Confidence Threshold: A prediction is only accepted if the model's top class probability is ≥ 65%. Otherwise the audio is flagged as OTHER / AMBIENT to prevent false alarms from unknown sounds.
The trained model is served through a Streamlit web dashboard (app.py).
User flow:
- User uploads a
.wavfile - App extracts the Mel Spectrogram
- Spectrogram is normalized using saved
norm_stats.npy - CNN model predicts class probabilities
- Result displayed with:
- 🔴 Red alert for chainsaw / gunshot / heavy machine
- 🟢 Green for normal forest sounds
- ⚫ Gray for unknown / low-confidence sounds
- Full probability distribution shown as a progress bar chart
Run locally:
streamlit run app.pyThe app is deployed on Streamlit Community Cloud (free tier):
- Push the repository to a public GitHub repo
- Go to share.streamlit.io and log in with GitHub
- Select
app.pyas the entry point - Streamlit Cloud reads
requirements.txt(Python packages) andpackages.txt(Linux system libraries, e.g.libsndfile1for audio support) - App is live at a public URL — no server management needed
Key files for deployment:
| File | Purpose |
|---|---|
app.py |
Entry point for Streamlit |
forest_sound_model_v2.h5 |
Trained model weights |
norm_stats.npy |
Normalization statistics |
requirements.txt |
Python dependencies |
packages.txt |
Linux system dependencies |
Built for forest conservation and environmental protection. 🌿