Skip to content

ahmeda-42/Environmental-Audio-CNN-Classifier

Repository files navigation

Environmental Audio CNN Classifier

A full-stack ML project that detects and classifies environmental sounds using deep learning with ~71.9% accuracy. The system extracts log-mel spectrogram features from audio through signal processing and fourier transforms, runs a 2D convolutional neural network (CNN) for inference (PyTorch), and provides a React-based web interface for file upload and real-time visualization. Includes a FastAPI backend, real-time WebSocket streaming, and Docker support.

The model is trained on the UrbanSound8K dataset and is saved as artifacts/cnn.pt. UrbanSound8K contains 8,732 labeled environmental audio clips across 10 classes. Clips are up to 4 seconds long and are split into 10 predefined folds, with labels stored in a CSV containing file_path, label, and fold. Training uses folds 1–8, validation uses fold 9, and testing/evaluation uses fold 10, so evaluation happens on a held-out fold rather than random splits.

Live Demo

The full stack is deployed on Render. Try it here: https://environmental-audio-cnn-classifier-ce92.onrender.com

Note: Render can be slow and may struggle with large uploads or long audio clips. For a much faster, smoother experience (and to handle larger requests), run this project locally by following the steps in the "Try It Yourself!!" section.

Key Features

Audio classification pipeline:

  • Log-mel spectrogram feature extraction through signal processing and fourier transforms
  • Trains a 2D convolutional neural network (CNN) on spectrograms (PyTorch)
  • Class-balanced loss, learning-rate scheduling, and early stopping
  • Optional waveform augmentation (time-stretch, pitch-shift, noise, time shift)
  • Optional SpecAugment + RMS normalization to improve accuracy
  • 50% overlap windowing for long audio with averaged predictions

FastAPI backend:

  • REST endpoints for predictions and spectrograms
  • WebSocket streaming for real-time inference

React frontend:

  • File upload + microphone streaming
  • Live spectrogram visualization with axes and colorbar
  • Displays top predictions and confidence probabilities

Deployed:

  • Live Render full stack demo for quick access

Pytest test suite:

  • API tests (REST + WebSocket)
  • Model loading tests
  • Prediction logic and preprocessing tests

Dockerized:

  • One-command containerized deployment for API + frontend

Project Structure

Environmental-Audio-CNN-Classifier/
├── app/                         # FastAPI app + WebSocket handler
│   ├── main.py                  # API entry point + routes
│   ├── schemas.py               # Pydantic request/response models
│   └── websocket_handler.py     # Real-time streaming logic
├── model/                       # CNN, training, evaluation, prediction
│   ├── cnn.py                   # AudioCNN architecture
│   ├── dataset.py               # Dataset + SpecAugment
│   ├── train.py                 # Training script
│   ├── evaluate.py              # Evaluation script
│   ├── predict.py               # Inference + windowing
│   ├── load_model.py            # Model loading helper
│   └── try_predict.py           # Example prediction script
├── preprocessing/               # Audio features + utilities
│   ├── audio_features.py        # Load audio + compute spectrogram
│   ├── prepare_urbansound8k.py  # Build CSV from dataset
│   └── visualize_spectrogram.py # Spectrogram encoding + metadata
├── frontend/                    # React UI (Vite)
│   ├── src/
│   ├── Dockerfile
│   └── nginx.conf
├── tests/                       # Pytest suite
│   ├── test_api.py
│   ├── test_model.py
│   ├── test_predict.py
│   ├── test_preprocessing.py
│   └── test_websocket.py
├── artifacts/                   # Saved model + labels (not committed)
├── data/                        # Dataset + CSV (not committed)
├── config.py                    # Centralized settings
├── Dockerfile                   # Backend container
├── docker-compose.yml           # Run API + frontend
├── requirements.txt
└── README.md

Tech Stack

Machine Learning / Audio Processing

  • PyTorch (CNN training + inference)
  • Librosa + SoundFile (audio loading, log-mel spectrograms)
  • NumPy + Pandas (data handling, fourier transforms)
  • scikit-learn (evaluation metrics)

Backend

  • FastAPI
  • Pydantic

Deployment

  • Render (frontend + backend hosting)
  • Uvicorn
  • WebSockets

Frontend

  • React
  • Vite
  • CSS

DevOps / Tooling

  • Docker
  • Nginx
  • Pytest
  • HTTPX

Metrics

From the current evaluation run on the held-out test fold (fold 10):

Classification Report:                                                                                              
                  precision    recall  f1-score   support

 air_conditioner      0.656     0.800     0.721       100
        car_horn      0.800     0.727     0.762        33
children_playing      0.629     0.830     0.716       100
        dog_bark      0.977     0.420     0.587       100
        drilling      0.954     0.620     0.752       100
   engine_idling      0.618     0.505     0.556        93
        gun_shot      0.653     1.000     0.790        32
      jackhammer      0.667     0.833     0.741        96
           siren      0.692     0.892     0.779        83
    street_music      0.839     0.780     0.808       100

        accuracy                          0.719       837
       macro avg      0.748     0.741     0.721       837
    weighted avg      0.755     0.719     0.712       837


Confusion Matrix:
[[80  0 14  0  0  0  0  4  2  0]
 [ 1 24  0  0  0  1  0  4  0  3]
 [ 6  1 83  0  1  3  0  0  2  4]
 [ 4  0 18 42  1  4 11  5  7  8]
 [ 9  0  3  1 62  3  5  7 10  0]
 [17  0  2  0  0 47  0 20  7  0]
 [ 0  0  0  0  0  0 32  0  0  0]
 [ 4  0  0  0  0 12  0 80  0  0]
 [ 0  0  2  0  0  6  1  0 74  0]
 [ 1  5 10  0  1  0  0  0  5 78]]

Note: You can evaluate it yourself by running python model/evaluate.py (look at the steps below)

Potential use cases:

  • Smart city monitoring (detecting sirens, construction noise, traffic sounds)
  • Assistive technologies for accessibility (alerting users to critical sounds)
  • Edge AI / IoT sound detection systems (home security, wildlife monitoring)
  • Inputs to autonomous vehicles and robots (secondary to LiDAR/Cameras)

Try It Yourself!!

1. Prepare the dataset (UrbanSound8K)

Download and extract UrbanSound8K to data/UrbanSound8K from https://urbansounddataset.weebly.com/download-urbansound8k.html

2. Setup virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

3. Train the model

python model/train.py

Trains with folds 1–8, validates on fold 9, and saves:

  • artifacts/cnn.pt
  • artifacts/cnn.pt.labels.json

4. Evaluate the model

python model/evaluate.py

Evaluates on fold 10 and prints the classification report and confusion matrix.

5. Run the FastAPI backend

export AUDIO_LOADER=librosa
uvicorn app.main:app --reload

Base URL: http://127.0.0.1:8000

Endpoints:

  • GET /health
  • GET /labels
  • GET /config
  • POST /predict (multipart file upload)
  • POST /spectrogram (multipart file upload)
  • WS /ws/predict (float32 PCM streaming)

Note: Instead of running steps 5-6 locally, you can run both services via the Docker section below.

6. Run the React frontend

cd frontend
npm install
npm run dev

Default API base: http://127.0.0.1:8000
Override with: VITE_API_BASE

Note: Instead of running steps 5-6 locally, you can run both services via the Docker section below.

Docker

docker compose up --build
  • Frontend: http://localhost:5173
  • API: http://localhost:8000

Testing

Run the full test suite:

pytest

Tests cover:

  • REST endpoints and responses
  • Model loading
  • Prediction windowing logic
  • Preprocessing + spectrogram metadata
  • WebSocket streaming

Key Configuration

Edit config.py to change:

  • Audio: SAMPLE_RATE, DURATION, N_MELS, N_FFT, HOP_LENGTH
  • Streaming: STREAM_DURATION, STREAM_N_MELS
  • Normalization: RMS_NORMALIZE, RMS_TARGET
  • Training: BATCH_SIZE, LEARNING_RATE, WEIGHT_DECAY, EPOCHS, SEED
  • Scheduler/Early stop: SCHEDULER_PATIENCE, SCHEDULER_FACTOR, EARLY_STOPPING_PATIENCE
  • Class balance: USE_CLASS_WEIGHTS
  • SpecAugment: SPEC_AUGMENT, SPEC_AUGMENT_STRENGTH, TIME_MASK_PARAM, FREQ_MASK_PARAM, NUM_TIME_MASKS, NUM_FREQ_MASKS
  • Waveform aug: AUG_TIME_STRETCH, TIME_STRETCH_RANGE, AUG_PITCH_SHIFT, PITCH_SHIFT_STEPS, AUG_NOISE, NOISE_STD, AUG_TIME_SHIFT, TIME_SHIFT_MAX_FRACTION

Environment Variables

  • ALLOWED_ORIGINS (comma-separated): override CORS allowlist
  • MAX_UPLOAD_MB: max upload size (default: 50)
  • AUDIO_LOADER: ffmpeg (default), soundfile, or librosa
  • FFMPEG_TIMEOUT_SECONDS: ffmpeg decode timeout (default: 20)

Predict API Options

  • reduce_payload=true skips spectrogram payload to speed up long requests

About

Environmental Audio CNN Classifier

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors