Environmental Audio CNN Classifier

A full-stack ML project that detects and classifies environmental sounds with deep learning, reaching about 71.9% accuracy on the held-out test fold. The system extracts log-mel spectrogram features from audio using Fourier-transform-based signal processing, runs a 2D convolutional neural network (CNN) in PyTorch for inference, and provides a React-based web interface for file upload and real-time visualization. It also includes a FastAPI backend, real-time WebSocket streaming, and Docker support.

The model is trained on the UrbanSound8K dataset and is saved as artifacts/cnn.pt. UrbanSound8K contains 8,732 labeled environmental audio clips across 10 classes. Clips are up to 4 seconds long and are split into 10 predefined folds, with labels stored in a CSV containing file_path, label, and fold. Training uses folds 1–8, validation uses fold 9, and testing/evaluation uses fold 10, so evaluation happens on a held-out fold rather than random splits.
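
As a rough illustration of the fold-based split described above, the sketch below partitions CSV rows by their fold column. The column names follow the description above; the sample rows and the split_by_fold helper are hypothetical and are not taken from the project's code.

```python
import csv
import io

TRAIN_FOLDS = set(range(1, 9))  # folds 1-8
VAL_FOLD = 9
TEST_FOLD = 10


def split_by_fold(rows):
    """Partition metadata rows into train/val/test lists using the fold column."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        fold = int(row["fold"])
        if fold in TRAIN_FOLDS:
            splits["train"].append(row)
        elif fold == VAL_FOLD:
            splits["val"].append(row)
        elif fold == TEST_FOLD:
            splits["test"].append(row)
    return splits


# Made-up rows for illustration only; real paths come from the prepared CSV.
SAMPLE_CSV = """file_path,label,fold
fold1/a.wav,dog_bark,1
fold8/b.wav,siren,8
fold9/c.wav,drilling,9
fold10/d.wav,car_horn,10
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
splits = split_by_fold(rows)
print({name: len(items) for name, items in splits.items()})
```

Splitting on the predefined folds keeps every clip in fold 10 out of training, which is what makes the reported test accuracy a held-out number.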

Live Demo

The full stack is deployed on Render. Try it here: https://environmental-audio-cnn-classifier-ce92.onrender.com

Note: Render can be slow and may struggle with large uploads or long audio clips. For a much faster, smoother experience (and to handle larger requests), run this project locally by following the steps in the "Try It Yourself!!" section.

Key Features

Audio classification pipeline:

Log-mel spectrogram feature extraction through signal processing and Fourier transforms
2D convolutional neural network (CNN) trained on spectrograms (PyTorch)
Class-balanced loss, learning-rate scheduling, and early stopping
Optional waveform augmentation (time-stretch, pitch-shift, noise, time shift)
Optional SpecAugment + RMS normalization to improve accuracy
50% overlap windowing for long audio with averaged predictions
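
The 50% overlap windowing with averaged predictions can be sketched roughly as follows. This is a minimal NumPy illustration, not the code in model/predict.py: the window_starts and predict_long helpers, the zero-padding of the final window, and the toy_classifier stand-in for the CNN are all assumptions.

```python
import numpy as np


def window_starts(num_samples, window, hop):
    """Start indices of fixed-size windows that cover the whole signal."""
    if num_samples <= window:
        return [0]
    starts = list(range(0, num_samples - window + 1, hop))
    # Add one more window if the tail is not covered; it is zero-padded later.
    if starts[-1] + window < num_samples:
        starts.append(starts[-1] + hop)
    return starts


def predict_long(audio, window, classify_window):
    """Average per-window class probabilities over 50%-overlapping windows."""
    hop = max(1, window // 2)  # 50% overlap
    probs = []
    for start in window_starts(len(audio), window, hop):
        chunk = audio[start:start + window]
        if len(chunk) < window:
            chunk = np.pad(chunk, (0, window - len(chunk)))
        probs.append(classify_window(chunk))
    return np.mean(probs, axis=0)


def toy_classifier(chunk):
    """Stand-in for the CNN: returns a fake 2-class probability vector."""
    p = float(np.mean(chunk))
    return np.array([p, 1.0 - p])


audio = np.ones(11, dtype=np.float32)
avg = predict_long(audio, window=4, classify_window=toy_classifier)
print(avg)
```

Averaging probabilities rather than taking a majority vote lets a confident window outweigh several uncertain ones, although either choice is defensible.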

FastAPI backend:

REST endpoints for predictions and spectrograms
WebSocket streaming for real-time inference

React frontend:

File upload + microphone streaming
Live spectrogram visualization with axes and colorbar
Displays top predictions and confidence probabilities

Deployed:

Live full-stack demo on Render for quick access

Pytest test suite:

API tests (REST + WebSocket)
Model loading tests
Prediction logic and preprocessing tests

Dockerized:

One-command containerized deployment for API + frontend

Project Structure

Environmental-Audio-CNN-Classifier/
├── app/                         # FastAPI app + WebSocket handler
│   ├── main.py                  # API entry point + routes
│   ├── schemas.py               # Pydantic request/response models
│   └── websocket_handler.py     # Real-time streaming logic
├── model/                       # CNN, training, evaluation, prediction
│   ├── cnn.py                   # AudioCNN architecture
│   ├── dataset.py               # Dataset + SpecAugment
│   ├── train.py                 # Training script
│   ├── evaluate.py              # Evaluation script
│   ├── predict.py               # Inference + windowing
│   ├── load_model.py            # Model loading helper
│   └── try_predict.py           # Example prediction script
├── preprocessing/               # Audio features + utilities
│   ├── audio_features.py        # Load audio + compute spectrogram
│   ├── prepare_urbansound8k.py  # Build CSV from dataset
│   └── visualize_spectrogram.py # Spectrogram encoding + metadata
├── frontend/                    # React UI (Vite)
│   ├── src/
│   ├── Dockerfile
│   └── nginx.conf
├── tests/                       # Pytest suite
│   ├── test_api.py
│   ├── test_model.py
│   ├── test_predict.py
│   ├── test_preprocessing.py
│   └── test_websocket.py
├── artifacts/                   # Saved model + labels (not committed)
├── data/                        # Dataset + CSV (not committed)
├── config.py                    # Centralized settings
├── Dockerfile                   # Backend container
├── docker-compose.yml           # Run API + frontend
├── requirements.txt
└── README.md

Tech Stack

Machine Learning / Audio Processing

PyTorch (CNN training + inference)
Librosa + SoundFile (audio loading, log-mel spectrograms)
NumPy + pandas (data handling, Fourier transforms)
scikit-learn (evaluation metrics)

Backend

FastAPI
Pydantic

Deployment

Render (frontend + backend hosting)
Uvicorn
WebSockets

Frontend

React
Vite
CSS

DevOps / Tooling

Docker
Nginx
Pytest
HTTPX

Metrics

From the current evaluation run on the held-out test fold (fold 10):

Classification Report:
                  precision    recall  f1-score   support

 air_conditioner      0.656     0.800     0.721       100
        car_horn      0.800     0.727     0.762        33
children_playing      0.629     0.830     0.716       100
        dog_bark      0.977     0.420     0.587       100
        drilling      0.954     0.620     0.752       100
   engine_idling      0.618     0.505     0.556        93
        gun_shot      0.653     1.000     0.790        32
      jackhammer      0.667     0.833     0.741        96
           siren      0.692     0.892     0.779        83
    street_music      0.839     0.780     0.808       100

        accuracy                          0.719       837
       macro avg      0.748     0.741     0.721       837
    weighted avg      0.755     0.719     0.712       837


Confusion Matrix:
[[80  0 14  0  0  0  0  4  2  0]
 [ 1 24  0  0  0  1  0  4  0  3]
 [ 6  1 83  0  1  3  0  0  2  4]
 [ 4  0 18 42  1  4 11  5  7  8]
 [ 9  0  3  1 62  3  5  7 10  0]
 [17  0  2  0  0 47  0 20  7  0]
 [ 0  0  0  0  0  0 32  0  0  0]
 [ 4  0  0  0  0 12  0 80  0  0]
 [ 0  0  2  0  0  6  1  0 74  0]
 [ 1  5 10  0  1  0  0  0  5 78]]
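
As a cross-check, the headline figures in the report can be recomputed from the confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes (the scikit-learn convention) and that the class order matches the report; the row sums agree with the support column, which is consistent with that assumption.

```python
LABELS = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]

# Confusion matrix copied from the evaluation output above.
CONFUSION = [
    [80, 0, 14, 0, 0, 0, 0, 4, 2, 0],
    [1, 24, 0, 0, 0, 1, 0, 4, 0, 3],
    [6, 1, 83, 0, 1, 3, 0, 0, 2, 4],
    [4, 0, 18, 42, 1, 4, 11, 5, 7, 8],
    [9, 0, 3, 1, 62, 3, 5, 7, 10, 0],
    [17, 0, 2, 0, 0, 47, 0, 20, 7, 0],
    [0, 0, 0, 0, 0, 0, 32, 0, 0, 0],
    [4, 0, 0, 0, 0, 12, 0, 80, 0, 0],
    [0, 0, 2, 0, 0, 6, 1, 0, 74, 0],
    [1, 5, 10, 0, 1, 0, 0, 0, 5, 78],
]


def recall(i):
    """Correct predictions for class i divided by all true samples of class i."""
    return CONFUSION[i][i] / sum(CONFUSION[i])


def precision(i):
    """Correct predictions for class i divided by all predictions of class i."""
    return CONFUSION[i][i] / sum(row[i] for row in CONFUSION)


total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
accuracy = correct / total
print(f"accuracy = {correct}/{total} = {accuracy:.3f}")  # 602/837 = 0.719

dog = LABELS.index("dog_bark")
print(f"dog_bark precision = {precision(dog):.3f}, recall = {recall(dog):.3f}")

# Most frequent wrong prediction for true dog_bark clips
wrong = [(count, LABELS[j]) for j, count in enumerate(CONFUSION[dog]) if j != dog]
print(max(wrong))  # (18, 'children_playing')
```

The dog_bark row shows why its recall is low despite high precision: 58 of its 100 clips are assigned to other classes, most often children_playing.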

Note: To run the evaluation yourself, use python model/evaluate.py (see step 4 under "Try It Yourself!!").

Potential use cases:

Smart city monitoring (detecting sirens, construction noise, traffic sounds)
Assistive technologies for accessibility (alerting users to critical sounds)
Edge AI / IoT sound detection systems (home security, wildlife monitoring)
Supplementary sensor input for autonomous vehicles and robots (secondary to LiDAR and cameras)

Try It Yourself!!

1. Prepare the dataset (UrbanSound8K)

Download and extract UrbanSound8K to data/UrbanSound8K from https://urbansounddataset.weebly.com/download-urbansound8k.html

2. Set up a virtual environment

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

3. Train the model

python model/train.py

Trains with folds 1–8, validates on fold 9, and saves:

artifacts/cnn.pt
artifacts/cnn.pt.labels.json

4. Evaluate the model

python model/evaluate.py

Evaluates on fold 10 and prints the classification report and confusion matrix.

5. Run the FastAPI backend

export AUDIO_LOADER=librosa
uvicorn app.main:app --reload

Base URL: http://127.0.0.1:8000

Endpoints:

GET /health
GET /labels
GET /config
POST /predict (multipart file upload)
POST /spectrogram (multipart file upload)
WS /ws/predict (float32 PCM streaming)
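
For the WebSocket endpoint, a client has to turn audio samples into float32 PCM bytes before sending them. The sketch below shows one way to do that with the standard library only. The little-endian byte order, the chunk size, the 16 kHz sample rate, and the float32_chunks helper are assumptions; the message framing that /ws/predict actually expects should be checked against app/websocket_handler.py and config.py.

```python
import math
import struct


def float32_chunks(samples, chunk_size):
    """Yield little-endian float32 byte strings of at most chunk_size samples."""
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        yield struct.pack(f"<{len(chunk)}f", *chunk)


# One second of a 440 Hz tone at an assumed sample rate (not the project's value).
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

messages = list(float32_chunks(tone, chunk_size=4096))
print(len(messages), len(messages[0]), len(messages[-1]))
```

Each float32 sample occupies 4 bytes, so a 4096-sample chunk becomes a 16384-byte binary message; the last chunk is shorter when the clip length is not a multiple of the chunk size.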

Note: Instead of running steps 5-6 locally, you can run both services via the Docker section below.

6. Run the React frontend

cd frontend
npm install
npm run dev

Default API base: http://127.0.0.1:8000
Override with the VITE_API_BASE environment variable

Docker

docker compose up --build

Frontend: http://localhost:5173
API: http://localhost:8000

Testing

Run the full test suite:

pytest

Tests cover:

REST endpoints and responses
Model loading
Prediction windowing logic
Preprocessing + spectrogram metadata
WebSocket streaming

Key Configuration

Edit config.py to change:

Audio: SAMPLE_RATE, DURATION, N_MELS, N_FFT, HOP_LENGTH
Streaming: STREAM_DURATION, STREAM_N_MELS
Normalization: RMS_NORMALIZE, RMS_TARGET
Training: BATCH_SIZE, LEARNING_RATE, WEIGHT_DECAY, EPOCHS, SEED
Scheduler/Early stop: SCHEDULER_PATIENCE, SCHEDULER_FACTOR, EARLY_STOPPING_PATIENCE
Class balance: USE_CLASS_WEIGHTS
SpecAugment: SPEC_AUGMENT, SPEC_AUGMENT_STRENGTH, TIME_MASK_PARAM, FREQ_MASK_PARAM, NUM_TIME_MASKS, NUM_FREQ_MASKS
Waveform aug: AUG_TIME_STRETCH, TIME_STRETCH_RANGE, AUG_PITCH_SHIFT, PITCH_SHIFT_STEPS, AUG_NOISE, NOISE_STD, AUG_TIME_SHIFT, TIME_SHIFT_MAX_FRACTION
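
To see how the audio settings interact, the sketch below estimates the spectrogram dimensions they imply. The numeric values are placeholders rather than the project's defaults, the spectrogram_shape helper is hypothetical, and the frame count assumes a centered STFT (librosa's default), which may differ from what preprocessing/audio_features.py does.

```python
def spectrogram_shape(sample_rate, duration, n_mels, n_fft, hop_length):
    """Return (n_mels, num_frames, fft_bins) for a fixed-length clip."""
    num_samples = int(sample_rate * duration)
    # With center padding, one frame starts every hop_length samples.
    num_frames = 1 + num_samples // hop_length
    # A real-input FFT of size n_fft yields n_fft // 2 + 1 frequency bins,
    # which the mel filterbank then reduces to n_mels bands.
    fft_bins = n_fft // 2 + 1
    return n_mels, num_frames, fft_bins


# Placeholder values for illustration; check config.py for the real ones.
n_mels, num_frames, fft_bins = spectrogram_shape(
    sample_rate=22050, duration=4.0, n_mels=64, n_fft=2048, hop_length=512
)
print(n_mels, num_frames, fft_bins)  # 64 173 1025
```

Under these placeholder values the CNN input would be a 64 x 173 image per clip; halving HOP_LENGTH roughly doubles the time axis, while N_MELS sets the height directly.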

Environment Variables

ALLOWED_ORIGINS (comma-separated): override CORS allowlist
MAX_UPLOAD_MB: max upload size (default: 50)
AUDIO_LOADER: ffmpeg (default), soundfile, or librosa
FFMPEG_TIMEOUT_SECONDS: ffmpeg decode timeout (default: 20)
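
A settings loader for these variables might look like the sketch below. It uses the defaults listed above, but the load_settings function, the returned dictionary keys, and the treatment of an unset ALLOWED_ORIGINS as an empty list are assumptions, not the project's actual code.

```python
import os

VALID_LOADERS = {"ffmpeg", "soundfile", "librosa"}


def load_settings(env=None):
    """Read the documented environment variables, applying their defaults."""
    env = os.environ if env is None else env

    loader = env.get("AUDIO_LOADER", "ffmpeg")
    if loader not in VALID_LOADERS:
        raise ValueError(f"AUDIO_LOADER must be one of {sorted(VALID_LOADERS)}")

    origins_raw = env.get("ALLOWED_ORIGINS", "")
    return {
        # Assumption: an empty list means "keep the app's built-in allowlist".
        "allowed_origins": [o.strip() for o in origins_raw.split(",") if o.strip()],
        "max_upload_bytes": int(env.get("MAX_UPLOAD_MB", "50")) * 1024 * 1024,
        "audio_loader": loader,
        "ffmpeg_timeout_seconds": float(env.get("FFMPEG_TIMEOUT_SECONDS", "20")),
    }


# A plain dict stands in for os.environ so the example is deterministic.
settings = load_settings({
    "ALLOWED_ORIGINS": "http://localhost:5173, https://example.com",
    "MAX_UPLOAD_MB": "10",
})
print(settings)
```

Passing the environment as an argument keeps the loader easy to test without mutating the real process environment.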

Predict API Options

reduce_payload=true omits the spectrogram from the response to speed up long requests
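
As a small illustration, the option could be attached to a request URL as shown below. This assumes reduce_payload is read as a query parameter on POST /predict, which the text above suggests but does not state; the predict_url helper is hypothetical.

```python
from urllib.parse import urlencode


def predict_url(base_url, reduce_payload=False):
    """Build the /predict URL, optionally asking the server to skip the spectrogram."""
    url = base_url.rstrip("/") + "/predict"
    if reduce_payload:
        url += "?" + urlencode({"reduce_payload": "true"})
    return url


print(predict_url("http://127.0.0.1:8000", reduce_payload=True))
```

The audio file itself would still be sent as a multipart upload in the request body, as with any other call to the endpoint.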
