A full-stack ML project that classifies environmental sounds with deep learning, reaching ~71.9% accuracy on a held-out test fold. The system extracts log-mel spectrogram features from audio using signal processing and Fourier transforms, runs a 2D convolutional neural network (CNN) in PyTorch for inference, and provides a React-based web interface for file upload and real-time visualization. It also includes a FastAPI backend, real-time WebSocket streaming, and Docker support.
The model is trained on the UrbanSound8K dataset and is saved as artifacts/cnn.pt. UrbanSound8K contains 8,732 labeled environmental audio clips across 10 classes. Clips are up to 4 seconds long and are split into 10 predefined folds, with labels stored in a CSV containing file_path, label, and fold. Training uses folds 1–8, validation uses fold 9, and testing/evaluation uses fold 10, so evaluation happens on a held-out fold rather than random splits.
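The fold-based split described above can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the column names `file_path`, `label`, and `fold` come from the description above, while `split_by_fold` and the sample rows are hypothetical.

```python
import csv
import io

TRAIN_FOLDS = set(range(1, 9))  # folds 1-8
VAL_FOLD = 9
TEST_FOLD = 10


def split_by_fold(rows):
    """Group CSV rows into train/val/test using the predefined fold column."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        fold = int(row["fold"])
        if fold in TRAIN_FOLDS:
            splits["train"].append(row)
        elif fold == VAL_FOLD:
            splits["val"].append(row)
        elif fold == TEST_FOLD:
            splits["test"].append(row)
        else:
            raise ValueError(f"unexpected fold: {fold}")
    return splits


# Hypothetical rows standing in for the real CSV (paths are made up).
SAMPLE_CSV = """file_path,label,fold
fold1/a.wav,dog_bark,1
fold8/b.wav,siren,8
fold9/c.wav,drilling,9
fold10/d.wav,car_horn,10
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
splits = split_by_fold(rows)
print({name: len(items) for name, items in splits.items()})
# {'train': 2, 'val': 1, 'test': 1}
```

Splitting on the predefined folds, rather than shuffling clips at random, keeps every clip in fold 10 out of training and validation.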
The full stack is deployed on Render. Try it here: https://environmental-audio-cnn-classifier-ce92.onrender.com
Note: Render can be slow and may struggle with large uploads or long audio clips. For a much faster, smoother experience (and to handle larger requests), run this project locally by following the steps in the "Try It Yourself!!" section.
Audio classification pipeline:
- Log-mel spectrogram feature extraction through signal processing and Fourier transforms
- 2D convolutional neural network (CNN) trained on spectrograms (PyTorch)
- Class-balanced loss, learning-rate scheduling, and early stopping
- Optional waveform augmentation (time-stretch, pitch-shift, noise, time shift)
- Optional SpecAugment + RMS normalization to improve accuracy
- 50% overlap windowing for long audio with averaged predictions
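The last item, 50% overlap windowing with averaged predictions, can be sketched roughly as below. This is an illustration under assumptions, not the code in model/predict.py: `split_into_windows`, `average_predictions`, and the `predict_proba` callable are hypothetical names, and the handling of short clips and of the final partial window is one possible choice.

```python
import numpy as np


def split_into_windows(y, window_size):
    """Slice a 1-D signal into windows of `window_size` samples with 50% overlap."""
    hop = window_size // 2
    if len(y) <= window_size:
        # Assumption: short clips are zero-padded to one full window.
        return [np.pad(y, (0, window_size - len(y)))]
    starts = list(range(0, len(y) - window_size + 1, hop))
    if starts[-1] + window_size < len(y):
        # Assumption: cover the tail with one extra window flush with the end.
        starts.append(len(y) - window_size)
    return [y[s:s + window_size] for s in starts]


def average_predictions(windows, predict_proba):
    """Run `predict_proba` on each window and average the class probabilities."""
    probs = np.stack([np.asarray(predict_proba(w), dtype=float) for w in windows])
    return probs.mean(axis=0)


def toy_predict(window):
    """Stand-in for a model: two-class probabilities from the window mean."""
    m = float(np.mean(window))
    return [1.0 - m, m]


signal = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
windows = split_into_windows(signal, window_size=4)
print(len(windows))                               # 3
print(average_predictions(windows, toy_predict))  # [0.5 0.5]
```

Averaging probabilities across windows gives one prediction per clip even when the clip is longer than the model's fixed input length.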
FastAPI backend:
- REST endpoints for predictions and spectrograms
- WebSocket streaming for real-time inference
React frontend:
- File upload + microphone streaming
- Live spectrogram visualization with axes and colorbar
- Displays top predictions and confidence probabilities
Deployed:
- Live Render full stack demo for quick access
Pytest test suite:
- API tests (REST + WebSocket)
- Model loading tests
- Prediction logic and preprocessing tests
Dockerized:
- One-command containerized deployment for API + frontend
Environmental-Audio-CNN-Classifier/
├── app/ # FastAPI app + WebSocket handler
│ ├── main.py # API entry point + routes
│ ├── schemas.py # Pydantic request/response models
│ └── websocket_handler.py # Real-time streaming logic
├── model/ # CNN, training, evaluation, prediction
│ ├── cnn.py # AudioCNN architecture
│ ├── dataset.py # Dataset + SpecAugment
│ ├── train.py # Training script
│ ├── evaluate.py # Evaluation script
│ ├── predict.py # Inference + windowing
│ ├── load_model.py # Model loading helper
│ └── try_predict.py # Example prediction script
├── preprocessing/ # Audio features + utilities
│ ├── audio_features.py # Load audio + compute spectrogram
│ ├── prepare_urbansound8k.py # Build CSV from dataset
│ └── visualize_spectrogram.py # Spectrogram encoding + metadata
├── frontend/ # React UI (Vite)
│ ├── src/
│ ├── Dockerfile
│ └── nginx.conf
├── tests/ # Pytest suite
│ ├── test_api.py
│ ├── test_model.py
│ ├── test_predict.py
│ ├── test_preprocessing.py
│ └── test_websocket.py
├── artifacts/ # Saved model + labels (not committed)
├── data/ # Dataset + CSV (not committed)
├── config.py # Centralized settings
├── Dockerfile # Backend container
├── docker-compose.yml # Run API + frontend
├── requirements.txt
└── README.md
Machine Learning / Audio Processing
- PyTorch (CNN training + inference)
- Librosa + SoundFile (audio loading, log-mel spectrograms)
- NumPy + pandas (data handling, Fourier transforms)
- scikit-learn (evaluation metrics)
Backend
- FastAPI
- Pydantic
Deployment
- Render (frontend + backend hosting)
- Uvicorn
- WebSockets
Frontend
- React
- Vite
- CSS
DevOps / Tooling
- Docker
- Nginx
- Pytest
- HTTPX
From the current evaluation run on the held-out test fold (fold 10):
Classification Report:
                  precision  recall  f1-score  support
air_conditioner       0.656   0.800     0.721      100
car_horn              0.800   0.727     0.762       33
children_playing      0.629   0.830     0.716      100
dog_bark              0.977   0.420     0.587      100
drilling              0.954   0.620     0.752      100
engine_idling         0.618   0.505     0.556       93
gun_shot              0.653   1.000     0.790       32
jackhammer            0.667   0.833     0.741       96
siren                 0.692   0.892     0.779       83
street_music          0.839   0.780     0.808      100
accuracy                                0.719      837
macro avg             0.748   0.741     0.721      837
weighted avg          0.755   0.719     0.712      837
Confusion Matrix:
[[80 0 14 0 0 0 0 4 2 0]
[ 1 24 0 0 0 1 0 4 0 3]
[ 6 1 83 0 1 3 0 0 2 4]
[ 4 0 18 42 1 4 11 5 7 8]
[ 9 0 3 1 62 3 5 7 10 0]
[17 0 2 0 0 47 0 20 7 0]
[ 0 0 0 0 0 0 32 0 0 0]
[ 4 0 0 0 0 12 0 80 0 0]
[ 0 0 2 0 0 6 1 0 74 0]
[ 1 5 10 0 1 0 0 0 5 78]]
Note: To reproduce these numbers, run python model/evaluate.py (see the setup steps below).
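The report and the confusion matrix above are two views of the same counts. The sketch below recomputes precision, recall, F1, and accuracy directly from the matrix, reading rows as true classes and columns as predicted classes (the row sums match the support column). The helper `per_class_metrics` is a hypothetical name used only for illustration.

```python
LABELS = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]

# Confusion matrix from the evaluation run: rows = true class, columns = predicted.
CM = [
    [80, 0, 14, 0, 0, 0, 0, 4, 2, 0],
    [1, 24, 0, 0, 0, 1, 0, 4, 0, 3],
    [6, 1, 83, 0, 1, 3, 0, 0, 2, 4],
    [4, 0, 18, 42, 1, 4, 11, 5, 7, 8],
    [9, 0, 3, 1, 62, 3, 5, 7, 10, 0],
    [17, 0, 2, 0, 0, 47, 0, 20, 7, 0],
    [0, 0, 0, 0, 0, 0, 32, 0, 0, 0],
    [4, 0, 0, 0, 0, 12, 0, 80, 0, 0],
    [0, 0, 2, 0, 0, 6, 1, 0, 74, 0],
    [1, 5, 10, 0, 1, 0, 0, 0, 5, 78],
]


def per_class_metrics(cm, labels):
    """Return {label: (precision, recall, f1, support)} from a confusion matrix."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # all clips truly in class k
        predicted = sum(cm[r][k] for r in range(n))  # all clips predicted as class k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[labels[k]] = (precision, recall, f1, support)
    return metrics


metrics = per_class_metrics(CM, LABELS)
total = sum(sum(row) for row in CM)
accuracy = sum(CM[k][k] for k in range(len(CM))) / total

p, r, f1, support = metrics["dog_bark"]
print(f"dog_bark: precision={p:.3f} recall={r:.3f} f1={f1:.3f} support={support}")
# dog_bark: precision=0.977 recall=0.420 f1=0.587 support=100
print(f"accuracy={accuracy:.3f} over {total} clips")
# accuracy=0.719 over 837 clips
```

For dog_bark, the model is rarely wrong when it predicts the class (42 of 43 predictions are correct), but it misses more than half of the true dog barks (42 of 100 found). That explains the high precision and low recall in the report.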
- Smart city monitoring (detecting sirens, construction noise, traffic sounds)
- Assistive technologies for accessibility (alerting users to critical sounds)
- Edge AI / IoT sound detection systems (home security, wildlife monitoring)
- Inputs to autonomous vehicles and robots (secondary to LiDAR/Cameras)
Download and extract UrbanSound8K to data/UrbanSound8K from https://urbansounddataset.weebly.com/download-urbansound8k.html
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python model/train.py
Trains with folds 1–8, validates on fold 9, and saves:
- artifacts/cnn.pt
- artifacts/cnn.pt.labels.json
python model/evaluate.py
Evaluates on fold 10 and prints the classification report and confusion matrix.
export AUDIO_LOADER=librosa
uvicorn app.main:app --reload
Base URL: http://127.0.0.1:8000
Endpoints:
- GET /health
- GET /labels
- GET /config
- POST /predict (multipart file upload)
- POST /spectrogram (multipart file upload)
- WS /ws/predict (float32 PCM streaming)
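As a rough illustration of calling the multipart upload endpoint from a script, the sketch below builds a multipart/form-data request using only the Python standard library. Several details are assumptions rather than documented behavior: the form field name `file`, the `audio/wav` content type, and the idea that the response body is JSON. The helpers `build_multipart` and `predict_file` are hypothetical; check app/main.py and app/schemas.py for the real contract.

```python
import json
import os
import urllib.request
import uuid

API_BASE = "http://127.0.0.1:8000"  # base URL from the step above


def build_multipart(field_name, filename, payload, content_type="audio/wav"):
    """Encode one file as a multipart/form-data body; returns (body, header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def predict_file(path, field_name="file"):
    """POST an audio file to /predict. The field name "file" is an assumption."""
    with open(path, "rb") as fh:
        payload = fh.read()
    body, content_type = build_multipart(field_name, os.path.basename(path), payload)
    request = urllib.request.Request(
        f"{API_BASE}/predict",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        # Assumption: the endpoint answers with a JSON document.
        return json.loads(response.read())
```

With the backend running, calling `predict_file("clip.wav")` on a local audio file should return the parsed response. The same request shape would apply to POST /spectrogram if it accepts the same field.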
Note: Instead of running steps 5-6 locally, you can run both services via the Docker section below.
cd frontend
npm install
npm run dev
Default API base: http://127.0.0.1:8000
Override with: VITE_API_BASE
docker compose up --build
- Frontend: http://localhost:5173
- API: http://localhost:8000
Run the full test suite:
pytest
Tests cover:
- REST endpoints and responses
- Model loading
- Prediction windowing logic
- Preprocessing + spectrogram metadata
- WebSocket streaming
Edit config.py to change:
- Audio: SAMPLE_RATE, DURATION, N_MELS, N_FFT, HOP_LENGTH
- Streaming: STREAM_DURATION, STREAM_N_MELS
- Normalization: RMS_NORMALIZE, RMS_TARGET
- Training: BATCH_SIZE, LEARNING_RATE, WEIGHT_DECAY, EPOCHS, SEED
- Scheduler/Early stop: SCHEDULER_PATIENCE, SCHEDULER_FACTOR, EARLY_STOPPING_PATIENCE
- Class balance: USE_CLASS_WEIGHTS
- SpecAugment: SPEC_AUGMENT, SPEC_AUGMENT_STRENGTH, TIME_MASK_PARAM, FREQ_MASK_PARAM, NUM_TIME_MASKS, NUM_FREQ_MASKS
- Waveform aug: AUG_TIME_STRETCH, TIME_STRETCH_RANGE, AUG_PITCH_SHIFT, PITCH_SHIFT_STEPS, AUG_NOISE, NOISE_STD, AUG_TIME_SHIFT, TIME_SHIFT_MAX_FRACTION
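To show how the audio settings interact, here is a NumPy-only sketch of a log-mel spectrogram. It is an illustration of the math, not the project's implementation (the project uses Librosa for this step). The parameter values below are placeholders rather than the defaults in config.py, and all function names are hypothetical. The sketch uses the HTK-style mel formula and a Hann window.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_spectrogram(y, sr, n_fft, hop_length, n_mels):
    """Return a (n_mels, n_frames) log-mel spectrogram in decibels."""
    y = np.pad(y, n_fft // 2, mode="reflect")  # center the first frame on sample 0
    n_frames = 1 + (len(y) - n_fft) // hop_length
    window = np.hanning(n_fft)
    frames = np.stack(
        [y[i * hop_length:i * hop_length + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # short-time Fourier transform
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T    # pool FFT bins into mel bands
    return 10.0 * np.log10(np.maximum(mel, 1e-10))       # log compression, floor avoids log(0)


# Placeholder values standing in for SAMPLE_RATE, N_FFT, HOP_LENGTH, N_MELS.
sr, n_fft, hop_length, n_mels = 22050, 1024, 512, 64
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # one second of a 440 Hz tone
spec = log_mel_spectrogram(tone, sr, n_fft, hop_length, n_mels)
print(spec.shape)  # (64, 44)
```

N_MELS sets the height of the image fed to the CNN, while the clip length and HOP_LENGTH together set its width. Changing any of these after training changes the input shape the saved model expects.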
- ALLOWED_ORIGINS (comma-separated): override CORS allowlist
- MAX_UPLOAD_MB: max upload size (default: 50)
- AUDIO_LOADER: ffmpeg (default), soundfile, or librosa
- FFMPEG_TIMEOUT_SECONDS: ffmpeg decode timeout (default: 20)
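One way such environment variables could be parsed is sketched below. The variable names and defaults come from the list above; everything else is an assumption for illustration, including the `load_settings` helper, the returned dictionary keys, and the choice to treat an unset ALLOWED_ORIGINS as an empty list. The project's own parsing may differ.

```python
import os

VALID_LOADERS = {"ffmpeg", "soundfile", "librosa"}


def load_settings(env=None):
    """Read the documented environment variables into a plain dict."""
    env = os.environ if env is None else env

    # Comma-separated list; blank entries and surrounding spaces are dropped.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]

    loader = env.get("AUDIO_LOADER", "ffmpeg").strip().lower()
    if loader not in VALID_LOADERS:
        raise ValueError(f"AUDIO_LOADER must be one of {sorted(VALID_LOADERS)}")

    return {
        "allowed_origins": origins,  # empty list here means "not overridden"
        "max_upload_bytes": int(env.get("MAX_UPLOAD_MB", "50")) * 1024 * 1024,
        "audio_loader": loader,
        "ffmpeg_timeout_seconds": float(env.get("FFMPEG_TIMEOUT_SECONDS", "20")),
    }


settings = load_settings({"ALLOWED_ORIGINS": "http://localhost:5173", "AUDIO_LOADER": "librosa"})
print(settings["allowed_origins"], settings["audio_loader"])
# ['http://localhost:5173'] librosa
```

Passing a dictionary instead of reading `os.environ` directly keeps the parsing easy to test without touching the real process environment.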
reduce_payload=true skips the spectrogram payload to speed up long requests