SonoNet – Speech Emotion Recognition Analytics

A deep learning project for Speech Emotion Recognition (SER) that benchmarks three complementary neural approaches: CNN (mel-spectrogram), fine-tuned wav2vec 2.0 (raw waveform), and BiGRU + Self-Attention (mel-spectrogram sequence) on the CREMA-D dataset.

Project Goals

Classify speech clips into 6 emotion classes:
- Anger, Disgust, Fear, Happiness, Neutral, Sadness
Compare how different modeling paradigms behave on SER:
- Spectral (image-like) learning via CNNs
- Self-supervised pretrained speech representations via wav2vec 2.0
- Sequential temporal modeling via BiGRU + attention
Improve robustness under dataset constraints using preprocessing + augmentation + regularization

Dataset

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)

7,000+ audio recordings
91 actors (48 male, 43 female), ages 20–74
6 categorical emotions: anger, disgust, fear, happy, neutral, sadness
12 fixed sentence prompts (identified by a 3-letter acronym in filenames)
4 emotion intensity levels: Low (LO), Medium (MD), High (HI), Unspecified (XX)

Note: The dataset includes intensity in filenames, but the focus here is 6-way emotion classification.
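The labels can be read straight from the filenames. The sketch below assumes the published CREMA-D naming scheme (ActorID_Sentence_Emotion_Level.wav, with codes such as ANG or SAD); the helper name parse_crema_filename is hypothetical and not taken from the notebooks.

```python
from pathlib import Path

# Code-to-label maps, assuming the standard CREMA-D filename codes.
EMOTIONS = {
    "ANG": "anger", "DIS": "disgust", "FEA": "fear",
    "HAP": "happiness", "NEU": "neutral", "SAD": "sadness",
}
INTENSITIES = {"LO": "low", "MD": "medium", "HI": "high", "XX": "unspecified"}


def parse_crema_filename(path):
    """Split a name like '1001_DFA_ANG_XX.wav' into its four fields."""
    actor_id, sentence, emotion, intensity = Path(path).stem.split("_")
    return {
        "actor_id": int(actor_id),
        "sentence": sentence,            # 3-letter sentence acronym
        "emotion": EMOTIONS[emotion],    # target label for 6-way classification
        "intensity": INTENSITIES[intensity],
    }


info = parse_crema_filename("AudioWAV/1001_DFA_ANG_XX.wav")
print(info["emotion"], info["intensity"])  # anger unspecified
```

Only the emotion field is used as the training target here; the intensity field is parsed but ignored.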

High-Level System Architecture

End-to-end flow

Ingest audio from CREMA-D
Standardize duration (pad/trim)
Feature representation (depends on model family)
Train model
Evaluate using accuracy, precision/recall/F1, and confusion matrices

Two input paradigms used

Mel-spectrogram pipeline → used by CNN and BiGRU
Raw waveform pipeline → used by wav2vec 2.0

Exploratory Data Analysis Highlights

Dataset is largely balanced across emotion classes
- Most classes have ~1,271 clips
- Neutral has slightly fewer (~1,087)
Mean audio duration is about 2.54 seconds, with per-emotion averages roughly 2.34–2.78 seconds
These findings informed the use of a fixed 3-second target window for spectrogram-based models

Preprocessing and Data Preparation

Spectrogram-based preprocessing (CNN + BiGRU)

Audio is converted into a fixed-size log-mel spectrogram:

Sample rate: 16 kHz
Target duration: 3 seconds
- Pad shorter clips
- Truncate longer clips
Mel-spectrogram parameters:
- 128 mel bins
- FFT window (n_fft): 2048
- Hop length: 512
Convert power spectrogram → decibel (log) scale
Pad/trim in time axis to produce consistent tensor shapes for batch training
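As a rough illustration of the fixed-size step, the sketch below pads or trims a waveform to 3 seconds and works out the resulting spectrogram shape. It assumes zero-padding on the right and a centered STFT (the default in librosa, where the frame count is 1 + n_samples // hop_length); the notebooks may pad differently, and the function names are illustrative.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz
TARGET_SECONDS = 3
HOP_LENGTH = 512
N_MELS = 128


def pad_or_trim(wave, target_len=SAMPLE_RATE * TARGET_SECONDS):
    """Truncate long clips and right-pad short clips with zeros (assumed policy)."""
    wave = np.asarray(wave, dtype=np.float32)
    if len(wave) >= target_len:
        return wave[:target_len]
    return np.pad(wave, (0, target_len - len(wave)))


def n_frames(n_samples, hop_length=HOP_LENGTH):
    """Frame count of a centered STFT over n_samples samples."""
    return 1 + n_samples // hop_length


clip = pad_or_trim(np.zeros(40_000))          # a 2.5 s clip of silence
print(clip.shape, (N_MELS, n_frames(len(clip))))  # (48000,) (128, 94)
```

Under these assumptions every clip becomes a 128 x 94 log-mel tensor, which is the shape assumed by the other sketches in this README.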

Raw waveform preprocessing (wav2vec 2.0)

Audio is passed as raw waveform to wav2vec2’s feature extractor / frontend
Avoids manual feature engineering (e.g., MFCCs or mel-spectrograms) by leveraging pretrained representations

Data Augmentation and Regularization Techniques

To reduce overfitting and increase robustness:

SpecAugment (spectrogram augmentation)
- Random time masking and frequency masking
Gaussian noise injection
Normalization
Dropout (notably effective when increased deeper in the CNN)
Early stopping (stop when validation loss fails to improve)
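A minimal sketch of three of these techniques follows. The mask widths, noise level, and patience are placeholder values rather than the settings used in the notebooks, and whether noise is injected into the waveform or the spectrogram is not stated here, so add_gaussian_noise accepts either array.

```python
import numpy as np


def spec_augment(spec, rng, max_freq_mask=16, max_time_mask=12):
    """Mask one random frequency band and one random time span (copy returned)."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = out.min()  # quietest value of the log-scaled spectrogram
    f = rng.integers(0, max_freq_mask + 1)      # band height, may be 0
    f0 = rng.integers(0, n_mels - f + 1)
    out[f0:f0 + f, :] = fill
    t = rng.integers(0, max_time_mask + 1)      # span width, may be 0
    t0 = rng.integers(0, n_frames - t + 1)
    out[:, t0:t0 + t] = fill
    return out


def add_gaussian_noise(x, rng, std=0.01):
    """Add zero-mean Gaussian noise with standard deviation std."""
    return x + rng.normal(0.0, std, size=x.shape)


class EarlyStopping:
    """Signal a stop after `patience` epochs without validation-loss improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


rng = np.random.default_rng(0)
augmented = spec_augment(rng.normal(size=(128, 94)), rng)
print(augmented.shape)  # (128, 94)
```

Augmentation would be applied to training batches only; validation and test spectrograms stay untouched.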

Models Used

SonoNet benchmarks three model families. Each is designed to capture a different aspect of emotional information in speech.

1) Spectrogram CNN (Baseline)

Input: log-mel spectrogram (image-like time–frequency map)

Motivation: Use CNNs to learn hierarchical spectral patterns associated with emotional cues (pitch-energy distribution, timbre, and spectral-temporal activations).

Architecture:

4 convolutional blocks
- Convolution + ReLU
- Max pooling
- Batch normalization
- Progressively increasing dropout in downstream layers
Final classifier for 6 emotion categories
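The block structure above could look roughly like the following PyTorch sketch. The channel widths, kernel size, dropout schedule, and global-average-pooling head are assumptions chosen for illustration; the notebook's actual values may differ.

```python
import torch
from torch import nn


class SpectrogramCNN(nn.Module):
    """Four conv blocks with dropout that increases with depth (illustrative values)."""

    def __init__(self, n_classes=6, channels=(16, 32, 64, 128),
                 dropouts=(0.1, 0.2, 0.3, 0.4)):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch, p in zip(channels, dropouts):
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.BatchNorm2d(out_ch),
                nn.Dropout2d(p),          # progressively larger p in later blocks
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)   # assumed head: global average pooling
        self.classifier = nn.Linear(in_ch, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, frames)
        x = self.features(x)
        return self.classifier(self.pool(x).flatten(1))


model = SpectrogramCNN().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 1, 128, 94))
print(logits.shape)  # torch.Size([2, 6])
```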

Results:

Iteration showed clear tradeoffs:
- smaller CNNs underfit
- deeper CNNs overfit
- fixed dropout alone wasn’t sufficient
A strong improvement came from combining:
- max pooling
- progressive dropout
- SpecAugment + controlled Gaussian noise injection
Final reported performance: ~65% accuracy on both training and test sets, with stable convergence after augmentation was introduced.

CNN Diagrams (accuracy, loss, confusion):

2) Fine-tuned wav2vec 2.0 (Transformer-based)

Input: raw waveform

Motivation: Leverage self-supervised pretrained speech representations to reduce manual feature engineering and improve emotion recognition by transferring robust acoustic representations into the SER task.

Architecture (high level):

Feature extractor: 7 stacked temporal convolution layers (downsampling + low-level acoustic pattern learning)
Feature projection: linear map into transformer embedding space
Transformer encoder: multi-head self-attention + feedforward blocks (with positional info, dropout, layer norm)
Classification head: linear layer mapping to 6 outputs
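To make the downsampling concrete, the sketch below computes how many frames the 7-layer convolutional feature extractor hands to the transformer. The kernel and stride values are those of the published wav2vec 2.0 base configuration; this README does not say which checkpoint was fine-tuned, so treat them as an assumption.

```python
import math

# Kernel sizes and strides of the 7 temporal conv layers (wav2vec 2.0 base config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def encoder_frames(n_samples):
    """Number of feature frames produced from n_samples raw audio samples."""
    length = n_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1   # unpadded 1-D convolution
    return length


total_stride = math.prod(CONV_STRIDES)
print(total_stride)                # 320 samples, i.e. 20 ms at 16 kHz
print(encoder_frames(16_000))      # 49 frames for 1 second of audio
print(encoder_frames(48_000))      # 149 frames for 3 seconds of audio
```

The classification head therefore sees on the order of 50 frame vectors per second, which are typically pooled (for example by averaging) before the final linear layer.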

Results:

Reported performance: ~69% test accuracy, showing strong transfer from pretrained wav2vec2 representations.
Training converged quickly, but fine-tuning exhibited overfitting:
- overfitting emerged around epochs 6–8
- training accuracy rose rapidly from ~50% to ~90% in later epochs
- validation accuracy plateaued at ~68%
- the train–validation gap widened to roughly 20–22 percentage points
Confusion-matrix analysis highlights recurring confusions among prosodically similar emotions (notably sadness ↔ fear), and difficulty with disgust.

wav2vec 2.0 Diagrams (accuracy, loss, confusion):

3) BiGRU + Self-Attention (Sequential Spectrogram Model)

Input: log-mel spectrogram interpreted as a sequence of spectral vectors over time

Motivation: Model temporal dynamics of emotion (prosody, pitch contours, energy shifts) using recurrent layers, then use attention to focus on the most emotionally salient frames.

Architecture:

3 stacked Bidirectional GRU layers
- Captures both forward + backward context across the utterance
Self-attention pooling
- Produces a weighted context vector
- Emphasizes emotionally salient segments (e.g., stressed syllables, abrupt tonal changes)
Dropout
Fully connected classifier
Training techniques referenced include:
- AdamW with weight decay
- Gradient clipping
- ReduceLROnPlateau
- SpecAugment
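A hedged PyTorch sketch of this architecture and the listed training techniques is shown below. Hidden size, dropout, learning rate, weight decay, and scheduler settings are placeholder assumptions, and the attention is a simple learned scoring layer, which may differ from the notebook's implementation.

```python
import torch
from torch import nn


class BiGRUAttention(nn.Module):
    """3-layer BiGRU followed by attention pooling over time (illustrative sizes)."""

    def __init__(self, n_mels=128, hidden=128, n_classes=6, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True,
                          bidirectional=True, dropout=dropout)
        self.score = nn.Linear(2 * hidden, 1)      # one relevance score per frame
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, frames, n_mels), i.e. the spectrogram transposed to time-major
        h, _ = self.gru(x)                                    # (batch, frames, 2*hidden)
        weights = torch.softmax(self.score(h).squeeze(-1), dim=1)
        context = (weights.unsqueeze(-1) * h).sum(dim=1)      # weighted context vector
        return self.classifier(self.dropout(context)), weights


def train_step(model, optimizer, batch, labels, max_norm=1.0):
    """One optimization step with gradient clipping."""
    model.train()
    optimizer.zero_grad()
    logits, _ = model(batch)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()


model = BiGRUAttention()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# Call scheduler.step(val_loss) once per epoch to lower the LR when validation stalls.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)
```

The returned attention weights sum to 1 over the time axis, so they can be plotted against the spectrogram to see which frames the model relies on.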

Results:

Reported performance: ~65% validation accuracy, consistent with the plotted curves.
Confusion matrix shows strong diagonal dominance with notable confusions such as fear → sadness (a common SER challenge, since both are negative-valence emotions with overlapping acoustic traits).
Training dynamics (from the report’s curves):
- training loss decreases steadily (approx. 1.6 → 0.8 over ~50 epochs)
- training accuracy rises to ~70%
- test/validation accuracy plateaus around ~65%
- a small train–test gap (~70% vs. ~65%) suggests mild overfitting, which regularization + SpecAugment reduced

BiGRU Diagrams (accuracy, loss, confusion):

Evaluation Metrics

Models are evaluated using:

Accuracy
Precision / Recall / F1 (class-wise)
Confusion matrices
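All three metric types can be derived from a confusion matrix. The following is a plain-Python sketch of those definitions, using a toy 2-class matrix rather than real SonoNet results; the function names are illustrative.

```python
def accuracy(confusion):
    """Fraction of clips on the diagonal; confusion[i][j] = true i, predicted j."""
    total = sum(map(sum, confusion))
    return sum(confusion[k][k] for k in range(len(confusion))) / total


def per_class_metrics(confusion):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))   # column sum
        actual_k = sum(confusion[k])                           # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


toy = [[8, 2],   # true class 0: 8 correct, 2 predicted as class 1
       [1, 9]]   # true class 1: 1 predicted as class 0, 9 correct
print(accuracy(toy))                 # 0.85
print(per_class_metrics(toy)[0])     # precision 8/9, recall 8/10 for class 0
```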

Per-Class Metrics

Emotion	CNN (Precision/Recall/F1)	wav2vec2 (Precision/Recall/F1)	BiGRU (Precision/Recall/F1)
Anger	0.81 / 0.63 / 0.71	0.72 / 0.85 / 0.78	0.74 / 0.81 / 0.78
Disgust	0.51 / 0.66 / 0.58	0.80 / 0.39 / 0.52	0.64 / 0.57 / 0.60
Fear	0.59 / 0.54 / 0.56	0.45 / 0.81 / 0.58	0.57 / 0.52 / 0.54
Happiness	0.67 / 0.57 / 0.62	0.73 / 0.65 / 0.69	0.61 / 0.60 / 0.61
Neutral	0.76 / 0.64 / 0.69	0.75 / 0.72 / 0.74	0.65 / 0.72 / 0.68
Sadness	0.55 / 0.72 / 0.62	0.63 / 0.41 / 0.50	0.57 / 0.59 / 0.58
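For a single-number comparison, the per-class F1 scores above can be averaged into a macro-F1. This figure is not reported in the original evaluation; it is derived here from the rounded table values, so the third decimal should not be over-interpreted.

```python
# Per-class F1 copied from the table above, in the order:
# anger, disgust, fear, happiness, neutral, sadness.
F1_SCORES = {
    "CNN":      [0.71, 0.58, 0.56, 0.62, 0.69, 0.62],
    "wav2vec2": [0.78, 0.52, 0.58, 0.69, 0.74, 0.50],
    "BiGRU":    [0.78, 0.60, 0.54, 0.61, 0.68, 0.58],
}

# Macro-F1: unweighted mean over the six classes.
macro_f1 = {name: sum(scores) / len(scores) for name, scores in F1_SCORES.items()}

for name, value in macro_f1.items():
    print(f"{name}: {value:.2f}")
```

On this measure the three models land within about 0.01 of each other (roughly 0.63), so their differences show up mainly in which classes each handles well rather than in the average.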

Common Error Patterns Observed

Disgust / Fear / Sadness are frequently confusable (shared negative-valence acoustic traits)
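wav2vec2 is strongest on anger and neutral but unbalanced elsewhere: high precision with low recall on disgust (0.80 / 0.39) and sadness (0.63 / 0.41), and the reverse on fear (0.45 / 0.81), consistent with the confusion-matrix discussion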
BiGRU shows notable fear ↔ sadness confusions, consistent with prosodic overlap

Tooling and Implementation Stack

PyTorch for model training and experimentation
librosa for mel-spectrogram extraction and audio preprocessing (spectrogram models)
wav2vec2 via the Hugging Face / PyTorch ecosystem for pretrained raw speech modeling
Training performed on GPU (NVIDIA T4)

Limitations

Compute constraints (single T4 GPU) limited:
- number of experiments / hyperparameter sweeps
- sequence length / model depth exploration
Overfitting observed, especially for larger-capacity models (wav2vec2, BiGRU)

Future Improvements

Stronger overfitting mitigation for wav2vec2 (e.g., freezing strategies, stronger regularization)
Targeted augmentation to separate confusable classes (pitch/tempo perturbations)
Hybrid/ensemble approaches combining CNN spectral learning + wav2vec2 representations
Incorporate arousal/intensity features into the models
