A deep learning project for Speech Emotion Recognition (SER) that benchmarks three complementary neural approaches: CNN (mel-spectrogram), fine-tuned wav2vec 2.0 (raw waveform), and BiGRU + Self-Attention (mel-spectrogram sequence) on the CREMA-D dataset.
- Classify speech clips into 6 emotion classes:
- Anger, Disgust, Fear, Happiness, Neutral, Sadness
- Compare how different modeling paradigms behave on SER:
- Spectral (image-like) learning via CNNs
- Self-supervised pretrained speech representations via wav2vec 2.0
- Sequential temporal modeling via BiGRU + attention
- Improve robustness under dataset constraints using preprocessing + augmentation + regularization
CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
- 7,000+ audio recordings
- 91 actors (48 male, 43 female), ages 20–74
- 6 categorical emotions: anger, disgust, fear, happy, neutral, sadness
- 12 fixed sentence prompts (identified by a 3-letter acronym in filenames)
- 4 emotion intensity levels: Low (LO), Medium (MD), High (HI), Unspecified (XX)
Note: The dataset includes intensity in filenames, but the focus here is 6-way emotion classification.
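The filename metadata described above can be turned into labels with a small parser. This is a minimal sketch that assumes the standard CREMA-D naming scheme `ActorID_Sentence_Emotion_Level.wav` (for example `1001_DFA_ANG_XX.wav`); the `ClipInfo` type and `parse_crema_filename` helper are illustrative names, not part of this project's code.

```python
from pathlib import Path
from typing import NamedTuple

# Emotion and intensity codes as they appear in CREMA-D filenames.
EMOTIONS = {
    "ANG": "anger", "DIS": "disgust", "FEA": "fear",
    "HAP": "happy", "NEU": "neutral", "SAD": "sadness",
}
INTENSITIES = {"LO": "low", "MD": "medium", "HI": "high", "XX": "unspecified"}


class ClipInfo(NamedTuple):
    """Hypothetical container for the metadata encoded in one filename."""
    actor_id: int
    sentence: str
    emotion: str
    intensity: str


def parse_crema_filename(path: str) -> ClipInfo:
    """Split 'ActorID_Sentence_Emotion_Level.wav' into its four fields."""
    actor, sentence, emotion_code, level_code = Path(path).stem.split("_")
    return ClipInfo(
        actor_id=int(actor),
        sentence=sentence,
        emotion=EMOTIONS[emotion_code],
        # Unknown level codes fall back to "unspecified" instead of failing.
        intensity=INTENSITIES.get(level_code, "unspecified"),
    )


info = parse_crema_filename("AudioWAV/1001_DFA_ANG_XX.wav")
print(info.emotion, info.intensity)
```

Only the `emotion` field is needed for the 6-way task; the intensity field is parsed but otherwise unused.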
- Ingest audio from CREMA-D
- Standardize duration (pad/trim)
- Feature representation (depends on model family)
- Train model
- Evaluate using accuracy, precision/recall/F1, and confusion matrices
- Mel-spectrogram pipeline → used by CNN and BiGRU
- Raw waveform pipeline → used by wav2vec 2.0
- Dataset is largely balanced across emotion classes
- Most classes have ~1,271 clips
- Neutral has slightly fewer (~1,087)
- Mean audio duration is about 2.54 seconds, with per-emotion averages roughly 2.34–2.78 seconds
- These findings informed the use of a fixed 3-second target window for spectrogram-based models
Audio is converted into a fixed-size log-mel spectrogram:
- Sample rate: 16 kHz
- Target duration: 3 seconds
- Pad shorter clips
- Truncate longer clips
- Mel-spectrogram parameters:
- 128 mel bins
- FFT window (n_fft): 2048
- Hop length: 512
- Convert power spectrogram → decibel (log) scale
- Pad/trim in time axis to produce consistent tensor shapes for batch training
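The preprocessing parameters above can be sketched with librosa as follows. The function names are illustrative, and two details are assumptions rather than facts from the project: clips are zero-padded at the end, and decibel conversion uses `ref=np.max`. With librosa's default `center=True`, a 3-second clip at 16 kHz yields `1 + 48000 // 512 = 94` frames, so each spectrogram has shape `(128, 94)`.

```python
import numpy as np

SAMPLE_RATE = 16_000
TARGET_SECONDS = 3
N_FFT, HOP_LENGTH, N_MELS = 2048, 512, 128
TARGET_SAMPLES = SAMPLE_RATE * TARGET_SECONDS  # 48,000 samples


def pad_or_trim(y: np.ndarray, target: int = TARGET_SAMPLES) -> np.ndarray:
    """Truncate long clips; zero-pad short clips at the end (assumed policy)."""
    if len(y) >= target:
        return y[:target]
    return np.pad(y, (0, target - len(y)))


def expected_frames(n_samples: int = TARGET_SAMPLES, hop: int = HOP_LENGTH) -> int:
    """Frame count for a centered STFT (librosa's default center=True)."""
    return 1 + n_samples // hop


def log_mel(y: np.ndarray) -> np.ndarray:
    """Return a (N_MELS, frames) log-mel spectrogram in decibels."""
    import librosa  # imported here so the helpers above work without librosa

    y = pad_or_trim(y)
    power = librosa.feature.melspectrogram(
        y=y, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS
    )
    return librosa.power_to_db(power, ref=np.max)


print(expected_frames())
```

Because every waveform is padded or trimmed before the transform, the time axis already has a fixed length and batches can be stacked directly.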
- Audio is passed as raw waveform to wav2vec2’s feature extractor / frontend
- Avoids manual feature engineering (e.g., MFCCs or mel-spectrograms) by leveraging pretrained representations
To reduce overfitting and increase robustness:
- SpecAugment (spectrogram augmentation)
- Random time masking and frequency masking
- Gaussian noise injection
- Normalization
- Dropout (most effective when the rate increases in deeper CNN layers)
- Early stopping (stop when validation loss fails to improve)
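The masking and noise steps listed above could look like the sketch below. It applies one frequency mask and one time mask per spectrogram and fills masked regions with the spectrogram mean; the mask widths, the fill value, and the noise level are illustrative assumptions, not the values used in this project.

```python
import numpy as np


def spec_augment(
    spec: np.ndarray,
    rng: np.random.Generator,
    max_freq_mask: int = 16,   # assumed maximum mask width in mel bins
    max_time_mask: int = 20,   # assumed maximum mask width in frames
    noise_std: float = 0.0,    # 0 disables Gaussian noise injection
) -> np.ndarray:
    """Return an augmented copy of a (n_mels, n_frames) spectrogram."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()

    # Frequency mask: blank a random band of consecutive mel bins.
    f = int(rng.integers(0, max_freq_mask + 1))
    f0 = int(rng.integers(0, n_mels - f + 1))
    out[f0:f0 + f, :] = fill

    # Time mask: blank a random run of consecutive frames.
    t = int(rng.integers(0, max_time_mask + 1))
    t0 = int(rng.integers(0, n_frames - t + 1))
    out[:, t0:t0 + t] = fill

    # Optional Gaussian noise, added after masking.
    if noise_std > 0:
        out = out + rng.normal(0.0, noise_std, size=out.shape)
    return out


rng = np.random.default_rng(0)
augmented = spec_augment(rng.normal(size=(128, 94)), rng, noise_std=0.05)
print(augmented.shape)
```

Augmentation of this kind is normally applied only to training batches, so validation and test spectrograms stay unmodified.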
SonoNet benchmarks three model families. Each is designed to capture a different aspect of emotional information in speech.
Input: log-mel spectrogram (image-like time–frequency map)
Motivation: Use CNNs to learn hierarchical spectral patterns associated with emotional cues (pitch-energy distribution, timbre, and spectral-temporal activations).
Architecture:
- 4 convolutional blocks
- Convolution + ReLU
- Max pooling
- Batch normalization
- Progressively increasing dropout in downstream layers
- Final classifier for 6 emotion categories
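A PyTorch sketch of a network with this layout is shown below. The layer order inside each block follows the list above, but the channel widths, kernel size, dropout rates, and the global-average-pooling head are assumptions made for illustration; `EmotionCNN` is a hypothetical name and not necessarily the model trained here.

```python
import torch
from torch import nn


def conv_block(c_in: int, c_out: int, p_drop: float) -> nn.Sequential:
    """Convolution + ReLU, max pooling, batch norm, then dropout."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.BatchNorm2d(c_out),
        nn.Dropout2d(p_drop),
    )


class EmotionCNN(nn.Module):
    """Four conv blocks with progressively increasing dropout (assumed rates)."""

    def __init__(
        self,
        n_classes: int = 6,
        channels: tuple = (16, 32, 64, 128),
        dropouts: tuple = (0.1, 0.2, 0.3, 0.4),
    ) -> None:
        super().__init__()
        blocks, c_in = [], 1
        for c_out, p_drop in zip(channels, dropouts):
            blocks.append(conv_block(c_in, c_out, p_drop))
            c_in = c_out
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapses the time-frequency map
        self.classifier = nn.Linear(c_in, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, n_frames) log-mel spectrograms
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)  # unnormalized logits, one per emotion


model = EmotionCNN().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 1, 128, 94))
print(logits.shape)
```

The dropout rate grows with depth because later blocks hold more channels and are more prone to memorizing the training set.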
Results:
- Iteration showed clear tradeoffs:
- smaller CNNs underfit
- deeper CNNs overfit
- fixed dropout alone wasn’t sufficient
- A strong improvement came from combining:
- max pooling
- progressive dropout
- SpecAugment + controlled Gaussian noise injection
- Final reported performance: ~65% accuracy on both training and test sets, with stable convergence after augmentation was introduced.
CNN Diagrams (accuracy, loss, confusion):
Input: raw waveform
Motivation: Leverage self-supervised pretrained speech representations to reduce manual feature engineering and improve emotion recognition by transferring robust acoustic representations into the SER task.
Architecture (high level):
- Feature extractor: 7 stacked temporal convolution layers (downsampling + low-level acoustic pattern learning)
- Feature projection: linear map into transformer embedding space
- Transformer encoder: multi-head self-attention + feedforward blocks (with positional info, dropout, layer norm)
- Classification head: linear layer mapping to 6 outputs
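To make the downsampling in the feature extractor concrete, the sketch below computes how many frames reach the transformer for a given waveform length. It assumes the kernel sizes and strides of the standard wav2vec 2.0 base configuration; the README does not state which checkpoint was fine-tuned, so treat these values as an assumption.

```python
import math

# Kernel sizes and strides of the 7 temporal convolutions in the standard
# wav2vec 2.0 base configuration (assumed; the checkpoint is not named here).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def wav2vec2_frames(n_samples: int) -> int:
    """Number of feature frames produced for a waveform of n_samples."""
    length = n_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


total_stride = math.prod(CONV_STRIDES)   # 320 samples = 20 ms at 16 kHz
print(total_stride, wav2vec2_frames(3 * 16_000))  # 320 149
```

Under these assumptions a 3-second clip becomes a sequence of 149 vectors, roughly one every 20 ms, which the classification head must pool into a single 6-way prediction.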
Results:
- Reported performance: ~69% test accuracy, showing strong transfer from pretrained wav2vec2 representations.
- Training converged quickly, but fine-tuning exhibited overfitting:
- overfitting emerges around epochs 6–8
- training accuracy rose rapidly from ~50% to ~90% by later epochs
- validation accuracy plateaued around ~68%
- widening gap of roughly 20–22 percentage points
- Confusion-matrix analysis highlights recurring confusions among prosodically similar emotions (notably sadness ↔ fear), and difficulty with disgust.
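A pattern like the one above, where validation performance stalls while training accuracy keeps climbing, is what early stopping on validation loss is meant to catch. The helper below is a minimal sketch; the `EarlyStopping` class, its `patience` value, and the loss sequence are hypothetical and do not come from the reported runs.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation losses: improvement stalls after the fourth epoch.
val_losses = [1.20, 1.05, 0.98, 0.95, 0.97, 0.99, 1.02, 1.08]
stopper = EarlyStopping(patience=3)
stop_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stop_epoch = epoch
        break
print(stop_epoch, stopper.best)  # 7 0.95
```

In practice the model weights from the best epoch are saved and restored, so the final checkpoint reflects the lowest validation loss and not the last epoch trained.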
wav2vec 2.0 Diagrams (accuracy, loss, confusion):

Input: log-mel spectrogram interpreted as a sequence of spectral vectors over time
Motivation: Model temporal dynamics of emotion (prosody, pitch contours, energy shifts) using recurrent layers, then use attention to focus on the most emotionally salient frames.
Architecture:
- 3 stacked Bidirectional GRU layers
- Captures both forward + backward context across the utterance
- Self-attention pooling
- Produces a weighted context vector
- Emphasizes emotionally salient segments (e.g., stressed syllables, abrupt tonal changes)
- Dropout
- Fully connected classifier
- Training techniques referenced include:
- AdamW with weight decay
- Gradient clipping
- ReduceLROnPlateau
- SpecAugment
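The architecture and training techniques above can be combined in a short PyTorch sketch. The pooling layer here is a simple learned-score attention over GRU outputs, which is one common way to implement attention pooling and may differ from the project's exact formulation. Hidden size, dropout rate, learning rate, weight decay, clipping norm, and scheduler settings are all illustrative assumptions.

```python
import torch
from torch import nn


class AttentionPooling(nn.Module):
    """Score each frame, softmax over time, and return the weighted sum."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor):
        # h: (batch, frames, dim)
        weights = torch.softmax(self.score(h).squeeze(-1), dim=1)  # (batch, frames)
        context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)    # (batch, dim)
        return context, weights


class BiGRUAttention(nn.Module):
    def __init__(self, n_mels: int = 128, hidden: int = 128,
                 n_classes: int = 6, p_drop: float = 0.3) -> None:
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True,
                          bidirectional=True, dropout=p_drop)
        self.attn = AttentionPooling(2 * hidden)
        self.dropout = nn.Dropout(p_drop)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec: torch.Tensor):
        # spec: (batch, n_mels, frames) -> sequence of per-frame spectral vectors
        h, _ = self.gru(spec.transpose(1, 2))
        context, weights = self.attn(h)
        return self.fc(self.dropout(context)), weights


model = BiGRUAttention()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)


def train_step(spec: torch.Tensor, labels: torch.Tensor, max_norm: float = 1.0) -> float:
    """One optimization step with gradient-norm clipping."""
    model.train()
    optimizer.zero_grad()
    logits, _ = model(spec)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```

After each epoch, calling `scheduler.step(val_loss)` lowers the learning rate when validation loss stops improving. The returned attention weights can also be plotted against time to see which frames the model treated as most salient.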
Results:
- Reported validation accuracy: ~65%, consistent with the plotted training curves.
- Confusion matrix shows strong diagonal dominance with notable confusions such as fear → sadness (a common SER challenge due to overlapping low-arousal/negative-valence acoustic traits).
- Training dynamics (from the report’s curves):
- training loss decreases steadily (approx. 1.6 → 0.8 over ~50 epochs)
- training accuracy rises to ~70%
- test/validation accuracy plateaus around ~65%
- the small train–test gap (about 5 percentage points) indicates mild overfitting, which regularization and SpecAugment helped reduce
BiGRU Diagrams (accuracy, loss, confusion):
Models are evaluated using:
- Accuracy
- Precision / Recall / F1 (class-wise)
- Confusion matrices
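All three metrics can be derived from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes; the `per_class_metrics` helper and the 2-class toy matrix are illustrative and are not results from this project.

```python
import numpy as np


def per_class_metrics(confusion):
    """Return (precision, recall, f1, accuracy) from a square confusion matrix.

    Convention assumed here: confusion[i, j] counts clips whose true class
    is i and whose predicted class is j.
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm).copy()
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how often each class truly occurs

    # Classes that were never predicted (or never occur) get a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy


# Toy 2-class example (not project data).
precision, recall, f1, accuracy = per_class_metrics([[8, 2], [1, 9]])
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), accuracy)
```

F1 is the harmonic mean of precision and recall, so a class with high precision but low recall still receives a modest F1 score.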
| Emotion | CNN (Precision/Recall/F1) | wav2vec2 (Precision/Recall/F1) | BiGRU (Precision/Recall/F1) |
|---|---|---|---|
| Anger | 0.81 / 0.63 / 0.71 | 0.72 / 0.85 / 0.78 | 0.74 / 0.81 / 0.78 |
| Disgust | 0.51 / 0.66 / 0.58 | 0.80 / 0.39 / 0.52 | 0.64 / 0.57 / 0.60 |
| Fear | 0.59 / 0.54 / 0.56 | 0.45 / 0.81 / 0.58 | 0.57 / 0.52 / 0.54 |
| Happiness | 0.67 / 0.57 / 0.62 | 0.73 / 0.65 / 0.69 | 0.61 / 0.60 / 0.61 |
| Neutral | 0.76 / 0.64 / 0.69 | 0.75 / 0.72 / 0.74 | 0.65 / 0.72 / 0.68 |
| Sadness | 0.55 / 0.72 / 0.62 | 0.63 / 0.41 / 0.50 | 0.57 / 0.59 / 0.58 |
- Disgust, Fear, and Sadness are frequently confused with one another (shared negative-valence acoustic traits)
- wav2vec2 shows strong performance on some classes but weaker balance on others, consistent with confusion-matrix discussion
- BiGRU shows notable fear ↔ sadness confusions, consistent with prosodic overlap
- PyTorch for model training and experimentation
- librosa for mel-spectrogram extraction and audio preprocessing (spectrogram models)
- wav2vec2 via the Hugging Face / PyTorch ecosystem for pretrained raw speech modeling
- Training performed on GPU (NVIDIA T4)
- Compute constraints (single T4 GPU) limited:
- number of experiments / hyperparameter sweeps
- sequence length / model depth exploration
- Overfitting observed, especially for larger-capacity models (wav2vec2, BiGRU)
- Stronger overfitting mitigation for wav2vec2 (e.g., freezing strategies, stronger regularization)
- Targeted augmentation to separate confusable classes (pitch/tempo perturbations)
- Hybrid/ensemble approaches combining CNN spectral learning + wav2vec2 representations
- Incorporate arousal/intensity features into the model







