
tambat22/SonoNet


SonoNet – Speech Emotion Recognition Analytics

A deep learning project for Speech Emotion Recognition (SER) that benchmarks three complementary neural approaches: CNN (mel-spectrogram), fine-tuned wav2vec 2.0 (raw waveform), and BiGRU + Self-Attention (mel-spectrogram sequence) on the CREMA-D dataset.


Project Goals

  • Classify speech clips into 6 emotion classes:
    • Anger, Disgust, Fear, Happiness, Neutral, Sadness
  • Compare how different modeling paradigms behave on SER:
    • Spectral (image-like) learning via CNNs
    • Self-supervised pretrained speech representations via wav2vec 2.0
    • Sequential temporal modeling via BiGRU + attention
  • Improve robustness under dataset constraints using preprocessing + augmentation + regularization

Dataset

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)

  • 7,442 audio recordings
  • 91 actors (48 male, 43 female), ages 20–74
  • 6 categorical emotions: anger, disgust, fear, happy, neutral, sadness
  • 12 fixed sentence prompts (identified by a 3-letter acronym in filenames)
  • 4 emotion intensity levels: Low (LO), Medium (MD), High (HI), Unspecified (XX)

Note: The dataset includes intensity in filenames, but the focus here is 6-way emotion classification.
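
The filename fields mentioned above can be turned into labels with a few lines of Python. This is a minimal sketch, assuming the standard CREMA-D naming scheme `ActorID_Sentence_Emotion_Intensity.wav` (for example `1001_DFA_ANG_XX.wav`); the function name `parse_crema_filename` and the `AudioWAV/` folder in the usage line are illustrative, not taken from the repository.

```python
from pathlib import Path

# CREMA-D emotion and intensity codes as they appear in filenames.
EMOTIONS = {
    "ANG": "anger",
    "DIS": "disgust",
    "FEA": "fear",
    "HAP": "happy",
    "NEU": "neutral",
    "SAD": "sadness",
}
INTENSITIES = {"LO": "low", "MD": "medium", "HI": "high", "XX": "unspecified"}


def parse_crema_filename(path):
    """Split a name like '1001_DFA_ANG_XX.wav' into its four fields."""
    actor_id, sentence, emotion, intensity = Path(path).stem.split("_")
    return {
        "actor_id": int(actor_id),
        "sentence": sentence,
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
    }


info = parse_crema_filename("AudioWAV/1001_DFA_ANG_XX.wav")
print(info["emotion"], info["intensity"])  # anger unspecified
```

Only the `emotion` field is needed for the 6-way task; the actor ID is useful if a speaker-independent train/test split is wanted.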


High-Level System Architecture

End-to-end flow

  1. Ingest audio from CREMA-D
  2. Standardize duration (pad/trim)
  3. Feature representation (depends on model family)
  4. Train model
  5. Evaluate using accuracy, precision/recall/F1, and confusion matrices

Two input paradigms used

  • Mel-spectrogram pipeline → used by CNN and BiGRU
  • Raw waveform pipeline → used by wav2vec 2.0

Exploratory Data Analysis Highlights

  • Dataset is largely balanced across emotion classes
    • Most classes have ~1,271 clips
    • Neutral has slightly fewer (~1,087)
  • Mean audio duration is about 2.54 seconds, with per-emotion averages roughly 2.34–2.78 seconds
  • Because the average clip in every emotion class is shorter than 3 seconds, a fixed 3-second target window was chosen for the spectrogram-based models

Preprocessing and Data Preparation

Spectrogram-based preprocessing (CNN + BiGRU)

Audio is converted into a fixed-size log-mel spectrogram:

  • Sample rate: 16 kHz
  • Target duration: 3 seconds
    • Pad shorter clips
    • Truncate longer clips
  • Mel-spectrogram parameters:
    • 128 mel bins
    • FFT window (n_fft): 2048
    • Hop length: 512
  • Convert power spectrogram → decibel (log) scale
  • Pad/trim in time axis to produce consistent tensor shapes for batch training
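
The steps above can be sketched as follows. This is an illustration built from the listed parameters, not the project's actual code: zero-padding at the end of the clip and `ref=np.max` for the decibel conversion are assumptions, and the helper names are hypothetical. `librosa` is imported inside `log_mel` so the padding helper can be used on its own.

```python
import numpy as np

SAMPLE_RATE = 16_000
TARGET_SECONDS = 3
N_MELS, N_FFT, HOP_LENGTH = 128, 2048, 512


def pad_or_trim(waveform, target_len=SAMPLE_RATE * TARGET_SECONDS):
    """Zero-pad short clips at the end (assumed); truncate long clips."""
    if len(waveform) >= target_len:
        return waveform[:target_len]
    return np.pad(waveform, (0, target_len - len(waveform)))


def log_mel(waveform):
    """Return a (128, 94) log-mel spectrogram in dB for one clip."""
    import librosa  # imported lazily so pad_or_trim works without it

    power = librosa.feature.melspectrogram(
        y=pad_or_trim(waveform).astype(np.float32), sr=SAMPLE_RATE,
        n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS,
    )
    return librosa.power_to_db(power, ref=np.max)  # ref choice is an assumption


# With librosa's default centered framing: 1 + 48000 // 512 = 94 frames.
EXPECTED_FRAMES = 1 + (SAMPLE_RATE * TARGET_SECONDS) // HOP_LENGTH
```

Because every clip is first forced to 48,000 samples, every spectrogram has the same 128 × 94 shape and can be stacked into a batch without further padding.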

Raw waveform preprocessing (wav2vec 2.0)

  • Audio is passed as raw waveform to wav2vec2’s feature extractor / frontend
  • Avoids manual feature engineering (e.g., MFCCs or mel-spectrograms) by leveraging pretrained representations

Data Augmentation and Regularization Techniques

To reduce overfitting and increase robustness:

  • SpecAugment (spectrogram augmentation)
    • Random time masking and frequency masking
  • Gaussian noise injection
  • Normalization
  • Dropout (notably effective when the rate increases in deeper CNN layers)
  • Early stopping (stop when validation loss fails to improve)
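
As an illustration of the masking and noise steps, here is a NumPy sketch that applies one frequency mask, one time mask, and optional Gaussian noise to a single spectrogram. The mask widths, the noise level, and the choice to fill masked cells with the spectrogram minimum are assumptions for the example, not the project's settings.

```python
import numpy as np


def spec_augment(spec, rng, max_freq_mask=16, max_time_mask=12, noise_std=0.0):
    """Apply one frequency mask, one time mask, and optional Gaussian noise.

    spec: array of shape (n_mels, n_frames). Mask widths and noise_std are
    illustrative values, not the project's settings.
    """
    out = spec.astype(np.float32).copy()
    n_mels, n_frames = out.shape
    fill = out.min()  # masked cells are set to the quietest value (assumed)

    if noise_std > 0:
        out += rng.normal(0.0, noise_std, size=out.shape).astype(np.float32)

    # Frequency mask: blank out a random band of adjacent mel bins.
    f_width = int(rng.integers(0, max_freq_mask + 1))
    f_start = int(rng.integers(0, n_mels - f_width + 1))
    out[f_start:f_start + f_width, :] = fill

    # Time mask: blank out a random run of adjacent frames.
    t_width = int(rng.integers(0, max_time_mask + 1))
    t_start = int(rng.integers(0, n_frames - t_width + 1))
    out[:, t_start:t_start + t_width] = fill
    return out


rng = np.random.default_rng(0)
clean = rng.normal(size=(128, 94)).astype(np.float32)
augmented = spec_augment(clean, rng, noise_std=0.05)
```

Augmentation of this kind is normally applied only to training batches, so that validation and test metrics are measured on unmodified spectrograms.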

Models Used

SonoNet benchmarks three model families. Each is designed to capture a different aspect of emotional information in speech.


1) Spectrogram CNN (Baseline)

Input: log-mel spectrogram (image-like time–frequency map)

Motivation: Use CNNs to learn hierarchical spectral patterns associated with emotional cues (pitch-energy distribution, timbre, and spectral-temporal activations).

Architecture:

  • 4 convolutional blocks
    • Convolution + ReLU
    • Max pooling
    • Batch normalization
    • Progressively increasing dropout in downstream layers
  • Final classifier for 6 emotion categories
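
A PyTorch sketch of this layout is shown below. Only the block structure (four blocks of convolution + ReLU, max pooling, batch normalization, and dropout that increases with depth) comes from the description above; the channel widths, kernel size, dropout rates, and the global-average-pooling classifier are assumptions made to keep the example short.

```python
import torch
from torch import nn


def conv_block(in_ch, out_ch, dropout):
    # Order follows the list above: conv + ReLU, max pool, batch norm, dropout.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.BatchNorm2d(out_ch),
        nn.Dropout2d(dropout),
    )


class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        channels = [1, 16, 32, 64, 128]   # assumed widths
        dropouts = [0.1, 0.2, 0.3, 0.4]   # assumed rates, increasing with depth
        self.features = nn.Sequential(*[
            conv_block(channels[i], channels[i + 1], dropouts[i]) for i in range(4)
        ])
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels[-1], n_classes)
        )

    def forward(self, x):  # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x))


model = SpectrogramCNN().eval()
logits = model(torch.randn(2, 1, 128, 94))  # two 128 x 94 log-mel inputs
```

The output has one row of 6 logits per clip; training would pair it with `nn.CrossEntropyLoss`.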

Results:

  • Iteration showed clear tradeoffs:
    • smaller CNNs underfit
    • deeper CNNs overfit
    • fixed dropout alone wasn’t sufficient
  • A strong improvement came from combining:
    • max pooling
    • progressive dropout
    • SpecAugment + controlled Gaussian noise injection
  • Final reported performance: ~65% accuracy on both training and test sets, with stable convergence after augmentation was introduced.

CNN Diagrams (accuracy, loss, confusion):


2) Fine-tuned wav2vec 2.0 (Transformer-based)

Input: raw waveform

Motivation: Leverage self-supervised pretrained speech representations to reduce manual feature engineering and improve emotion recognition by transferring robust acoustic representations into the SER task.

Architecture (high level):

  • Feature extractor: 7 stacked temporal convolution layers (downsampling + low-level acoustic pattern learning)
  • Feature projection: linear map into transformer embedding space
  • Transformer encoder: multi-head self-attention + feedforward blocks (with positional info, dropout, layer norm)
  • Classification head: linear layer mapping to 6 outputs
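
To make the downsampling in the feature extractor concrete, the sketch below computes how many feature frames the 7 convolution layers produce for a given number of input samples. It assumes the kernel sizes and strides of the published wav2vec 2.0 base configuration; the README does not state which checkpoint was fine-tuned, and the 3-second input length is used only for illustration.

```python
# Kernel sizes and strides of the 7 convolution layers in the wav2vec 2.0
# base feature extractor (assumed to match the checkpoint used here).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of feature frames produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1  # unpadded 1-D convolution
    return length


print(num_frames(16_000))      # 49 frames for 1 second at 16 kHz
print(num_frames(3 * 16_000))  # 149 frames for a 3-second clip
```

The strides multiply to 320 samples, so each frame passed to the transformer encoder covers a step of about 20 ms of audio.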

Results:

  • Reported performance: ~69% test accuracy, showing strong transfer from pretrained wav2vec2 representations.
  • Training converged quickly, but fine-tuning overfit:
    • overfitting emerged around epochs 6–8
    • training accuracy rose rapidly from ~50% to ~90% by the later epochs
    • validation accuracy plateaued at ~68%
    • the train–validation gap widened to roughly 20–22 percentage points
  • Confusion-matrix analysis highlights recurring confusions among prosodically similar emotions (notably sadness ↔ fear), and difficulty with disgust.
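
Early stopping, listed among the regularization techniques, is one way to act on a validation plateau like the one described above. The sketch below is a generic patience-based version, not the project's training loop; the class name, the patience value, and the validation losses are made up for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without a lower validation loss."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses that bottom out and then rise.
val_losses = [1.40, 1.10, 0.98, 0.93, 0.91, 0.92, 0.95, 0.99, 1.04]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        break

print(epoch, stopper.best_epoch)  # 8 5
```

In practice the model weights from `best_epoch` would be saved and restored, so the reported test accuracy comes from the checkpoint before the gap starts to widen.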

wav2vec 2.0 Diagrams (accuracy, loss, confusion):


3) BiGRU + Self-Attention (Sequential Spectrogram Model)

Input: log-mel spectrogram interpreted as a sequence of spectral vectors over time

Motivation: Model temporal dynamics of emotion (prosody, pitch contours, energy shifts) using recurrent layers, then use attention to focus on the most emotionally salient frames.

Architecture:

  • 3 stacked Bidirectional GRU layers
    • Captures both forward + backward context across the utterance
  • Self-attention pooling
    • Produces a weighted context vector
    • Emphasizes emotionally salient segments (e.g., stressed syllables, abrupt tonal changes)
  • Dropout
  • Fully connected classifier
  • Training techniques referenced include:
    • AdamW with weight decay
    • Gradient clipping
    • ReduceLROnPlateau
    • SpecAugment
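
A compact PyTorch sketch of this architecture follows. The three bidirectional GRU layers, attention pooling, dropout, and linear classifier come from the list above; the hidden size, dropout rate, learning rate, and the use of a single linear scoring layer for the attention weights are assumptions.

```python
import torch
from torch import nn


class BiGRUAttention(nn.Module):
    def __init__(self, n_mels=128, hidden=128, n_classes=6, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True,
                          bidirectional=True, dropout=dropout)
        self.score = nn.Linear(2 * hidden, 1)   # one relevance score per frame
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, spec):                     # spec: (batch, n_mels, n_frames)
        frames = spec.transpose(1, 2)            # (batch, n_frames, n_mels)
        states, _ = self.gru(frames)             # (batch, n_frames, 2 * hidden)
        weights = torch.softmax(self.score(states), dim=1)  # sums to 1 over time
        context = (weights * states).sum(dim=1)  # weighted context vector
        return self.fc(self.dropout(context)), weights.squeeze(-1)


model = BiGRUAttention()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=3)
# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# and once per epoch: scheduler.step(validation_loss)

model.eval()
logits, attention = model(torch.randn(2, 128, 94))
```

Returning the attention weights alongside the logits makes it possible to plot which frames the model treated as most salient for a given clip.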

Results:

  • Reported performance: ~65% validation accuracy, consistent with the plotted curves.
  • Confusion matrix shows strong diagonal dominance with notable confusions such as fear → sadness (a common SER challenge due to overlapping low-arousal/negative-valence acoustic traits).
  • Training dynamics (from the report’s curves):
    • training loss decreases steadily (approx. 1.6 → 0.8 over ~50 epochs)
    • training accuracy rises to ~70%
    • test/validation accuracy plateaus at ~65%
    • the small train–test gap indicates mild overfitting, which regularization + SpecAugment helped contain

BiGRU Diagrams (accuracy, loss, confusion):


Evaluation Metrics

Models are evaluated using:

  • Accuracy
  • Precision / Recall / F1 (class-wise)
  • Confusion matrices
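
All three metrics can be derived from a confusion matrix. The NumPy sketch below shows the arithmetic on a made-up 2-class matrix; the function name is hypothetical, and the convention that rows are true classes and columns are predicted classes is an assumption about how the matrices are laid out.

```python
import numpy as np


def per_class_metrics(confusion):
    """Precision, recall, and F1 per class, plus overall accuracy.

    confusion[i, j] counts clips whose true class is i and predicted class is j.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)   # column sums: clips predicted as each class
    actual = confusion.sum(axis=1)      # row sums: clips that belong to each class
    precision = np.divide(true_pos, predicted,
                          out=np.zeros_like(predicted), where=predicted > 0)
    recall = np.divide(true_pos, actual,
                       out=np.zeros_like(actual), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    accuracy = true_pos.sum() / confusion.sum()
    return precision, recall, f1, accuracy


# Made-up example: 10 clips per class, 17 of 20 classified correctly.
precision, recall, f1, accuracy = per_class_metrics([[8, 2], [1, 9]])
```

F1 is the harmonic mean of precision and recall, so a precision of 0.81 with a recall of 0.63 gives 2 · 0.81 · 0.63 / (0.81 + 0.63) ≈ 0.71.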

Metrics

| Emotion   | CNN (Precision / Recall / F1) | wav2vec2 (Precision / Recall / F1) | BiGRU (Precision / Recall / F1) |
|-----------|-------------------------------|------------------------------------|---------------------------------|
| Anger     | 0.81 / 0.63 / 0.71            | 0.72 / 0.85 / 0.78                 | 0.74 / 0.81 / 0.78              |
| Disgust   | 0.51 / 0.66 / 0.58            | 0.80 / 0.39 / 0.52                 | 0.64 / 0.57 / 0.60              |
| Fear      | 0.59 / 0.54 / 0.56            | 0.45 / 0.81 / 0.58                 | 0.57 / 0.52 / 0.54              |
| Happiness | 0.67 / 0.57 / 0.62            | 0.73 / 0.65 / 0.69                 | 0.61 / 0.60 / 0.61              |
| Neutral   | 0.76 / 0.64 / 0.69            | 0.75 / 0.72 / 0.74                 | 0.65 / 0.72 / 0.68              |
| Sadness   | 0.55 / 0.72 / 0.62            | 0.63 / 0.41 / 0.50                 | 0.57 / 0.59 / 0.58              |

Common Error Patterns Observed

  • Disgust / Fear / Sadness are frequently confusable (shared negative-valence acoustic traits)
  • wav2vec2 is strong on some classes but unbalanced on others: disgust and sadness have low recall (0.39 and 0.41), while fear has high recall (0.81) but low precision (0.45), consistent with the confusion-matrix discussion
  • BiGRU shows notable fear ↔ sadness confusions, consistent with prosodic overlap

Tooling and Implementation Stack

  • PyTorch for model training and experimentation
  • librosa for mel-spectrogram extraction and audio preprocessing (spectrogram models)
  • wav2vec2 via the Hugging Face / PyTorch ecosystem for pretrained raw speech modeling
  • Training performed on GPU (NVIDIA T4)

Limitations

  • Compute constraints (single T4 GPU) limited:
    • number of experiments / hyperparameter sweeps
    • sequence length / model depth exploration
  • Overfitting observed, especially for larger-capacity models (wav2vec2, BiGRU)

Future Improvements

  • Stronger overfitting mitigation for wav2vec2 (e.g., freezing strategies, stronger regularization)
  • Targeted augmentation to separate confusable classes (pitch/tempo perturbations)
  • Hybrid/ensemble approaches combining CNN spectral learning + wav2vec2 representations
  • Incorporate arousal/intensity features into the model
