Dysarthric Voice Command Classifier

A deep learning system for recognizing voice commands from speakers with dysarthria, built on HuBERT with curriculum learning. It reaches ~87% accuracy on dysarthric speakers from the TORGO dataset, measured with leave-one-speaker-out (LOSO) cross-validation.

Overview

Dysarthria is a motor speech disorder that affects the muscles used for speaking, making speech difficult to understand. This project provides an accessible voice command interface specifically designed for individuals with dysarthric speech patterns.

Key Features:

20 voice commands (10 digits + 10 directional/action commands)
HuBERT-large based architecture with learned attention pooling
Curriculum learning: pre-train on control speakers, fine-tune on dysarthric speakers
Gradio web interface for real-time inference (record or upload audio)
Comprehensive audio augmentation pipeline

Architecture

flowchart LR
    subgraph input [Input]
        Audio[Audio Waveform]
    end
    subgraph hubert [HuBERT Pretrained]
        FE[CNN Feature Extractor]
        FP[Feature Projection]
        Enc[Transformer Encoder x24]
    end
    subgraph custom [Custom Layers]
        AP[Attention Pooling]
        MLP[2-Layer MLP Classifier]
    end
    subgraph output [Output]
        Pred[Command Prediction]
    end
    Audio --> FE --> FP --> Enc --> AP --> MLP --> Pred

The model uses:

HuBERT-large (315M parameters) as the speech encoder
Attention pooling to learn which audio frames are most important for classification
2-layer MLP classifier with GELU activation and dropout

Development Journey

This project evolved through iterative improvements:

Initial Attempt (~20% accuracy)

HuBERT-large with linear classifier
Mean pooling over time dimension
No data augmentation
All 46 classes (digits + commands + radio alphabet)

Key Improvements

Replaced mean pooling with attention pooling — learns which frames matter most
Changed linear classifier to 2-layer MLP with GELU activation and dropout
Added comprehensive data augmentation pipeline (noise, pitch shift, time stretch, gain, SpecAugment)
Implemented curriculum learning (control speakers → dysarthric speakers)

Scaling Strategy

First validated the approach on the 10 digit classes (zero through nine)
Once that worked, expanded to 20 classes (10 digits + 10 directional/action commands)
The architecture can scale to the full 46 classes if needed

Project Structure

dysarthric-voice-cmds/
├── app/
│   └── gradio_app.py          # Web interface for inference
├── src/
│   ├── config.py              # Centralized configuration
│   ├── data/
│   │   ├── preprocessing.py   # TORGO dataset scanning
│   │   ├── dataset.py         # PyTorch Dataset class
│   │   └── augmentation.py    # Audio augmentation pipeline
│   ├── model/
│   │   ├── architecture.py    # HuBERT + classifier model
│   │   └── utils.py           # Model utilities
│   ├── training/
│   │   └── trainer.py         # Training and validation loops
│   └── inference/
│       └── predictor.py       # Inference wrapper class
├── scripts/
│   └── train.py               # 3-phase curriculum learning training script
├── outputs/                   # Trained models and results
├── model_cache/               # Cached HuBERT weights
├── main.ipynb                 # Main training notebook
├── FUTURE_WORK.md             # Future improvement ideas
└── requirements.txt           # Python dependencies

Installation

1. Clone the repository

git clone <repository-url>
cd dysarthric-voice-cmds

2. Install dependencies

pip install -r requirements.txt

Requirements:

torch, torchaudio
transformers
librosa
scikit-learn
pandas, matplotlib, seaborn
audiomentations
gradio>=4.0.0

3. Download the TORGO dataset

Download from: TORGO Database

Update the TORGO_ROOT path in src/config.py to point to your TORGO directory.

4. Download HuBERT model

The HuBERT model is downloaded automatically via ModelScope on the first training run and cached in model_cache/ for subsequent runs.

Alternatively, you can pre-cache it:

from modelscope import snapshot_download

model_dir = snapshot_download("facebook/hubert-large-ls960-ft", cache_dir="model_cache")

Quick Start

Training the Model

# Full training (Phase A + B + C with LOSO evaluation)
python scripts/train.py

# Skip LOSO evaluation for faster training
python scripts/train.py --skip-phase-c

# Customize training epochs
python scripts/train.py --epochs-a 10 --epochs-b 10 --epochs-loso 5

# Full options:
#   --epochs-a      Phase A epochs (control pretraining)
#   --epochs-b      Phase B epochs (dysarthric fine-tuning)
#   --epochs-loso   Phase C LOSO fine-tuning epochs per fold
#   --batch-size    Batch size
#   --seed          Random seed for reproducibility
#   --skip-phase-c  Skip LOSO evaluation
python scripts/train.py \
    --epochs-a 20 \
    --epochs-b 20 \
    --epochs-loso 10 \
    --batch-size 8 \
    --seed 42 \
    --skip-phase-c

Output artifacts:

outputs/phase_a_control_pretrained.pt - Phase A checkpoint
outputs/phase_b_curriculum_trained.pt - Phase B checkpoint (main model)
outputs/curriculum_fold{N}_{speaker}.pt - Per-fold models from Phase C
outputs/curriculum_cv_results.csv - Cross-validation results
outputs/curriculum_cv_results.json - JSON format results
outputs/label_mapping.json - Label encoding

Running the Web Interface

python app/gradio_app.py

Command-line options:

python app/gradio_app.py \
    --checkpoint outputs/phase_b_curriculum_trained.pt \
    --labels outputs/label_mapping.json \
    --model-cache model_cache/facebook/hubert-large-ls960-ft \
    --port 7860 \
    --share  # Creates a public URL

Inference in Python

from src.inference.predictor import VoiceCommandPredictor

predictor = VoiceCommandPredictor(
    checkpoint_path="outputs/phase_b_curriculum_trained.pt",
    label_mapping_path="outputs/label_mapping.json",
    model_cache_dir="model_cache/facebook/hubert-large-ls960-ft"
)

# From file
result = predictor.predict_file("path/to/audio.wav")

# From numpy array
result = predictor.predict(audio_array, sample_rate=16000)

print(f"Predicted: {result['label']} ({result['confidence']*100:.1f}%)")

Supported Commands

| Category | Commands |
|---|---|
| Digits | zero, one, two, three, four, five, six, seven, eight, nine |
| Actions | yes, no, up, down, left, right, forward, back, select, menu |

Training Methodology

Curriculum Learning

The model uses a three-phase curriculum learning approach:

Phase A: Control Speaker Pre-training

Train on non-dysarthric (control) speakers
Speech patterns closer to HuBERT's original training data
Establishes strong baseline representations

Phase B: Dysarthric Fine-tuning

Fine-tune on dysarthric speakers only
Lower learning rate to preserve Phase A knowledge
Adapts to dysarthric speech patterns

Phase C: Leave-One-Speaker-Out (LOSO) Evaluation

Cross-validation leaving one dysarthric speaker out for testing
Ensures model generalizes to unseen speakers
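
The fold structure of LOSO can be sketched in a few lines. This is a minimal illustration that uses the dysarthric speaker IDs listed in the Dataset section; the helper name `loso_folds` is hypothetical, and how scripts/train.py actually builds its folds is not shown in this README.

```python
# Dysarthric speaker IDs as listed in the Dataset section of this README.
DYSARTHRIC_SPEAKERS = ["F01", "F03", "F04", "M01", "M02", "M03", "M04", "M05"]


def loso_folds(speakers):
    """Yield (train_speakers, test_speaker) pairs, one fold per speaker."""
    for held_out in speakers:
        train = [s for s in speakers if s != held_out]
        yield train, held_out


folds = list(loso_folds(DYSARTHRIC_SPEAKERS))
for i, (train, test) in enumerate(folds, start=1):
    print(f"fold {i}: test={test}, train={train}")
```

Each speaker is held out exactly once, so eight dysarthric speakers give eight folds, and no utterance from the test speaker is seen during that fold's training.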

Sub-Phase Training (within each curriculum phase)

Each curriculum phase uses a two-stage training approach:

Warmup Stage: Train only the classifier head with encoder frozen
Fine-tuning Stage: Unfreeze top N transformer layers with differential learning rates
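
Differential learning rates are usually expressed as optimizer parameter groups. The sketch below builds such groups in plain Python, using the dysarthric-phase rates from the Hyperparameters table (5e-6 for the encoder, 5e-5 for the classifier). The function name and the parameter names in the example are hypothetical; the list-of-dicts layout is the one PyTorch optimizers accept for per-group options.

```python
def build_param_groups(named_params, encoder_lr=5e-6, head_lr=5e-5):
    """Split (name, param) pairs into two groups with different learning rates.

    Parameters whose name contains "encoder.layers." are treated as encoder
    weights; everything else is treated as part of the custom head.
    This name-based rule is an assumption made for illustration.
    """
    encoder, head = [], []
    for name, param in named_params:
        if "encoder.layers." in name:
            encoder.append(param)
        else:
            head.append(param)
    return [
        {"params": encoder, "lr": encoder_lr},
        {"params": head, "lr": head_lr},
    ]


# Strings stand in for tensors so the example runs without PyTorch.
named = [
    ("hubert.encoder.layers.23.attention.k_proj.weight", "w_enc"),
    ("attention_pool.attention.0.weight", "w_pool"),
    ("classifier.0.weight", "w_cls"),
]
groups = build_param_groups(named)
print(groups)
```

With a real model, the pairs would come from `model.named_parameters()` filtered to those with `requires_grad=True`, and the resulting list would be passed to an optimizer such as `torch.optim.AdamW`.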

Phase A (Control Pretraining):

Warmup: 1/3 of total epochs, classifier only
Fine-tune: 2/3 of total epochs, top layers unfrozen

Phase B (Dysarthric Fine-tuning):

Full fine-tuning with lower learning rates to preserve Phase A knowledge
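
The 1/3 warmup and 2/3 fine-tuning split for Phase A can be worked out as follows. The rounding rule here (floor, with at least one warmup epoch) is an assumption; the training script may round differently.

```python
def split_epochs(total_epochs: int) -> tuple:
    """Split a phase into (warmup_epochs, finetune_epochs) at roughly 1/3 : 2/3.

    Assumed rounding: warmup gets floor(total / 3) but at least one epoch,
    and fine-tuning gets the remainder.
    """
    if total_epochs < 2:
        raise ValueError("need at least 2 epochs to have both stages")
    warmup = max(1, total_epochs // 3)
    return warmup, total_epochs - warmup


print(split_epochs(20))  # (6, 14)
print(split_epochs(10))  # (3, 7)
```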

Command-Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--epochs-a` | Phase A epochs (control pretraining) | from config |
| `--epochs-b` | Phase B epochs (dysarthric fine-tuning) | from config |
| `--epochs-loso` | Phase C LOSO fine-tuning epochs per fold | 10 |
| `--batch-size` | Training batch size | from config |
| `--seed` | Random seed for reproducibility | 42 |
| `--skip-phase-c` | Skip LOSO evaluation | False |

Hyperparameters

| Parameter | Value |
|---|---|
| Batch size | 8 |
| Control pretraining LR | 1e-4 |
| Control fine-tune LR | 5e-5 |
| Dysarthric LR (classifier) | 5e-5 |
| Dysarthric LR (encoder) | 5e-6 |
| Weight decay | 0.01 |
| Unfrozen encoder layers | 4 (top) |
| Max audio length | 3 seconds |
| Sample rate | 16 kHz |
| Class weighting | Enabled (balanced) |

See src/config.py for all configurable parameters.
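
For the "balanced" class weighting above, a common definition is n_samples / (n_classes * class_count), which is what scikit-learn's `compute_class_weight` uses with `class_weight="balanced"`. The sketch below applies that formula to a toy label list. Whether this project computes its weights in exactly this way is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count), so rare classes count more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Toy, deliberately imbalanced label list (not TORGO statistics).
labels = ["yes"] * 6 + ["no"] * 3 + ["up"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

The rarest class ("up") receives the largest weight, so its errors contribute more to a weighted cross-entropy loss.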

Dataset: TORGO

The TORGO database contains acoustic and articulatory speech data from speakers with dysarthria.

Speakers:

8 dysarthric speakers (F01, F03, F04, M01-M05)
7 control speakers (FC01-FC03, MC01-MC04)

Microphone Types:

wav_arrayMic: Acoustic Magic array microphone (recommended, better quality)
wav_headMic: Head-mounted microphone

Data Augmentation

The training pipeline applies several augmentation techniques to improve robustness:

Time-domain augmentations (using audiomentations):

Gaussian noise injection
Pitch shifting (±2 semitones)
Time stretching (0.9x - 1.1x)
Time shifting (±20%)
Gain adjustment (±6 dB)

SpecAugment-style augmentation:

Time masking (zeros out random time segments)

All augmentation parameters are configurable in src/config.py.
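
Two of the augmentations above are simple enough to illustrate without any audio library. This is a sketch in plain Python, not the project's audiomentations pipeline: `apply_gain` and `time_mask` are hypothetical helpers, the 10% maximum mask width is an illustrative value, and the mask is applied directly to waveform samples for simplicity.

```python
import random


def apply_gain(samples, gain_db):
    """Scale amplitudes by a gain in decibels (factor = 10 ** (dB / 20))."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


def time_mask(samples, max_fraction=0.1, rng=random):
    """Zero out one random contiguous segment of at most max_fraction of the signal."""
    n = len(samples)
    width = rng.randint(0, int(n * max_fraction))
    start = rng.randint(0, n - width)
    return samples[:start] + [0.0] * width + samples[start + width:]


rng = random.Random(42)
audio = [0.5] * 100                      # toy constant "waveform"
louder = apply_gain(audio, 6.0)          # +6 dB is roughly a 2x amplitude
masked = time_mask(audio, max_fraction=0.1, rng=rng)
print(round(louder[0], 3))               # 0.998
print(masked.count(0.0))                 # number of zeroed samples, at most 10
```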

Model Architecture Details

HuBERT Encoder

Model: facebook/hubert-large-ls960-ft
Parameters: 315M total
Architecture: 24 transformer layers, 1024 hidden size
Pre-training: Self-supervised on Libri-Light (60k hours); this checkpoint is then fine-tuned for ASR on LibriSpeech 960h

The CNN feature extractor and feature projection layers are always frozen. During fine-tuning, only the top 4 transformer layers are unfrozen.
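
With 24 transformer layers, unfreezing the top 4 means layers 20 through 23 (zero-indexed) are trained. The sketch below makes that selection by parameter name. It assumes Hugging Face-style names of the form `encoder.layers.<i>.`; `is_trainable` is a hypothetical helper that covers encoder parameters only, not the custom pooling and classifier layers.

```python
import re

NUM_LAYERS = 24  # transformer layers in HuBERT-large
_LAYER_RE = re.compile(r"encoder\.layers\.(\d+)\.")


def is_trainable(param_name: str, num_unfrozen: int = 4) -> bool:
    """Decide whether an encoder parameter is trained, based on its name."""
    match = _LAYER_RE.search(param_name)
    if match is None:
        # Feature extractor, feature projection and other non-layer weights stay frozen.
        return False
    return int(match.group(1)) >= NUM_LAYERS - num_unfrozen


print(is_trainable("encoder.layers.19.attention.k_proj.weight"))    # False
print(is_trainable("encoder.layers.20.attention.k_proj.weight"))    # True
print(is_trainable("feature_extractor.conv_layers.0.conv.weight"))  # False
```

In PyTorch, a predicate like this would typically be applied by looping over `named_parameters()` and setting `param.requires_grad` accordingly.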

Attention Pooling

Instead of simple mean pooling, the model uses learned attention pooling:

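import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Pool frame-level features into one vector using learned attention weights."""

    def __init__(self, hidden_size):
        super().__init__()  # required before registering submodules
        # Small scoring network: one scalar score per frame
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.Tanh(),
            nn.Linear(hidden_size // 4, 1)
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, frames, hidden_size)
        scores = self.attention(hidden_states)            # (batch, frames, 1)
        attention_weights = torch.softmax(scores, dim=1)  # normalize over frames
        return (hidden_states * attention_weights).sum(dim=1)  # (batch, hidden_size)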

This allows the model to learn which audio frames are most relevant for classification.
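
A small worked example shows how attention pooling relates to mean pooling. The sketch uses plain Python lists instead of tensors, and the per-frame scores are hand-picked stand-ins for the output of the scoring network, not learned values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, scores):
    """Weighted sum of frame vectors, with weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


# Three toy frames with 2 features each.
frames = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]

equal = attention_pool(frames, [0.0, 0.0, 0.0])    # equal scores reduce to mean pooling
peaked = attention_pool(frames, [0.0, 0.0, 10.0])  # the third frame dominates

print(equal)
print(peaked)
```

When all scores are equal the result is the plain average of the frames; when one score is much larger, the pooled vector is close to that single frame.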

Classification Head

self.classifier = nn.Sequential(
    nn.Linear(1024, 512),      # hidden_size -> hidden_size/2
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(512, num_labels)  # -> num_classes
)
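
The custom layers are tiny compared with the encoder. The arithmetic below counts their parameters for the 20-class setup, assuming the default `nn.Linear` behavior of including a bias term.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameters in a fully connected layer with bias: weights plus biases."""
    return in_features * out_features + out_features


hidden, num_labels = 1024, 20

# Attention pooling: Linear(1024, 256) -> Tanh -> Linear(256, 1)
pooling = linear_params(hidden, hidden // 4) + linear_params(hidden // 4, 1)

# Classifier: Linear(1024, 512) -> GELU -> Dropout -> Linear(512, 20)
classifier = linear_params(hidden, 512) + linear_params(512, num_labels)

print(pooling)               # 262657
print(classifier)            # 535060
print(pooling + classifier)  # 797717
```

Under these assumptions the pooling and classification layers together add under one million parameters on top of the encoder.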

Results

Cross-validation accuracy: ~87% on dysarthric speakers
Evaluation method: Leave-one-speaker-out (LOSO) cross-validation

Output artifacts in outputs/:

confusion_matrix.png - Confusion matrix visualization
per_class_accuracy.png - Per-class accuracy breakdown
curriculum_per_speaker_accuracy.png - Accuracy by speaker
curriculum_cv_results.json - Detailed cross-validation results

API Reference

VoiceCommandPredictor

from src.inference.predictor import VoiceCommandPredictor

predictor = VoiceCommandPredictor(
    checkpoint_path: str,      # Path to .pt checkpoint
    label_mapping_path: str,   # Path to label_mapping.json
    model_cache_dir: str,      # Path to cached HuBERT model
    device: str = None         # 'cuda', 'cpu', or auto-detect
)

# Predict from numpy array
result = predictor.predict(audio: np.ndarray, sample_rate: int) -> dict

# Predict from file
result = predictor.predict_file(audio_path: str) -> dict

# Returns: {'label': str, 'confidence': float, 'probabilities': dict}
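
The returned dictionary makes it straightforward to reject low-confidence predictions before acting on a command. This is a minimal sketch, assuming `probabilities` maps label strings to floats as the return comment above suggests; `accept_command`, `top_k` and the 0.6 threshold are hypothetical and not part of the project.

```python
from typing import Optional


def accept_command(result: dict, min_confidence: float = 0.6) -> Optional[str]:
    """Return the predicted label, or None if the model is not confident enough.

    `result` is assumed to look like
    {'label': str, 'confidence': float, 'probabilities': dict}.
    The 0.6 default is an illustrative threshold, not a tuned value.
    """
    if result["confidence"] < min_confidence:
        return None
    return result["label"]


def top_k(result: dict, k: int = 3) -> list:
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(result["probabilities"].items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Hand-written example result (not real model output).
example = {
    "label": "left",
    "confidence": 0.55,
    "probabilities": {"left": 0.55, "right": 0.30, "yes": 0.15},
}
print(accept_command(example))  # None, because 0.55 is below the 0.6 threshold
print(top_k(example, k=2))
```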

HuBERTForCommandClassification

from src.model.architecture import HuBERTForCommandClassification

model = HuBERTForCommandClassification(
    model_path: str,           # Path to HuBERT model
    num_labels: int,           # Number of classes
    hidden_size: int = 1024,
    classifier_dropout: float = 0.1,
    freeze_encoder: bool = True,
    freeze_feature_extractor: bool = True
)

# Forward pass
outputs = model(
    input_values: torch.Tensor,           # (batch, seq_len)
    attention_mask: torch.Tensor = None,
    labels: torch.Tensor = None,          # For computing loss
    class_weights: torch.Tensor = None
) -> dict  # {'logits': Tensor, 'loss': Tensor (if labels provided)}

# Unfreeze top N encoder layers for fine-tuning
model.unfreeze_encoder(num_layers=4)

TORGOCommandDataset

from src.data.dataset import TORGOCommandDataset

dataset = TORGOCommandDataset(
    df: pd.DataFrame,          # DataFrame with file_path, label_id columns
    feature_extractor,         # Wav2Vec2FeatureExtractor
    config,                    # Config object
    max_length: int = 48000,   # 3 seconds at 16kHz
    target_sr: int = 16000,
    augment: bool = False      # Enable augmentation for training
)
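
The default `max_length` of 48000 is simply 3 seconds times 16000 samples per second. Forcing every clip to that length can be sketched as below. The helper name is hypothetical, and both the right-side zero padding and the truncation from the end are assumptions about how the dataset class behaves.

```python
def pad_or_truncate(samples, max_length=48000):
    """Force a waveform to exactly max_length samples.

    Longer clips are cut from the end; shorter clips are right-padded with
    zeros. Both choices are assumptions made for illustration.
    """
    if len(samples) >= max_length:
        return samples[:max_length]
    return samples + [0.0] * (max_length - len(samples))


sample_rate, max_seconds = 16000, 3
max_length = sample_rate * max_seconds
print(max_length)                                        # 48000
print(len(pad_or_truncate([0.1] * 20000, max_length)))   # 48000
print(len(pad_or_truncate([0.1] * 60000, max_length)))   # 48000
```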

Future Work

See FUTURE_WORK.md for planned improvements:

Dysarthria detection for adaptive ASR routing
ESPNet integration for larger-scale ASR

Acknowledgments

TORGO Dataset

Rudzicz, F., Namasivayam, A.K., Wolff, T. (2012) The TORGO database of acoustic and articulatory speech from speakers with dysarthria. Language Resources and Evaluation, 46(4), pages 523-541.

HuBERT

Hsu, W.N., Bolte, B., Tsai, Y.H.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A. (2021) HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv:2106.07447

License

This project is licensed under the MIT License.
