This project implements and compares two main approaches for speech recognition:
- Wav2Vec2 Embeddings Approach: a pre-trained Wav2Vec2 transformer extracts audio features, which are then fed to a simple classifier (MLP or RNN)
- CNN Spectrogram Approach: audio is converted to mel spectrograms and passed to a convolutional neural network
The train.py script provides a unified interface for training all model types. Here's how to use it:

    python train.py --model <MODEL_TYPE> --data_type <DATA_TYPE> --data_dir <DATA_DIRECTORY> --config <CONFIG_FILE> --checkpoint_dir <CHECKPOINT_DIRECTORY>

Supported model types:
- mlp: Multi-Layer Perceptron (requires embeddings)
- rnn: Recurrent Neural Network (requires embeddings)
- cnn: Convolutional Neural Network (requires spectrograms)
- cnn_staging: Two-stage CNN system (requires spectrograms)
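
Each model type accepts only one kind of input, so a mismatched --model and --data_type pair is an easy mistake to make. The sketch below shows one way to catch it before training starts. The names `REQUIRED_DATA_TYPE` and `check_model_data_pair` are hypothetical and are not taken from train.py, which may handle this differently or not at all.

```python
# Required input type per model, copied from the model type descriptions above.
# The mapping and the helper are illustrative, not part of the project's code.
REQUIRED_DATA_TYPE = {
    "mlp": "embeddings",
    "rnn": "embeddings",
    "cnn": "spectrograms",
    "cnn_staging": "spectrograms",
}


def check_model_data_pair(model: str, data_type: str) -> None:
    """Raise ValueError if the model cannot consume the given data type."""
    if model not in REQUIRED_DATA_TYPE:
        raise ValueError(f"unknown model type: {model!r}")
    required = REQUIRED_DATA_TYPE[model]
    if data_type != required:
        raise ValueError(
            f"model {model!r} requires {required!r}, got {data_type!r}"
        )
```

Calling `check_model_data_pair("cnn", "embeddings")` raises a ValueError naming the expected input type, while a valid pair returns silently.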
Train MLP with embeddings:

    python train.py --model mlp --data_type embeddings --data_dir embeddings --config configs/mlp_config.json --checkpoint_dir checkpoints/MLP_experiment --epochs 40

Train CNN with spectrograms:

    python train.py --model cnn --data_type spectrograms --data_dir spectrograms --config configs/cnn_config.json --checkpoint_dir checkpoints/CNN_experiment --epochs 60

Train RNN with custom batch size:

    python train.py --model rnn --data_type embeddings --data_dir embeddings --config configs/rnn_config.json --checkpoint_dir checkpoints/RNN_experiment --epochs 50 --batch_size 64

Create JSON configuration files in the configs/ directory. Example for MLP:

    {
      "input_size": 768,
      "hidden_dim": 256,
      "num_classes": 12,
      "dropout": 0.3,
      "optimizer": "adam",
      "lr": 0.001,
      "weighted_loss": true
    }

Command-line arguments:
- --model: Model architecture (mlp/rnn/cnn/cnn_staging)
- --data_type: Input data type (embeddings/spectrograms)
- --data_dir: Directory containing preprocessed data files
- --config: Path to JSON configuration file
- --checkpoint_dir: Directory to save model checkpoints
- --epochs: Number of training epochs (default: 40)
- --batch_size: Training batch size (default: 32)
- --weighted_sampling: Enable weighted sampling for imbalanced datasets
- --device: Training device (auto/cpu/cuda, default: auto)
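
To make the arguments and the configuration file concrete, here is a minimal sketch of how such a command line could be parsed with argparse and how the JSON file could be read. It is not the actual contents of train.py. The function names `build_parser` and `load_config` are hypothetical, and marking the first five flags as required is an assumption; only the flag names, choices, and defaults come from the list above.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Train a speech recognition model")
    # Assumption: these five flags are mandatory, since every example passes them.
    parser.add_argument("--model", required=True,
                        choices=["mlp", "rnn", "cnn", "cnn_staging"])
    parser.add_argument("--data_type", required=True,
                        choices=["embeddings", "spectrograms"])
    parser.add_argument("--data_dir", required=True)
    parser.add_argument("--config", required=True)
    parser.add_argument("--checkpoint_dir", required=True)
    # Documented defaults: 40 epochs, batch size 32, automatic device choice.
    parser.add_argument("--epochs", type=int, default=40)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--weighted_sampling", action="store_true")
    parser.add_argument("--device", choices=["auto", "cpu", "cuda"], default="auto")
    return parser


def load_config(path: str) -> dict:
    """Read a JSON configuration file such as configs/mlp_config.json."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

With this sketch, `build_parser().parse_args()` would yield an object whose `config` attribute can be handed to `load_config`, giving a plain dictionary with keys such as `hidden_dim` and `lr`.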
Prepare your dataset by downloading the audio files to the data/ directory, then run the appropriate preprocessing script: precompute_embeddings.py for MLP/RNN models or precompute_spectrograms.py for CNN models.
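
As a quick reference, the choice of preprocessing script can be written as a lookup. This is only a sketch: `preprocessing_script_for` is a hypothetical helper, and mapping cnn_staging to precompute_spectrograms.py is an assumption based on that model requiring spectrograms.

```python
# Script names come from the paragraph above; the helper itself is illustrative.
PREPROCESSING_SCRIPT = {
    "mlp": "precompute_embeddings.py",
    "rnn": "precompute_embeddings.py",
    "cnn": "precompute_spectrograms.py",
    # Assumption: the two-stage CNN uses the same spectrogram preprocessing.
    "cnn_staging": "precompute_spectrograms.py",
}


def preprocessing_script_for(model: str) -> str:
    """Return the preprocessing script to run before training the given model."""
    try:
        return PREPROCESSING_SCRIPT[model]
    except KeyError:
        raise ValueError(f"unknown model type: {model!r}") from None
```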