Downstream tasks

Introduction

Self-supervised learning (SSL) pretrained models cannot demonstrate their effectiveness through the pretraining loss alone; their performance has to be evaluated on downstream tasks. Hence, it is crucial to collect a wide range of downstream tasks and make the evaluation pipeline as easy as possible to speed up the development cycle.

We develop several downstream tasks for evaluating SSL models, each defined by a sub-folder under this downstream folder. We further select representative ones to form the SUPERB benchmark, detailed in the SUPERB Benchmark section below.

How to use

I. General requirement

  1. Clone the repository and install dependencies
  2. See General usage below to get a sense of the conceptual workflow

II A. Run the developed tasks

  1. Optional: Register your customized pretrained model
  2. Follow the task-specific usages

II B. Develop new tasks

  1. Check Add new downstream tasks. Pull requests are always welcome. Thanks!

General usage

All of the downstream tasks follow the command pattern below, with a few task-specific adjustments detailed in the task-specific sections that follow.

Start a new downstream training experiment

# general pattern
python3 run_downstream.py -m train -n ExpName -u UpstreamName -d DownstreamName
# a directly runnable example without data preparation
python3 run_downstream.py -m train -n ExpName -u fbank -d example
  • -m or --mode specifies the train/evaluate mode

  • -u or --upstream specifies the upstream pretrained model.

    • The available upstreams can be listed with -h
  • -d or --downstream specifies the downstream task.

    • The available downstreams can be listed with -h
    • Each available downstream task has its corresponding folder under downstream/. Eg. -d asr means we are using the task defined in downstream/asr/
    • example is a pseudo downstream task which is useful for testing the upstream model or as an initial template for developing a new downstream task
  • -f or --upstream_trainable enables finetuning the upstream model on the downstream task. Default: false

  • -n or --name specifies the experiment name, all the files related to this run will be saved into expdir=result/downstream/{args.name}. (You can also use -p or --expdir to directly specify the path of expdir.)

    • command
    • config file
    • Tensorboard event file
    • checkpoints, each contains
      • arguments
      • config
      • latest optimization step
      • latest optimization epoch
      • state_dict of models, optimizer, scheduler
  • -c or --config specifies the config file path. If not specified, the config.yaml under each downstream folder is used by default. Eg. downstream/asr/config.yaml

  • -o or --override can override any argument or config field from the command line and has the highest priority. Please refer to the override function for its definition. Here is an example overriding 3 fields defined in this config file:

    -o "config.optimizer.lr=1.0e-3,,config.optimizer.name='AdamW',,config.runner.eval_dataloaders=['dev', 'test']"

Resume training from a checkpoint

# [ckpt] can be the path of a checkpoint or its residing directory.
python3 run_downstream.py -m train -e [ckpt]
  • The -e or --past_exp option is designed to use the exact same arguments and config as the previous training experiment except the training/evaluation mode. (Each checkpoint will save arguments and config.)
  • -o can be used to further override the arguments & configs loaded from the checkpoint, since -o is at the highest priority.

Fault-tolerant training

for i in $(seq 1 100); do
    python3 run_downstream.py -m train -n ExpName -u fbank -d example -a
done
  • The -a option stands for automatic resuming: it resumes from the latest checkpoint found in the expdir directory, or starts a new training experiment when there is none.

run_while.sh under the root directory of the repo is a helper wrapper for this. For any COMMAND you wish to run in a while loop, you can simply run:

./run_while.sh COMMAND

Eg.

./run_while.sh python3 run_downstream.py -a -m train -n ExpName -u fbank -d example

Please remember to use -a when wrapping with run_while.sh; otherwise you will re-launch a new training experiment on every iteration, which is a disaster, especially for the Tensorboard event files.

Distributed training

We wrap the model with DistributedDataParallel. By inserting -m torch.distributed.launch --nproc_per_node {GPU_NUM} between python3 and run_downstream.py, you can directly turn the above training commands into distributed training. Currently only ASR and ASV support distributed training.

First specify your GPU number

gpus=16;
distributed="-m torch.distributed.launch --nproc_per_node ${gpus}";

Simple training

python3 $distributed run_downstream.py -m train -n ExpName -u fbank -d example

Resume training

# The $distributed value should be the same as in the original training experiment.
# [ckpt] can be the path of a checkpoint or its residing directory.
python3 $distributed run_downstream.py -m train -e [ckpt]

Fault-tolerant training

for i in $(seq 1 100); do
    python3 $distributed run_downstream.py -m train -n ExpName -u fbank -d example -a
    # When one of the spawned processes dies, sometimes not all processes are terminated synchronously.
    # You might need to ensure all the spawned processes are killed here.
    # The `killall` linux command is suitable for this.
done

Test a checkpoint

The following test-clean is an example of the testing dataset name; the supported names are defined by each downstream expert's get_dataloader. Typically dev and test are supported for tasks/datasets with a standard split.

Preferable: Use the same args & config as training time

# [ckpt] can be the path of a checkpoint or its residing directory.
python3 run_downstream.py -m evaluate -t "test-clean" -e [ckpt]
  • The -e or --past_exp option is designed to use the exact same arguments and config as the previous training experiment except the training/evaluation mode. (Each checkpoint will save arguments and config.)
  • -o can be used to further override the arguments & configs loaded from the checkpoint, since -o is at the highest priority.

Alternative: Use another set of args & config for testing

Most of the time the above command is enough. But if you find overriding the args & configs stored in the trained checkpoint one-by-one cumbersome, you can first prepare a new set of args & config and only load the model weights from the trained checkpoint.

# [ckpt] can be the path of a checkpoint or its residing directory.
# [upstream], [downstream] and other args should be taken care of by the user and won't be loaded from the checkpoint.
# [config] is the newly prepared testing config
python3 run_downstream.py -m evaluate -t "test-clean" -i [ckpt] -u [upstream] -d [downstream] -c [config] -n TestExpName
  • The -i or --init_ckpt option is designed to load a checkpoint without overwriting args & config, which enables flexible configuration for the testing stage, while the user should take care to use the same upstream & downstream arguments as at training time. Since the command and configs are all saved into expdir, you can double-check the settings against the files in the expdir of the previous training experiment.

Test the distributed trained checkpoint

Only the training part is powered by DistributedDataParallel, and we save all the model state_dicts without the DDP wrapper. That is, after DDP training, you can always evaluate the checkpoint using the testing commands documented above (on a single GPU).
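
The gist of why this works, as a hedged sketch (not the toolkit's actual saving code): before saving, a DDP-wrapped model's inner module is unwrapped, so the stored state_dict carries no DDP prefix and loads directly on a single GPU.

from torch.nn.parallel import DistributedDataParallel

def unwrapped_state_dict(model):
    # Hypothetical helper: if the model is wrapped by DDP, save the inner module's state_dict
    # so that the checkpoint loads directly on a single GPU at evaluation time
    if isinstance(model, DistributedDataParallel):
        return model.module.state_dict()
    return model.state_dict()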

SUPERB Benchmark

In this section we detail the commands for reproducing the paper SUPERB: Speech processing Universal PERformance Benchmark.

PR: Phoneme Recognition

Specified by the command -d ctc

Prepare data

  1. Download LibriSpeech and unzip. Only train-clean-100, dev-clean, and test-clean are needed.

  2. Check the prepared file structure

    LibriSpeech/
    ├── train-clean-100/
    ├── dev-clean/
    └── test-clean/
  3. Change the path in downstream/ctc/libriphone.yaml

    downstream_expert:
        corpus:
            path: "root directory of LibriSpeech"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/libriphone.yaml

Testing

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

ASR: Automatic Speech Recognition

Specified by the command -d asr

Prepare data

  1. Download LibriSpeech and unzip. Only train-clean-100, dev-clean, and test-clean are needed.

  2. Check the prepared file structure

    LibriSpeech/
    ├── train-clean-100/
    ├── dev-clean/
    └── test-clean/
  3. Change the path in downstream/asr/config.yaml

    downstream_expert:
        datarc:
            libri_root: "root directory of LibriSpeech"
  4. Prepare the lengths for utterances in LibriSpeech's train-clean-100, dev-clean and test-clean:

    # Official LibriSpeech is in .flac format
    python3 preprocess/generate_len_for_bucket.py -i "root directory of LibriSpeech" -o data/librispeech -a .flac --n_jobs 12

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d asr

Testing without LM

python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt

Testing with KenLM + LibriSpeech official 4-gram LM

I. Prepare Decoding Environment
  1. Install KenLM

    • Please follow the official installation instructions of KenLM instead of the one documented in flashlight or wav2letter
    • If you encounter issues when installing KenLM, you might need to install some extra dependencies.
  2. Install flashlight python bindings

    • Only the python bindings are required, not the entire flashlight toolkit
  3. Download LibriSpeech official 4-gram LM

  4. Download character-based lexicon

  5. Make sure your fairseq version contains the following commit

II. Test
python3 run_downstream.py -m evaluate -t "test-clean" -e result/downstream/ExpName/dev-clean-best.ckpt \
    -o "\
        config.downstream_expert.datarc.decoder_args.decoder_type='kenlm',, \
        config.downstream_expert.datarc.decoder_args.kenlm_model='/path/to/4-gram.arpa.gz',, \
        config.downstream_expert.datarc.decoder_args.lexicon='/path/to/librispeech_lexicon.lst' \
       "

KS: Keyword Spotting

Specified by the command -d speech_commands

Prepare data

  1. Download data: speech_commands_v0.01.tar.gz and speech_commands_test_set_v0.01.tar.gz

  2. Download and unpack Speech Commands

    mkdir -p /CORPORA_DIR/speech_commands_v0.01
    tar zxf speech_commands_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_v0.01
  3. Download and unpack Speech Commands test set

    mkdir -p /CORPORA_DIR/speech_commands_test_set_v0.01
    tar zxf speech_commands_test_set_v0.01.tar.gz -C /CORPORA_DIR/speech_commands_test_set_v0.01
  4. Change the following path in downstream/speech_commands/config.yaml to yours

    downstream_expert:
        datarc:
            speech_commands_root: "/CORPORA_DIR/speech_commands_v0.01/"
            speech_commands_test_root: "/CORPORA_DIR/speech_commands_test_set_v0.01/"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d speech_commands

Testing

The testing is done on-the-fly with training since it is not costly. Use the following command to get the testing result from the best-dev checkpoint

python3 utility/get_best_dev.py result/downstream/ExpName/log.log

Compatible with Speech Commands v2

The implementation is directly compatible with Speech Commands v2. You can enable this by simply changing the train/test dataset. All other steps remain the same.

QbE: Query-by-Example Spoken Term Detection

Specified by the command -d quesst14_dtw. This task does not require training. We extract representations and run dynamic time warping (DTW) on them.

Prepare data

  1. Download QUESST14

    export CORPORA_DIR="the root directory of all your datasets"    
    wget https://speech.fit.vutbr.cz/files/quesst14Database.tgz
    tar zxf quesst14Database.tgz -C $CORPORA_DIR
  2. Change the path in downstream/quesst14_dtw/config.yaml

    downstream_expert:
        datarc:
            dataset_root: "CORPORA_DIR/quesst14Database"

Dynamic Time Warping (DTW)

# The default dist_fn if not specified is "cosine_exp"
# as it yields the best result for almost all upstreams
# Supported dist_fn: cosine, cityblock, euclidean, cosine_exp

dist_fn=cosine;

# dev
python3 run_downstream.py -m evaluate -t "dev" -u fbank -d quesst14_dtw \
    -n ExpName_dev -o "config.downstream_expert.dtwrc.dist_method='$dist_fn'"

# test
python3 run_downstream.py -m evaluate -t "test" -u fbank -d quesst14_dtw \
    -n ExpName_test -o "config.downstream_expert.dtwrc.dist_method='$dist_fn'"

Scoring

export S3PRL_DIR=/YOUR/S3PRL/PATH
cd $CORPORA_DIR/quesst14Database/scoring

# dev
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_dev \
    groundtruth_quesst14_dev -10

# test
./score-TWV-Cnxe.sh $S3PRL_DIR/result/downstream/ExpName_test \
    groundtruth_quesst14_eval -10

IC: Intent Classification - Fluent Speech Commands

Specified by the command -d fluent_commands

Prepare data

  1. Download and unzip data

  2. Check the prepared file structure

    fluent_speech_commands_dataset
    ├── wavs
    │   └── speakers
    ├── data
    │   └── [*.csv]
    ├── readme.md
    └── Fluent Speech Commands Public License.pdf
  3. Change the following paths under downstream/fluent_commands/config.yaml to your own:

    downstream_expert:
        datarc:
            file_path: "root directory of fluent_speech_commands_dataset"

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d fluent_commands

Testing

The testing is done on-the-fly with training since it is not costly. Use the following command to get the testing result from the best-dev ckpt.

python3 utility/get_best_dev.py result/downstream/ExpName/log.log

SF: End-to-end Slot Filling

Prepare data

  1. Optional: Preprocess Audio SNIPS from the official version.

    # Official Audio SNIPS is in mp3 format; we will convert it to wav
    # We need mp3 support for the sox package (not supported by default)
    # First ensure you have sox installed
    # Then install the mp3 support
    
    # apt-get
    apt-get install libsox-fmt-mp3
    
    # or yum install
    yum install soxr sox-plugins-freeworld -y
    
    # after installing the mp3 support
    CORPORA_DIR="the root directory of all your datasets"
    ./preprocess/snips_prepare_data.sh $CORPORA_DIR
  2. Download the preprocessed Audio SNIPS and unzip

  3. Change the paths in downstream/ctc/snips.yaml

    downstream_expert:
        corpus:
            path: "CORPORA_DIR/SNIPS"
        text:
            slots_file: "CORPORA_DIR/SNIPS/slots.txt"

Train

python3 run_downstream.py -n ExpName -m train -u fbank -d ctc -c downstream/ctc/snips.yaml

Test

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/dev-best.ckpt

SID: Speaker Identification

Prepare data

  1. Download the dataset from Voxceleb1 and unzip it.

    voxceleb1_root="/CORPORA_DIR/VoxCeleb1/"
    mkdir -p $voxceleb1_root/dev
    mkdir -p $voxceleb1_root/test
    
    # prepare dev
    cd $voxceleb1_root/dev/
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partad
    cat vox1_dev* > vox1_dev_wav.zip
    unzip vox1_dev_wav.zip
    
    # prepare test
    cd $voxceleb1_root/test/
    wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_test_wav.zip
    unzip vox1_test_wav.zip
  2. Check prepared file structure

    Voxceleb1/
    ├── dev/
    │   └── wav/
    │       └──Speaker id folders
    └── test/
        └── wav/
            └──Speaker id folders
  3. Change the path in downstream/voxceleb1/config.yaml

    downstream_expert:
        datarc:
            file_path: "root directory of VoxCeleb1"    

Train

python3 run_downstream.py -n ExpName -m train -u fbank -d voxceleb1

Test

The testing is done on-the-fly with training since it is not costly. Use the following command to get the testing result from the best-dev ckpt.

python3 utility/get_best_dev.py result/downstream/ExpName/log.log

ASV: Automatic Speaker Verification

Prepare data

  1. Follow steps 1 and 2 in SID

  2. Change the path in downstream/sv_voxceleb1/config.yaml

    downstream_expert:
        datarc:
            file_path: "root directory of VoxCeleb1"    

Training

python3 run_downstream.py -n ExpName -m train -u fbank -d sv_voxceleb1

Testing

As there is no official validation set, we save checkpoints every 20000 updates and report the best EER. Evaluating checkpoints takes a long time, so we don't test them on-the-fly on the same GPU. Instead, we save all checkpoints and test them in parallel on another GPU during training. The following command runs a loop that monitors whether any new checkpoint has been saved and evaluates it if one is found. Already evaluated checkpoints are skipped, as they have result logs under their expdir.

./run_while.sh "./downstream/sv_voxceleb1/test_expdirs.sh result/downstream/ExpName; sleep 1800;"

Report numbers

The lowest number should be reported; it appears at the bottom of the output.

./downstream/sv_voxceleb1/report.sh result/downstream/ExpName

SD: Speaker Diarization

Prepare data

  1. Simulate Libri2Mix Data for Diarization

    S3PRL_DIR="root directory of your cloned s3prl"
    CORPORA_DIR"root directory of all your datasets, which hopefully contains LibriSpeech (not necessary)"
    
    git clone https://github.com/ftshijt/LibriMix.git
    cd LibriMix
    bash generate_librimix.sh $CORPORA_DIR
    python3 scripts/prepare_diarization.py \
        --target_dir $S3PRL_DIR/downstream/diarization/data \
        --source_dir $CORPORA_DIR/Libri2Mix/wav16k/max/metadata

Train

python3 run_downstream.py -n ExpName -m train -u fbank -d diarization

Test

python3 run_downstream.py -m evaluate -e result/downstream/ExpName/best-states-dev.ckpt

Scoring

  1. Clone dscore

    git clone https://github.com/ftshijt/dscore
  2. Change the path in downstream/diarization/score.sh

    dscore_dir="root directory of your cloned dscore"
  3. Run scoring

    ./downstream/diarization/score.sh result/downstream/ExpName downstream/diarization/data/test
  4. The scoring results will look like

    One should report the lowest number at the bottom: the column represents DER, and the bottom-most row always has the lowest DER, which is the number to report.

  5. Re-check the scoring results

    Running the above scoring script takes time. If you want to re-check the scored results, use

    ./downstream/diarization/report.sh result/downstream/ExpName

ER: Emotion Recognition

Prepare data

  1. Download the dataset and unzip it. You will need to fill out a form on the official IEMOCAP website to get the dataset.

  2. Preprocess

    python3 ./downstream/emotion/IEMOCAP_preprocess.py "/path/to/IEMOCAP"
  3. Change the path in downstream/emotion/config.yaml

    downstream_expert:
        datarc:
            root: "root directory of IEMOCAP"

Train

IEMOCAP provides 5 splits of data: Session1, Session2, Session3, Session4 and Session5. Conventionally, each split is selected in turn as the test set while the model is trained on the other 4 splits. That is, 5 rounds of training and testing are required, and the 5 test scores are averaged to report the final number. The -v option controls which split is reserved as the test set.

# -v: fold1, fold2, fold3, fold4, fold5
python3 run_downstream.py -n ExpName -m train -u fbank -d emotion -v fold1

Test

The testing is done on-the-fly with training since it is not costly. Use the following command to get the testing result from the best-dev ckpt.

python3 utility/get_best_dev.py result/downstream/ExpName/log.log

Cross validation

for test_fold in fold1 fold2 fold3 fold4 fold5;
do
    python3 run_downstream.py -n ExpName_$test_fold -m train -u fbank -d emotion -v $test_fold
    python3 utility/get_best_dev.py result/downstream/ExpName_$test_fold/log.log
done
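
After the five runs finish, average the five test scores to obtain the final number to report. A minimal sketch of this bookkeeping step, assuming you collect the per-fold scores printed by get_best_dev.py by hand (the zeros below are placeholders):

# Placeholder values: fill in the five test scores reported by get_best_dev.py
fold_scores = {"fold1": 0.0, "fold2": 0.0, "fold3": 0.0, "fold4": 0.0, "fold5": 0.0}

final_score = sum(fold_scores.values()) / len(fold_scores)
print(f"Averaged score over {len(fold_scores)} folds: {final_score:.4f}")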

More tasks

Phone Classification

Specified by the command -d (with different variants):

  • phone_linear
  • phone_linear_concat
  • phone_1hidden

Prepare data

  1. Download the raw LibriSpeech corpus and unzip.

    cd /path/to/put/data
    wget https://www.openslr.org/resources/12/train-clean-100.tar.gz
    tar zxvf train-clean-100.tar.gz
  2. After extracting the file, you should have the following file structure:

    LibriSpeech
    ├── train-clean-100
    └── README.TXT
  3. unzip phone labels:

    cd data/cpc_phone
    unzip converted_aligned_phones.zip
  4. (Optional) Enable bucketing to increase training efficiency & speed; this will generate a directory called data/len_for_bucket:

    python preprocess/generate_len_for_bucket.py --data_root "your_libri_root" --output_path ./data/
  5. Change the following paths under phone_*/config.yaml to your own:

    libri_root: '/media/andi611/1TBSSD/LibriSpeech/'
    bucket_file: 'data/len_for_bucket'

Training

python run_downstream.py -m train -u baseline -d phone_linear -n ExpName
python run_downstream.py -m train -u baseline -d phone_linear_concat -n ExpName
python run_downstream.py -m train -u baseline -d phone_1hidden -n ExpName

Testing

Testing is done on-the-fly with training since it is not costly. Use the following command to get the testing result from the best-dev ckpt:

python utility/get_best_dev.py result/downstream/ExpName/log.log

Trainable Spoken Term Detection - SWS2013

Specified by the command -d sws2013

Prepare data

  1. Download the SWS2013

  2. Specify the place to unpack the database

    export CORPORA_DIR=/YOUR/CORPORA/DIR/PATH
  3. Unpack the tarball

    tar zxf sws2013Database.tgz -C $CORPORA_DIR
  4. Further unpack the scoring script tarball

    tar zxf $CORPORA_DIR/sws2013Database_dev_eval/scoring_atwv_sws2013_full.tgz -C $CORPORA_DIR/sws2013Database_dev_eval
  5. Change the following path in sws2013/config.yaml to yours

    sws2013_root: /YOUR/CORPORA/DIR/PATH/sws2013Database_dev_eval
    sws2013_scoring_root: /YOUR/CORPORA/DIR/PATH/sws2013Database_dev_eval/scoring_atwv_sws2013_full

Train

python3 run_downstream.py -m train -u fbank -d sws2013 -n ExpName

Intent Classification - SNIPS

  • Variants to this task: None
  • Prepare data:
    1. Prepare the Audio file:
    cd /path/to/put/data
    wget https://shangwel-asr-evaluation.s3-us-west-2.amazonaws.com/audio_slu_v3.zip
    unzip audio_slu_v3.zip
    2. Prepare the NLU annotation file:
    git clone https://github.com/aws-samples/aws-lex-noisy-spoken-language-understanding.git
    cp -r aws-lex-noisy-spoken-language-understanding/* audio_slu
    3. After extracting the file, you should have the following file structure:
    audio_slu
    ├── data
    │   └── nlu_annotation
    │       └── [*.csv]
    ├── license
    ├── audio_Aditi
    ...
    └── audio_Salli
    4. Change the following paths under audio_snips/config.yaml to your own and specify the speakers you want in the training and test sets:
    file_path: /home/raytz/Disk/data/audio_slu
    train_speakers: 
      - Aditi
      ...
      - Salli
    test_speakers:
      - Aditi
      ...
      - Salli
  • Example run command (with a pseudo upstream):
python3 run_downstream.py -m train -u baseline -d audio_snips -n HelloWorld

Intent Classification - ATIS

  • Variants to this task: None
  • Prepare data:
    1. Prepare the dataset (under the folder /groups/public):
    # first sftp to the battleship
    lcd /path/to/put/data
    get -r /groups/public/atis
    2. After downloading the dataset, you should have the following file structure:
    atis
    ├── test
    ├── nlu_iob
    ├── train
    ├── dev
    ├── all.trans.txt
    ├── all.iob.trans.txt
    └── slot_vocabs.txt
    3. Change the following paths under atis/config.yaml to your own:
    file_path: /home/raytz/Disk/data/atis
  • Example run command (with a pseudo upstream):
python3 run_downstream.py -m train -u baseline -d atis -n HelloWorld

Spoken sentiment analysis - CMU-MOSEI

  • Prepare data:

    1. Download and unzip data to the path you want:
    cd /path/to/put/data
    wget http://immortal.multicomp.cs.cmu.edu/raw_datasets/CMU_MOSEI.zip
    unzip CMU_MOSEI.zip
    2. After extracting the file, you should have the following file structure. You only need to keep the folder "Audio":
    CMU_MOSEI
    ├── Audio
    │   └── FULL
    ├── Videos
    │   └── ..
    ├── ..
    3. Change the following path under mosei/segment_audio.sh to your own:
    python3 ./utility/segment_audio.py /home/godiclili/Audio
    4. Segment the audio by running mosei/segment_audio.sh:
    bash segment_audio.sh
    5. Change the following path under mosei/config.yaml to your own:
    data_dir: /home/godiclili/Audio
    6. Specify the number of classes (2 or 7, default 2) for classification under mosei/config.yaml:
    num_class: 2
  • Example run command (with a pseudo upstream):

python3 run_downstream.py -m train -u baseline -d mosei -n HelloWorld

Source Separation

  • Data preparation: Simulate Libri2Mix data for source separation. For source separation, we only need the 16 kHz, min condition. (Usually source separation uses the 8 kHz min condition, but due to the constraint of the pretrained models we use 16 kHz.)
# download the script and simulate Libri2Mix dataset
git clone https://github.com/HuangZiliAndy/LibriMix.git
cd LibriMix 
./generate_librimix.sh storage_dir

# prepare train, dev and test data in Kaldi format
python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part train-100 storage_dir downstream/separation_stft/data

python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part dev storage_dir downstream/separation_stft/data

python downstream/separation_stft/scripts/LibriMix/data_prepare.py \
--part test storage_dir downstream/separation_stft/data
  • train:

Train with STFT magnitude as the upstream.

python3 run_downstream.py \
       --mode train --config downstream/separation_stft/configs/cfg.yaml \
       --downstream separation_stft \
       --upstream stft_mag \
       --upstream_model_config 'upstream/log_stft/stft_mag.yaml' \
       --expdir experiment/separation_stft/stft_mag

Train with wav2vec2 as the upstream.

python3 run_downstream.py \
       --mode train --config downstream/separation_stft/configs/cfg.yaml \
       --downstream separation_stft \
       --upstream wav2vec2 \
       --expdir experiment/separation_stft/wav2vec2

We include one upstream called stft_mag, which simply extracts the STFT magnitude. s3prl supports different acoustic features in the baseline upstream, but since this task predicts STFT masks, the setup for the STFT features and the desired STFT masks must be identical.

In other words, (1) when using STFT magnitude as the upstream, make sure the STFT parameters in downstream/separation_stft/configs/cfg.yaml and upstream/log_stft/stft_mag.yaml are identical; (2) when using other upstreams such as wav2vec2, make sure the hop_length in downstream/separation_stft/configs/cfg.yaml matches the upstream's stride (in this config, a hop_length of 320 corresponds to the 20 ms stride of wav2vec2).
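
A small sanity check of that arithmetic (the variable names below are illustrative, not actual config fields): a 20 ms stride at the 16 kHz sampling rate corresponds to a hop_length of 320 samples.

# Illustrative check that the downstream hop_length matches the upstream stride
sample_rate = 16000        # Hz, the Libri2Mix 16 kHz condition
upstream_stride_ms = 20    # e.g. the frame stride of wav2vec 2.0
hop_length = 320           # value assumed to be set in downstream/separation_stft/configs/cfg.yaml

assert hop_length == sample_rate * upstream_stride_ms // 1000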

  • test:
python3 run_downstream.py \
       --mode evaluate \
       --past_exp experiment/separation_stft/stft_mag/modelbest.ckpt \
       --config downstream/separation_stft/configs/cfg.yaml \
       --downstream separation_stft \
       --upstream stft_mag \
       --upstream_model_config 'upstream/log_stft/stft_mag.yaml' \
       --expdir experiment/separation_stft/stft_mag

The model is expected to report SI-SDRi on the test set.

Add new downstream tasks

Each downstream task is defined by a self-contained folder under this downstream folder; for example, the ASR task is defined in downstream/asr. Once a new folder is placed under this downstream folder, say downstream/blabla/, you can run this new downstream task with the -d blabla option of the run_downstream.py script.

By self-contained we mean that all the downstream-specific materials should live under your task folder, including the definitions of the dataset, dataloader, model, and loss. How you define these materials is completely up to you; the only requirement is to provide an expert.py file with a DownstreamExpert nn.Module at the root of your downstream folder, implementing 3 methods: get_dataloader, forward, and log_records.

The fastest way to learn how the framework works is to run a minimal example, so we provide a pseudo task downstream/example/, which can always be run with:

python3 run_downstream.py -u fbank -d example -n HelloWorld

Hence, you can refer to downstream/example/expert.py for the minimum requirements and the implementation specification. You can also use downstream/example/ as an initial template for developing a new downstream task.
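
For orientation only, here is a heavily simplified, hypothetical sketch of what an expert.py can look like. downstream/example/expert.py remains the authoritative reference; the constructor signature, the exact arguments passed to forward and log_records, and the records handling below are assumptions and may differ from the current codebase.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class DownstreamExpert(nn.Module):
    # A hypothetical minimal expert: a linear classifier on utterance-level pooled upstream features

    def __init__(self, upstream_dim, downstream_expert, expdir, **kwargs):
        # `downstream_expert` is assumed to be the dict parsed from this task's config.yaml
        super().__init__()
        self.datarc = downstream_expert["datarc"]
        self.model = nn.Linear(upstream_dim, self.datarc["num_class"])
        self.objective = nn.CrossEntropyLoss()

    def get_dataloader(self, split):
        # Return a DataLoader for the requested split, e.g. "train", "dev", "test".
        # The Dataset definition should also live inside this downstream folder.
        raise NotImplementedError("define your Dataset here and wrap it in a DataLoader")

    def forward(self, split, features, labels, records, **kwargs):
        # `features` is assumed to be a list of per-utterance upstream representations (frames x upstream_dim)
        pooled = torch.stack([f.mean(dim=0) for f in features])
        logits = self.model(pooled)
        labels = torch.LongTensor(labels).to(logits.device)
        loss = self.objective(logits, labels)
        records.setdefault("acc", []).extend((logits.argmax(dim=-1) == labels).cpu().tolist())
        return loss

    def log_records(self, split, records, logger, global_step, **kwargs):
        # `logger` is assumed to be a Tensorboard SummaryWriter-like object
        average_acc = sum(records["acc"]) / len(records["acc"])
        logger.add_scalar(f"example/{split}-acc", average_acc, global_step=global_step)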

Note 1

Please use relative imports in your downstream folder, in case we rename or move the downstream folder in the future.
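
For instance, assuming your task folder contains a model.py next to expert.py (hypothetical file names), the import inside expert.py would look like:

# downstream/blabla/expert.py  (hypothetical task folder)
from .model import Model          # relative import: keeps working if the folder is renamed or moved
# rather than the absolute form: from downstream.blabla.model import Model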

Note 2

If you want to train your downstream task with distributed training, you should take care to use DistributedSampler when providing the training dataloader in your expert file.
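
A minimal sketch of the idea, assuming your expert builds the training dataloader itself (the helper name, dataset, and batch size are placeholders):

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, batch_size):
    # Use DistributedSampler only when torch.distributed has been initialized
    # (i.e. when launched with torch.distributed.launch); the sampler then handles shuffling
    sampler = DistributedSampler(dataset) if dist.is_initialized() else None
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        shuffle=(sampler is None),
    )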