This project classifies speakers from speech signals. The model reaches 97.06% accuracy on the task, ranking in the top 6% of 1000 teams.
The original dataset is VoxCeleb. Preprocessing converts each utterance into a sequence of mel-frequency feature vectors. First, the signal is converted to the frequency domain with the Discrete Fourier Transform to obtain a spectrum. Next, a filter bank, a log transform, and a Discrete Cosine Transform are applied to construct each vector. A window of 128 consecutive vectors is then chosen at random from each sequence. The processed dataset is stored here.
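The preprocessing steps described above can be sketched in NumPy as follows. This is a minimal illustration and not the project's actual preprocessing code: the sample rate (16 kHz), frame size (512 samples), hop (160 samples), Hamming window, 40 filters, and 40 output coefficients are all assumptions, and the function names `mel_features` and `random_window` are hypothetical. Only the window length of 128 comes from the description.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge of the triangle
            bank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[i, k] = (right - k) / (right - center)
    return bank

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping the first n_coeffs terms."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mel_features(signal, sample_rate=16000, n_fft=512, hop=160, n_filters=40, n_coeffs=40):
    """DFT -> filter bank -> log -> DCT, one feature vector per frame."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hamming(n_fft)                   # assumed window function
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # power spectrum per frame
    energies = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)               # small offset avoids log(0)
    return dct_ii(log_energies, n_coeffs)

def random_window(features, length=128, rng=None):
    """Pick `length` consecutive feature vectors at a random offset.

    Sequences shorter than `length` are returned unchanged; padding is left out
    of this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    if len(features) <= length:
        return features
    start = rng.integers(0, len(features) - length + 1)
    return features[start : start + length]
```

A call such as `random_window(mel_features(waveform))` would then yield one fixed-length training segment, where `waveform` is a 1-D array of audio samples.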
The model is based on Conformer and is trained on the preprocessed dataset.
To enhance model performance, the following techniques were employed:
- Additive Margin Softmax
- Cosine learning rate scheduler
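Both techniques can be illustrated with a short, framework-free sketch. The code below is not taken from the project: the function names, the scale of 30, the margin of 0.35, and the base learning rate are assumed values chosen for illustration. Additive Margin Softmax subtracts a fixed margin from the cosine similarity of the true class before scaling and applying cross-entropy, and a cosine schedule decays the learning rate along half a cosine period.

```python
import math

def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive Margin Softmax loss for a single example.

    embedding:     list of floats, the speaker embedding
    class_weights: list of weight vectors, one per speaker class
    target:        index of the true speaker
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    e = normalize(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [sum(a * b for a, b in zip(e, normalize(w))) for w in class_weights]
    # Subtract the margin from the true class only, then scale all logits.
    logits = [scale * (c - margin if j == target else c) for j, c in enumerate(cosines)]
    # Numerically stable cross-entropy: log-sum-exp minus the target logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def cosine_lr(step, total_steps, base_lr=1e-3, min_lr=0.0):
    """Learning rate at `step`, decaying from base_lr to min_lr along a cosine curve."""
    progress = min(1.0, step / max(1, total_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Because the margin lowers only the true-class logit, the loss for a given example is never smaller than with plain softmax on the same cosines, which pushes embeddings of the same speaker closer together. In a real training loop these would typically be replaced by the equivalent tensor operations and scheduler of the deep learning framework in use.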
The preprocessed VoxCeleb dataset is the primary data source for this project and is accessible here.
For detailed implementation and usage instructions, please refer to the provided code.