Query by Vocal Imitation (QVIM) enables users to search a database of sounds via a vocal imitation of the desired sound. This offers sound designers an intuitively expressive way of navigating large sound effects databases.
This repository contains our code for the challenge.
Important Dates
- Challenge start: April 1, 2025
- Challenge end: June 15, 2025
- Challenge results announcement: July 15, 2025
For more details, please have a look at our website.
This repository contains the modified baseline system for the AES AIMLA Challenge 2025. The architecture and the training procedure are based on "Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining" (DCASE2024 Workshop).
- The training loop is implemented using PyTorch and PyTorch Lightning.
- Logging is implemented using Weights and Biases.
- It uses a MobileNetV3 (MN) pretrained on AudioSet to encode audio recordings.
- The system is trained on VimSketch and evaluated on the public evaluation dataset described on our website.
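Conceptually, retrieval in a QVIM system reduces to nearest-neighbour search in the encoder's embedding space: both the vocal imitation and every reference sound are embedded, and references are ranked by cosine similarity. The sketch below illustrates only this ranking step, using random vectors as stand-ins for real MN embeddings; all names here are illustrative, not the repository's API.

```python
import numpy as np

def cosine_similarity_matrix(queries, refs):
    # Row-normalize both sets; a dot product then gives pairwise cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    return q @ r.T

# Stand-ins for embeddings of 3 vocal imitations and 5 reference sounds.
rng = np.random.default_rng(0)
imitation_emb = rng.standard_normal((3, 128))
reference_emb = rng.standard_normal((5, 128))

sim = cosine_similarity_matrix(imitation_emb, reference_emb)
# For each imitation, rank the reference sounds by similarity (best first).
ranking = np.argsort(-sim, axis=1)
print(ranking.shape)  # (3, 5)
```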
Prerequisites
- Linux (tested on Ubuntu 24.04)
- conda, e.g., Miniconda3-latest-Linux-x86_64.sh
- Clone this repository:

  ```bash
  git clone https://github.com/qvim-aes/qvim-baseline.git
  ```
- Create and activate a conda environment with Python 3.10:

  ```bash
  conda env create -f environment.yml
  conda activate qvim-ensemble
  ```
- Install 7z, e.g.,

  ```bash
  # on Linux
  sudo apt install p7zip-full

  # on Windows
  conda install -c conda-forge 7zip
  ```

  For Linux users: do not use the conda package p7zip; it is based on the outdated 7-Zip version 16.02, and extracting the dataset requires a more recent release.
- If you have not used Weights and Biases for logging before, you can create a free account. On your machine, run

  ```bash
  wandb login
  ```

  and copy your API key from this link to the command line.
All training is handled by the unified script src/qvim_mn_baseline/train.py. It is highly configurable and supports multiple model architectures and data augmentation strategies.
To see all available options, run:

```bash
export PYTHONPATH=$(pwd)/src
python src/qvim_mn_baseline/train.py --help
```

The following command replicates the original baseline without advanced augmentations:
```bash
python src/qvim_mn_baseline/train.py \
  --model_type mobilenet \
  --project "qvim-experiments" \
  --model_save_path "checkpoints"
```

The following example uses the powerful PaSST model and enables the "light" augmentation profile:
```bash
python src/qvim_mn_baseline/train.py \
  --model_type passt \
  --use_augmentations true \
  --aug_profile light \
  --batch_size 12 \
  --n_epochs 50 \
  --project "qvim-experiments" \
  --model_save_path "checkpoints"
```

The following command trains the BEATs model using the "full" augmentation profile, which includes SpecMix:
```bash
python src/qvim_mn_baseline/train.py \
  --model_type beats \
  --beats_checkpoint_path "path/to/your/BEATs_iter3.pt" \
  --use_augmentations true \
  --aug_profile full \
  --batch_size 8 \
  --n_epochs 75 \
  --project "qvim-experiments" \
  --model_save_path "checkpoints"
```

| Model Name | MRR (exact match) |
|---|---|
| random | 0.0444 |
| MN baseline | 0.2726 |
| MN + Light Aug | 0.2835 |
| PaSST | 0.1502 |
| PANNs | 0.1577 |
| BEATs | 0.2309 |
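The MRR (mean reciprocal rank) values above average, over all queries, the reciprocal of the rank at which each query's correct item appears. The snippet below is an illustrative implementation of the metric itself, not the repository's evaluation code:

```python
import numpy as np

def mean_reciprocal_rank(similarities, correct_indices):
    """MRR for exact-match retrieval.

    similarities: (n_queries, n_items) score matrix, higher is better.
    correct_indices: for each query, the index of its correct item.
    """
    # Order items per query by descending score.
    order = np.argsort(-similarities, axis=1)
    reciprocal_ranks = []
    for q, target in enumerate(correct_indices):
        rank = int(np.where(order[q] == target)[0][0]) + 1  # 1-based rank
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Toy example: 2 queries over 4 items.
sims = np.array([
    [0.9, 0.1, 0.3, 0.2],  # correct item 0 ranked 1st -> RR = 1.0
    [0.2, 0.8, 0.5, 0.1],  # correct item 2 ranked 2nd -> RR = 0.5
])
print(mean_reciprocal_rank(sims, [0, 2]))  # 0.75
```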
For questions or inquiries, please contact rahul.peter@aalto.fi or vivek.mohan@aalto.fi.
```bibtex
@inproceedings{Greif2024,
  author    = "Greif, Jonathan and Schmid, Florian and Primus, Paul and Widmer, Gerhard",
  title     = "Improving Query-By-Vocal Imitation with Contrastive Learning and Audio Pretraining",
  booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
  address   = "Tokyo, Japan",
  month     = "October",
  year      = "2024",
  pages     = "51--55"
}
```