Query by Vocal Imitation (QVIM) enables users to search a database of sounds via a vocal imitation of the desired sound. This offers sound designers an intuitively expressive way of navigating large sound effects databases.
This repository contains our code for the challenge.
Important Dates
- Challenge start: April 1, 2025
- Challenge end: June 15, 2025
- Challenge results announcement: July 15, 2025
For more details, please have a look at our website.
This repository contains the modified baseline system for the AES AIMLA Challenge 2025. The architecture and the training procedure are based on "Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining" (DCASE2024 Workshop).
- The training loop is implemented using PyTorch and PyTorch Lightning.
- Logging is implemented using Weights and Biases.
- It uses a MobileNetV3 (MN) pretrained on AudioSet to encode audio recordings.
- The system is trained on VimSketch and evaluated on the public evaluation dataset described on our website.
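Conceptually, retrieval in a QVIM system reduces to nearest-neighbour search in the encoder's embedding space: both the vocal imitation and every reference sound are embedded, and references are ranked by cosine similarity. The sketch below illustrates only this ranking step, using random vectors as stand-ins for real MN embeddings; all names here are illustrative, not the repository's API.

```python
import numpy as np

def cosine_similarity_matrix(queries, refs):
    # Row-normalize both sets; a dot product then gives pairwise cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    return q @ r.T

# Stand-ins for embeddings of 3 vocal imitations and 5 reference sounds.
rng = np.random.default_rng(0)
imitation_emb = rng.standard_normal((3, 128))
reference_emb = rng.standard_normal((5, 128))

sim = cosine_similarity_matrix(imitation_emb, reference_emb)
# For each imitation, rank the reference sounds by similarity (best first).
ranking = np.argsort(-sim, axis=1)
print(ranking.shape)  # (3, 5)
```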
Prerequisites
- Linux (tested on Ubuntu 24.04)
- conda, e.g., Miniconda3-latest-Linux-x86_64.sh
- Clone this repository:

  ```bash
  git clone https://github.com/qvim-aes/qvim-baseline.git
  ```
- Create and activate a conda environment with Python 3.10:

  ```bash
  conda env create -f environment.yml
  conda activate qvim-ensemble
  ```
- Install 7z, e.g.,

  ```bash
  # on Linux
  sudo apt install p7zip-full

  # on Windows
  conda install -c conda-forge 7zip
  ```

  For Linux users: do not use the conda package p7zip; it is based on the outdated 7-Zip version 16.02, and extracting the dataset requires a more recent release.
- If you have not used Weights and Biases for logging before, you can create a free account. On your machine, run

  ```bash
  wandb login
  ```

  and copy your API key from this link to the command line.
All training is handled by the unified script src/qvim_mn_baseline/train.py. It is highly configurable and supports multiple model architectures and data augmentation strategies.
To see all available options, run:

```bash
export PYTHONPATH=$(pwd)/src
python src/qvim_mn_baseline/train.py --help
```

The following command replicates the original baseline without advanced augmentations:
```bash
python src/qvim_mn_baseline/train.py \
  --model_type mobilenet \
  --project "qvim-experiments" \
  --model_save_path "checkpoints"
```

The following example uses the powerful PaSST model and enables the "light" augmentation profile:
```bash
python src/qvim_mn_baseline/train.py \
  --model_type passt \
  --use_augmentations true \
  --aug_profile light \
  --batch_size 12 \
  --n_epochs 50 \
  --project "qvim-experiments" \
  --model_save_path "checkpoints"
```

The following command trains the BEATs model using the "full" augmentation profile, which includes SpecMix:
```bash
python src/qvim_mn_baseline/train.py \
  --model_type beats \
  --beats_checkpoint_path "path/to/your/BEATs_iter3.pt" \
  --use_augmentations true \
  --aug_profile full \
  --batch_size 8 \
  --n_epochs 75 \
  --project "qvim-experiments" \
  --model_save_path "checkpoints"
```

| Model Name | MRR (exact match) |
|---|---|
| random | 0.0444 |
| MN baseline | 0.2726 |
| MN + Light Aug | 0.2835 |
| PaSST | 0.1502 |
| PANNs | 0.1577 |
| BEATs | 0.2309 |
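The MRR (mean reciprocal rank) values above average, over all queries, the reciprocal of the rank at which each query's correct item appears. The snippet below is an illustrative implementation of the metric itself, not the repository's evaluation code:

```python
import numpy as np

def mean_reciprocal_rank(similarities, correct_indices):
    """MRR for exact-match retrieval.

    similarities: (n_queries, n_items) score matrix, higher is better.
    correct_indices: for each query, the index of its correct item.
    """
    # Order items per query by descending score.
    order = np.argsort(-similarities, axis=1)
    reciprocal_ranks = []
    for q, target in enumerate(correct_indices):
        rank = int(np.where(order[q] == target)[0][0]) + 1  # 1-based rank
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Toy example: 2 queries over 4 items.
sims = np.array([
    [0.9, 0.1, 0.3, 0.2],  # correct item 0 ranked 1st -> RR = 1.0
    [0.2, 0.8, 0.5, 0.1],  # correct item 2 ranked 2nd -> RR = 0.5
])
print(mean_reciprocal_rank(sims, [0, 2]))  # 0.75
```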
For questions or inquiries, please contact rahul.peter@aalto.fi or vivek.mohan@aalto.fi.
```bibtex
@inproceedings{Greif2024,
  author    = "Greif, Jonathan and Schmid, Florian and Primus, Paul and Widmer, Gerhard",
  title     = "Improving Query-By-Vocal Imitation with Contrastive Learning and Audio Pretraining",
  booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024)",
  address   = "Tokyo, Japan",
  month     = "October",
  year      = "2024",
  pages     = "51--55"
}
```