DCASE 2025 Task 3 Baseline: Stereo sound event localization and detection in regular video content

Please visit the official webpage of the DCASE 2025 Challenge for the task details.

This year's challenge features both an audio-only and an audio-visual track. Unlike previous editions, which used four-channel FOA/MIC format audio and 360-degree video, this year's challenge uses stereo audio and standard video. The dataset used for this challenge, Stereo SELD, is derived from the STARSS23 dataset, with each sample having a duration of 5 seconds. Details on the Stereo SELD dataset creation process can be found on the task description page.

Task Overview

Participants must detect sound events along with their direction of arrival (DOA) and distance. In the audio-visual track, an additional task involves determining whether a detected sound event is on-screen or off-screen. Since the input audio is stereo, participants are only required to predict the azimuth DOA. Furthermore, to address front-back confusion, all DOA values corresponding to the rear hemisphere are mapped to the front hemisphere.
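The front-back folding described above can be illustrated with a short sketch. It assumes azimuth is given in degrees with 0° straight ahead and ±90° at the two sides, a convention chosen here for illustration only. The function name is hypothetical and this is not the dataset's own conversion code; the task description page documents the actual procedure.

```python
def fold_azimuth_to_front(azimuth_deg: float) -> float:
    """Mirror a rear-hemisphere azimuth onto the front hemisphere.

    Assumption: 0 deg is straight ahead and +/-90 deg are the two sides.
    """
    # Wrap the angle into [-180, 180) before folding.
    az = (azimuth_deg + 180.0) % 360.0 - 180.0
    if az > 90.0:
        # Rear hemisphere, positive side: mirror across the +/-90 deg axis.
        return 180.0 - az
    if az < -90.0:
        # Rear hemisphere, negative side: mirror across the +/-90 deg axis.
        return -180.0 - az
    # Front-hemisphere angles are left unchanged.
    return az
```

Under these assumptions, an azimuth of 120° folds to 60° and -120° folds to -60°, while any angle already in the front hemisphere is returned as is.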

Baseline Model

For the audio baseline, we modify the SELDnet studied in [1], adding multi-head self-attention blocks to the architecture based on the findings in [4]. To support the detection of multiple overlapping instances of the same class, the output uses the Multi-ACCDOA representation [2], extended with distance estimation [5]. For the audio-visual baseline, inspired by the work in [3], we extract ResNet-50 features from the video frames corresponding to the input audio, at a frame rate of 10 fps. The visual features are fused with the audio features using transformer decoder blocks, and the output of these blocks is fed to linear layers to obtain the Multi-ACCDOA representation.

The input consists of stereo audio and the corresponding video frames, from which log-mel spectrogram and ResNet-50 features are extracted, respectively. The model predicts all active sound event classes for each frame along with their spatial locations, producing the temporal activity and DOA trajectory for each sound event class. Each sound event class in the Multi-ACCDOA output is represented by three regressors: two estimate the Cartesian x and y coordinates of the azimuth DOA around the microphone, and one estimates the distance. In the audio-visual model, an additional binary output neuron predicts whether the sound event is inside or outside the video frame.
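As a rough illustration of how one (x, y, distance) triplet from the output could be turned into a detection, the sketch below treats the length of the (x, y) vector as the activity, as in the ACCDOA family of representations. The 0.5 threshold, the angle convention (azimuth taken as atan2(y, x)), and the function name are assumptions made for this example and are not read from the baseline code.

```python
import math
from typing import Optional, Tuple


def decode_track(x: float, y: float, distance: float,
                 threshold: float = 0.5) -> Optional[Tuple[float, float]]:
    """Decode one (x, y, distance) output triplet for a single class.

    Returns (azimuth in degrees, distance) if the class is judged active,
    otherwise None. The threshold value is an assumption for illustration.
    """
    # The norm of the Cartesian DOA vector acts as the activity score.
    activity = math.hypot(x, y)
    if activity <= threshold:
        return None
    # Assumed convention: azimuth is the angle of the (x, y) vector.
    azimuth_deg = math.degrees(math.atan2(y, x))
    return azimuth_deg, distance
```

A short vector such as (0.1, 0.1) is treated as inactive, whereas a vector close to unit length yields an azimuth and the accompanying distance estimate.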

Dataset

The Stereo SELD dataset, derived from STARSS23, comprises 30,000 real recordings, each 5 seconds long. Each recording includes stereo audio, standard video, and the corresponding detection and localization labels. For further details, please refer to the task description webpage.

NOTE: Participants must use the fixed development test split provided in the baseline method for reporting development scores. The evaluation set will be released at a later stage.

The development dataset can be downloaded from the link - DCASE2025 Task3 Stereo SELD Dataset

NOTE: Additional audio-visual synthetic data can be generated using the Spatial Scaper library, and the process for creating stereo versions of the synthetic data is detailed on the task description page.

Project Structure

main.py serves as the entry point for the project. It coordinates the other scripts and executes the workflow.
data_generator.py generates the data and labels for training and evaluation.
extract_features.py extracts features from the raw data (audio, video, and labels in accdoa or multiaccdoa format) for model training.
inference.py handles model inference, producing predictions with a trained model.
loss.py defines the singleaccdoa and multiaccdoa (adpit) loss functions used during training.
metrics.py implements the evaluation metrics used to assess model performance.
model.py defines the SELD model architecture.
parameters.py contains all hyperparameters and configurations. To change a parameter, update it in this file.
utils.py includes utility functions used throughout the project.

How to use this repo?

Pre-requisites

The provided codebase has been tested with Python 3.9 and PyTorch 2.6.

Download and organize the dataset

Download the dataset from the link given in the Dataset section above (DCASE2025 Task3 Stereo SELD Dataset).
Extract the dataset into a root directory named DCASE2025_SELD_dataset/.
After unzipping, the directory structure should be:

DCASE2025_SELD_dataset/
├── stereo_dev/
│   ├── dev-train-tau/*.wav
│   ├── dev-train-sony/*.wav
│   ├── dev-test-tau/*.wav
│   ├── dev-test-sony/*.wav
├── metadata_dev/
│   ├── dev-train-tau/*.csv
│   ├── dev-train-sony/*.csv
│   ├── dev-test-tau/*.csv
│   ├── dev-test-sony/*.csv
├── video_dev/
│   ├── dev-train-tau/*.mp4
│   ├── dev-train-sony/*.mp4
│   ├── dev-test-tau/*.mp4
│   ├── dev-test-sony/*.mp4
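Before running feature extraction, it can help to confirm that the extracted archive matches the layout above. The following is a minimal sketch of such a check; the helper names `expected_dirs` and `missing_dirs` are hypothetical and not part of this repository, and the script only verifies that the folders exist, not that they contain the right files.

```python
from pathlib import Path
from typing import List

# Folder names taken from the directory tree shown above.
MODALITY_DIRS = ("stereo_dev", "metadata_dev", "video_dev")
SPLIT_DIRS = ("dev-train-tau", "dev-train-sony", "dev-test-tau", "dev-test-sony")


def expected_dirs() -> List[str]:
    """List every modality/split folder expected under the dataset root."""
    return [f"{m}/{s}" for m in MODALITY_DIRS for s in SPLIT_DIRS]


def missing_dirs(root: str) -> List[str]:
    """Return the expected folders that are absent under `root`."""
    root_path = Path(root)
    return [d for d in expected_dirs() if not (root_path / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("DCASE2025_SELD_dataset")
    print("Layout OK" if not missing else f"Missing: {missing}")
```

If synthetic data is used, `dev-train-synth` could be appended to `SPLIT_DIRS` so that the check covers it as well.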

If you generate synthetic data, place it in the corresponding folders under the name dev-train-synth. The baseline models behind the results reported here were trained with an additional synthetic dataset of 15,000 five-second audio-visual samples.

Inference on the Trained Model

To run inference with the pre-trained baseline models provided in the checkpoints directory:

Update the model_dir in inference.py with the path to the pretrained model directory.
Run inference using the following command:

python inference.py

Training the Stereo SELDnet Model

To train the model with default settings:

modality = 'audio_visual'
multiaccdoa = True

Run the following command:

python main.py

Custom Training Configuration

To modify training settings:

Edit the parameters.py file to adjust the configurations according to your requirements.
Run the following command to train with updated settings:

python main.py

Results on development dataset

Metrics Overview

As the SELD evaluation metric we employ the joint localization and detection metrics proposed in [6], with extensions from [2, 5] to support multi-instance scoring of the same class and distance estimation.

F-score (F_20°) – Primarily focused on detection, considering a prediction correct only if:
- The predicted and reference class match.
- The DOA angular error is within 20°.
- The relative distance error is below 1.0.
F-score (F_20°/on-off) – Additional F-score for the audio-visual task, considering a prediction correct only if:
- The predicted and reference class match.
- The DOA angular error is within 20°.
- The relative distance error is below 1.0.
- The event is correctly identified as being On-screen or Off-screen.
DOA Angular Error (DOAE_CD) – Measures the class-dependent DOA error in degrees.
Relative Distance Error (RDE_CD) – Defined as the difference between the estimated and reference distances, normalized by the reference distance.
On-screen/Off-screen accuracy – For the audio-visual track, we also evaluate the accuracy of predicting whether a detected sound event is on-screen or off-screen.

Unlike location-aware detection, no angular or distance thresholds are applied for DOAE_CD and RDE_CD.
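The per-prediction criterion behind F_20° can be sketched as follows. This is an illustration of the three conditions listed above for a single prediction/reference pair, not the code in metrics.py: it leaves out the matching of multiple predictions to references and the class-wise averaging, and it assumes the relative distance error uses the absolute difference. All function names are hypothetical.

```python
def angular_error_deg(pred_az: float, ref_az: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = abs(pred_az - ref_az) % 360.0
    return min(diff, 360.0 - diff)


def relative_distance_error(pred_dist: float, ref_dist: float) -> float:
    """Distance error normalized by the reference distance.

    Assumption: the absolute difference is used.
    """
    return abs(pred_dist - ref_dist) / ref_dist


def is_true_positive(pred_class: int, ref_class: int,
                     pred_az: float, ref_az: float,
                     pred_dist: float, ref_dist: float,
                     angle_thr_deg: float = 20.0,
                     rde_thr: float = 1.0) -> bool:
    """Apply the class, angle, and distance conditions to one pair."""
    return (pred_class == ref_class
            and angular_error_deg(pred_az, ref_az) <= angle_thr_deg
            and relative_distance_error(pred_dist, ref_dist) < rde_thr)
```

For example, a prediction of the correct class at 10° and 2 m against a reference at 25° and 3 m has an angular error of 15° and a relative distance error of about 0.33, so it would count as correct under these thresholds.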

The evaluation metric scores for the test split of the development dataset are given below.

Dataset	F_20°	F_20°/on-off	DOAE_CD	RDE_CD	on/off screen accuracy
Audio only	22.78%	N/A	24.5°	0.41	N/A
Audio-visual	26.77%	20.0%	23.8°	0.40	0.80

Note: The reported baseline system performance is not exactly reproducible due to varying setups. However, you should be able to obtain very similar results.

Submission

Make sure the file-wise output you submit is produced at a 100 ms hop length. At this hop length, a 5-second audio file has 50 frames.
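The frame count follows directly from the hop length, as the small sketch below shows. Integer milliseconds are used to avoid floating-point rounding, and the zero-based frame indexing is an assumption for illustration; the task webpage defines the actual submission format. The function names are hypothetical.

```python
def num_output_frames(duration_ms: int = 5000, hop_ms: int = 100) -> int:
    """Number of output frames for a clip at the given hop length."""
    return duration_ms // hop_ms


def frame_index(time_ms: int, hop_ms: int = 100) -> int:
    """Zero-based index of the frame containing `time_ms` (assumed convention)."""
    return time_ms // hop_ms
```

With the default values, `num_output_frames()` gives 50, so a sanity check on each output file can simply confirm that no frame index reaches 50 or beyond.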

For more information on the submission file formats, check the task webpage.

References

License

This repo and its contents are released under the MIT License.

