A robust framework for sound event localization and detection on real recordings (DCASE2022 Challenge)
This repository provides the official training and testing code for the SE-ResNet34-BiGRU model, which won 3rd place in Task 3: Sound Event Localization and Detection (SELD), DCASE 2022 Challenge.
Detailed information regarding our methodology, including the architecture shown below, can be found in our technical report, "A robust framework for sound event localization and detection on real recordings" (honored with the Judges' Award).
- Bootstrapping Training Batch: While external datasets were permitted, we observed that simply increasing the amount of simulated data often degraded performance. To mitigate this, we proposed a batch-balancing strategy that enables the model to learn from diverse external sound samples while effectively retaining real-world context.
- First Introduction of TTA in SELD: To further enhance performance, we introduced a Test-Time Augmentation (TTA) technique utilizing 16-pattern rotation for First-Order Ambisonics (FOA). This is the first instance of applying such a TTA strategy to the SELD task.
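The batch-balancing idea above can be sketched roughly as follows. This is a minimal illustration, not the report's exact recipe: the function name, the flat sample pools, and the 50/50 real-to-synthetic ratio are all assumptions made for the example.

```python
import random

def balanced_batch(real_items, synth_items, batch_size=32, real_ratio=0.5):
    """Draw every training batch with a fixed fraction of real samples.

    Because the real share is pinned per batch, adding more external
    synthetic data enlarges the synthetic pool without crowding the
    real recordings out of any batch.
    """
    n_real = int(batch_size * real_ratio)
    batch = random.sample(real_items, n_real)                 # real portion
    batch += random.sample(synth_items, batch_size - n_real)  # synthetic portion
    random.shuffle(batch)                                     # mix within the batch
    return batch
```

In an actual training loop the two pools would be dataset indices handed to a sampler, but the invariant is the same: the real-data fraction per batch stays constant regardless of how much synthetic data is added.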
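One plausible realization of the 16-pattern FOA rotation is sketched below, assuming ACN channel ordering [W, Y, Z, X]. For azimuth rotations in multiples of 90 degrees, azimuth reflection, and elevation flip, each transform reduces to channel swaps and sign flips (4 rotations x 2 reflections x 2 flips = 16). The function name and channel-order assumption are illustrative; they are not taken from the report.

```python
import numpy as np

def foa_16_transforms(audio):
    """Enumerate 16 spatial transforms of an FOA clip of shape (4, T).

    Assumes ACN channel order [W, Y, Z, X]. Combines:
      - 4 azimuth rotations by multiples of 90 degrees,
      - azimuth reflection (phi -> -phi, i.e. Y -> -Y),
      - elevation flip (theta -> -theta, i.e. Z -> -Z).
    """
    W, Y, Z, X = audio
    outs = []
    for refl in (1.0, -1.0):              # azimuth reflection
        x, y = X, refl * Y
        for _ in range(4):                # azimuth rotation, +90 deg per step
            for zflip in (1.0, -1.0):     # elevation flip
                outs.append(np.stack([W, y, zflip * Z, x]))
            x, y = -y, x                  # rotate +90 deg: X' = -Y, Y' = X
    return outs                           # 16 clips, first one is the identity
```

At test time, each transformed clip is run through the model and the predicted directions of arrival are mapped back through the inverse transform before the 16 outputs are averaged.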
Our model was trained and evaluated using a combination of real-world and synthetic datasets:
- External Datasets: We utilized sound samples synthesized from five external sources: AudioSet[1], FSD50K[2], ESC-50[3], IRMAS[4], and Wearable SELD[5].
- Previous DCASE Challenge Data: We incorporated synthetic SELD datasets from previous DCASE Challenges (2020 and 2021)[6, 7], which were generated using similar simulation techniques.
- Real-world Data: We used the STARSS22 dataset[8], which contains real-world soundscapes provided for the DCASE 2022 Challenge.
All simulated soundscapes were generated using the official data generation repository provided by the challenge organizers.
[1] J. F. Gemmeke, et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017.
[2] E. Fonseca, et al., “FSD50K: An Open Dataset of Human-Labeled Sound Events,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 30, pp. 829-852, 2022.
[3] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proc. ACM Conference on Multimedia, 2015.
[4] J. J. Bosch, et al., “A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals,” in Proc. ISMIR, 2012.
[5] K. Nagatomo, et al., “Wearable SELD Dataset: Dataset for Sound Event Localization and Detection Using Wearable Devices Around Head,” in Proc. IEEE ICASSP, 2022.
[6] A. Politis, et al., “A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection,” in Proc. DCASE2020 Workshop, 2020.
[7] A. Politis, et al., “A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection,” in Proc. DCASE2021 Workshop, 2021.
[8] A. Politis, et al., “STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” in Proc. DCASE2022 Workshop, 2022.
This repository is built upon the official DCASE Challenge Baseline Repository.
The core components are organized as follows:
- `custom_model.py` defines the model architecture, featuring an SE-ResNet34 backbone integrated with BiGRU and ADPIT-based SELD prediction heads.
- `parameters.py` contains the hyperparameters for both the training and inference phases.
- `main_train_model.py` is the primary script for model training.
- `main_test_model.py` loads the model from the weight path specified in `parameters.py` and evaluates its SELD performance.
J. S. Kim, et al., "A robust framework for sound event localization and detection on real recordings," Tech. Rep., DCASE2022 Challenge, 2022.
@techreport{kim2022_dcase,
  title={A robust framework for sound event localization and detection on real recordings},
  author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
  institution={DCASE2022 Challenge},
  year={2022},
  month={June}
}
@article{kim2025_arxiv,
  title={A robust framework for sound event localization and detection on real recordings},
  author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
  journal={arXiv preprint arXiv:2512.22156},
  year={2025}
}
This repository is released under the MIT license.
Thanks to:
- sharathadavanne/seld-dcase2022: DCASE Challenge 2022, Task 3 (SELD) baseline.
- danielkrause/DCASE2022-data-generator: Generating synthetic audio mixtures suitable for DCASE Challenge 2022 Task 3.
