A robust framework for sound event localization and detection on real recordings (DCASE2022 Challenge)
This repository provides the official training and testing code for the SE-ResNet34-BiGRU model, which won 3rd place in Task 3: Sound Event Localization and Detection (SELD), DCASE 2022 Challenge.
Detailed information regarding our methodology, including the architecture shown below, can be found in our technical report, "A robust framework for sound event localization and detection on real recordings" (honored with the Judges' Award).
- Bootstrapping Training Batch: While external datasets were permitted, we observed that simply increasing the amount of simulated data often degraded performance. To mitigate this, we proposed a batch-balancing strategy that enables the model to learn from diverse external sound samples while effectively retaining real-world context.
- First Introduction of TTA in SELD: To further enhance performance, we introduced a Test-Time Augmentation (TTA) technique utilizing 16-pattern rotation for First-Order Ambisonics (FOA). This is the first instance of applying such a TTA strategy to the SELD task.
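The batch-balancing idea above can be sketched roughly as follows. This is a minimal illustration, not the report's exact recipe: the function name, the flat sample pools, and the 50/50 real-to-synthetic ratio are all assumptions made for the example.

```python
import random

def balanced_batch(real_items, synth_items, batch_size=32, real_ratio=0.5):
    """Draw every training batch with a fixed fraction of real samples.

    Because the real share is pinned per batch, adding more external
    synthetic data enlarges the synthetic pool without crowding the
    real recordings out of any batch.
    """
    n_real = int(batch_size * real_ratio)
    batch = random.sample(real_items, n_real)                 # real portion
    batch += random.sample(synth_items, batch_size - n_real)  # synthetic portion
    random.shuffle(batch)                                     # mix within the batch
    return batch
```

In an actual training loop the two pools would be dataset indices handed to a sampler, but the invariant is the same: the real-data fraction per batch stays constant regardless of how much synthetic data is added.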
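One plausible realization of the 16-pattern FOA rotation is sketched below, assuming ACN channel ordering [W, Y, Z, X]. For azimuth rotations in multiples of 90 degrees, azimuth reflection, and elevation flip, each transform reduces to channel swaps and sign flips (4 rotations x 2 reflections x 2 flips = 16). The function name and channel-order assumption are illustrative; they are not taken from the report.

```python
import numpy as np

def foa_16_transforms(audio):
    """Enumerate 16 spatial transforms of an FOA clip of shape (4, T).

    Assumes ACN channel order [W, Y, Z, X]. Combines:
      - 4 azimuth rotations by multiples of 90 degrees,
      - azimuth reflection (phi -> -phi, i.e. Y -> -Y),
      - elevation flip (theta -> -theta, i.e. Z -> -Z).
    """
    W, Y, Z, X = audio
    outs = []
    for refl in (1.0, -1.0):              # azimuth reflection
        x, y = X, refl * Y
        for _ in range(4):                # azimuth rotation, +90 deg per step
            for zflip in (1.0, -1.0):     # elevation flip
                outs.append(np.stack([W, y, zflip * Z, x]))
            x, y = -y, x                  # rotate +90 deg: X' = -Y, Y' = X
    return outs                           # 16 clips, first one is the identity
```

At test time, each transformed clip is run through the model and the predicted directions of arrival are mapped back through the inverse transform before the 16 outputs are averaged.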
Our model was trained and evaluated using a combination of real-world and synthetic datasets:
- External Datasets: We utilized sound samples synthesized from five external sources: AudioSet[1], FSD50K[2], ESC-50[3], IRMAS[4], and Wearable SELD[5].
- Previous DCASE Challenge Data: We incorporated synthetic SELD datasets from previous DCASE Challenges (2020 and 2021)[6, 7], which were generated using similar simulation techniques.
- Real-world Data: We used the STARSS22 dataset[8], which contains real-world soundscapes provided for the DCASE 2022 Challenge.
All simulated soundscapes were generated using the official data generation repository provided by the challenge organizers.
[1] J. F. Gemmeke, et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017.
[2] E. Fonseca, et al., “FSD50K: An Open Dataset of Human-Labeled Sound Events,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 30, pp. 829-852, 2022.
[3] K. J. Piczak, “ESC: Dataset for Environmental Sound Classification,” in Proc. ACM Conference on Multimedia, 2015.
[4] J. J. Bosch, et al., “A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals,” in Proc. ISMIR, 2012.
[5] K. Nagatomo, et al., “Wearable SELD Dataset: Dataset for Sound Event Localization and Detection Using Wearable Devices Around Head,” in Proc. IEEE ICASSP, 2022.
[6] A. Politis, et al., “A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection,” in Proc. DCASE2020 Workshop, 2020.
[7] A. Politis, et al., “A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection,” in Proc. DCASE2021 Workshop, 2021.
[8] A. Politis, et al., “STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events,” in Proc. DCASE2022 Workshop, 2022.
This repository is built upon the official DCASE Challenge Baseline Repository.
The core components are organized as follows:
- `custom_model.py` defines the model architecture, featuring an SE-ResNet34 backbone integrated with BiGRU and ADPIT-based SELD prediction heads.
- `parameters.py` contains the hyperparameters for both the training and inference phases.
- `main_train_model.py` is the primary script for model training.
- `main_test_model.py` loads the model from the weight path specified in `parameters.py` and evaluates its SELD performance.
J. S. Kim, et al., "A robust framework for sound event localization and detection on real recordings," Tech. Rep., DCASE2022 Challenge, 2022.
@techreport{kim2022_dcase,
  title={A robust framework for sound event localization and detection on real recordings},
  author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
  institution={DCASE2022 Challenge},
  year={2022},
  month={June}
}
@article{kim2025_arxiv,
  title={A robust framework for sound event localization and detection on real recordings},
  author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
  journal={arXiv preprint arXiv:2512.22156},
  year={2025}
}
This repository is released under the MIT license.
Thanks to:
- sharathadavanne/seld-dcase2022: DCASE Challenge 2022, Task 3 (SELD) baseline.
- danielkrause/DCASE2022-data-generator: Generating synthetic audio mixtures suitable for DCASE Challenge 2022 Task 3.
