Resource list of ASAudio: A Survey of Advanced Spatial Audio Research.
We introduce ASAudio, a comprehensive survey covering the representations, understanding tasks, and generation tasks in spatial audio, as well as the relevant datasets and evaluation metrics.
In this repository, we provide links to related papers and their corresponding code.
We hope this helps you appreciate the fascinating world of spatial audio!
- 2025.8.24: The survey is released on arXiv.
- ASAudio: A Survey of Advanced Spatial Audio Research - Zhiyuan Zhu*, Yu Zhang*, Wenxiang Guo, Changhao Pan, Zhou Zhao | Zhejiang University
- 🚀Quick Start
This is the official repository for ASAudio: A Survey of Advanced Spatial Audio Research.
Abstract
With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this gap, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. We chronologically outline existing work and categorize these studies by input-output representations and by generation and understanding tasks, thereby summarizing the main research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives.
| Attribute | Natural Language | Spatial Position | Visual Information | Monaural Audio |
|---|---|---|---|---|
| Primary Info | Semantic, relational, implicit spatial | Explicit spatial, dynamic | Semantic, spatial, dynamic | Acoustic (timbre, pitch, content) |
| Control Precision | Low | Very high | High | N/A |
| Abstraction Level | High | Low | High | Low |
| Interpretability | Indirect | Direct | Indirect | Indirect |
| Key Challenges | Ambiguity; semantic–signal gap | No semantics; tedious authoring | Ambiguity; occlusion; compute cost | Lack of spatial cues |

| Attribute | Channel-Based | Scene-Based | Object-Based |
|---|---|---|---|
| Freedom of Listening Position | Limited | High | Moderate |
| Playback System Dependency | Very high | High | Low |
| Scalability | Low | Moderate | Excellent |
| Playback-End Complexity | Low | High | Moderate |
| Common Formats | Stereo; 5.1/7.1 surround | Ambisonics; wave-field synthesis (WFS) | Dolby Atmos; DTS:X; MPEG-H 3D Audio |
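
To make the scene-based column more concrete, the following is a minimal sketch of first-order Ambisonics (FOA) encoding for a single mono source. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), a plane-wave source, and azimuth measured counter-clockwise from the front. The function name `encode_foa` is illustrative and does not come from the survey or any specific library.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode mono samples into four FOA channels.

    Assumed convention (not specified by the survey): AmbiX, i.e. ACN
    channel order (W, Y, Z, X) with SN3D normalization. Azimuth is
    counter-clockwise from the front; elevation is upward from the
    horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional component
        math.sin(az) * math.cos(el),  # Y: left-right axis
        math.sin(el),                 # Z: up-down axis
        math.cos(az) * math.cos(el),  # X: front-back axis
    )
    # Each output channel is the mono signal scaled by its direction gain.
    return [[g * s for s in mono] for g in gains]


# A source placed directly to the listener's left (azimuth 90 degrees).
mono = [1.0, 0.5, -0.25]
w, y, z, x = encode_foa(mono, azimuth_deg=90.0, elevation_deg=0.0)
```

Because the direction is carried by the channel gains rather than by a loudspeaker layout, the same four channels can later be rotated or decoded to different playback systems, which is why the table rates scene-based formats high on listening freedom but also high on playback-end complexity.
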

| Dataset | Format | Collection | Hours | Type | Labels | URL |
|---|---|---|---|---|---|---|
| Sweet-Home | Multi | Recorded | 47.3 | Speech | Text | |
| Voice-Home | Multi | Recorded | 2.5 | Speech | Text, Geometric | |
| YT-ALL & REC-STREET | FOA | Crawled | 116.5 | Audio | Video, Text | |
| FAIR-Play | Binaural | Recorded | 5.2 | Audio | Video | |
| SECL-UMons | Multi | Recorded | 5 | Audio | Text, Geometric | |
| YT-360 | FOA | Crawled | 246 | Audio | Video | |
| EasyCom | Multi | Recorded | 5 | Speech | Geometric, Text | |
| Binaural_Dataset | Binaural | Recorded | 2 | Speech | Geometric | |
| SimBinaural | Binaural | Simulated/Crawled | 143 | Audio | Video, Geometric | |
| Spatial LibriSpeech | FOA | Simulated | 650 | Speech | Text, Geometric | |
| STARSS23 | FOA | Recorded | 7.5 | Audio | Video, Geometric | |
| YT-Ambigen | FOA | Crawled | 142 | Audio | Video | |
| BEWO-1M | Binaural | Simulated | 2.8k | Audio | Text/Image, Geometric | |
| MRSDrama | Binaural | Recorded | 98 | Speech | Text, Video, Geometric | |

If you find this repository useful in your research, please cite our work:
@misc{zhu2025asaudiosurveyadvancedspatial,
title={ASAudio: A Survey of Advanced Spatial Audio Research},
author={Zhiyuan Zhu and Yu Zhang and Wenxiang Guo and Changhao Pan and Zhou Zhao},
year={2025},
eprint={2508.10924},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2508.10924},
}

