ASAudio: A Survey of Advanced Spatial Audio Research

Zhiyuan Zhu, Yu Zhang, Wenxiang Guo, Changhao Pan, Zhou Zhao | Zhejiang University

Resource list of ASAudio: A Survey of Advanced Spatial Audio Research.

We introduce ASAudio, a comprehensive survey covering the representations, understanding tasks, and generation tasks in spatial audio, as well as the relevant datasets and evaluation metrics.

In this repository, we provide links to related papers and their corresponding code.

We hope this helps you appreciate the fascinating world of spatial audio!

News

2025.8.24: The survey is released on arXiv.

🚀Quick Start


Introduction

This is the official repository for ASAudio: A Survey of Advanced Spatial Audio Research.

Figure 1: The timeline of spatial audio models & datasets in recent years.

Abstract

With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area. We chronologically outline existing work and categorize these studies by input-output representation and by generation and understanding task, thereby summarizing the main research directions in spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives.

Overall

Figure 2: Organization of this survey.

Representations of Spatial Audio

1. Input Representations

Attribute	Natural Language	Spatial Position	Visual Information	Monaural Audio
Primary Info	Semantic, relational, implicit spatial	Explicit spatial, dynamic	Semantic, spatial, dynamic	Acoustic (timbre, pitch, content)
Control Precision	Low	Very high	High	N/A
Abstraction Level	High	Low	High	Low
Interpretability	Indirect	Direct	Indirect	Indirect
Key Challenges	Ambiguity; semantic–signal gap	No semantics; tedious authoring	Ambiguity; occlusion; compute cost	Lack of spatial cues

Table 1: Comparative analysis of spatial audio input representations

Figure 1: The input representations and their fundamental processing steps.

2. Spatial Cues and Physical Modeling

2.1 Room Impulse Response (RIR)

Paper	URL	Code/Dataset
Few-shot audio-visual learning of environment acoustics
Spatial scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms
Novel-view acoustic synthesis from 3D reconstructed rooms
A binaural room impulse response database for the evaluation of dereverberation algorithms
The Sweet-Home speech and multimodal corpus for home automation interaction		-
Dataset of Binaural Room Impulse Responses at Multiple Recording Positions, Source Positions, and Orientations in a Real Room		-
dEchorate: a calibrated room impulse response database for echo-aware signal processing
BIRD: Big impulse response dataset
Visually informed binaural audio generation without binaural audios
MeshRIR: A dataset of room impulse responses on meshed grid points for evaluating sound field analysis and synthesis methods
Mesh2ir: Neural acoustic impulse response generator for complex 3d scenes
A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection
Acoustic analysis and dataset of transitions between coupled rooms		-
Dataset of spatial room impulse responses in a variable acoustics room for six degrees-of-freedom rendering and analysis
On the authenticity of individual dynamic binaural synthesis		-

Table 2: The list of RIR papers and their URLs
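
The RIR resources above are commonly used to simulate reverberation: convolving a dry (anechoic) signal with a measured or simulated RIR gives the signal as it would be heard in that room. The following is a minimal pure-Python sketch of that operation. The `convolve` helper, the variable names, and the toy values are assumptions made for illustration and are not taken from any dataset or codebase listed here.

```python
def convolve(signal, kernel):
    """Full linear convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

# Hypothetical toy data: a unit click and a 3-tap "RIR"
# (direct path followed by one reflection at half amplitude).
dry = [1.0, 0.0, 0.0, 0.0]
rir = [1.0, 0.0, 0.5]

wet = convolve(dry, rir)  # reverberant signal, len(dry) + len(rir) - 1 samples
print(wet)
```

Real RIRs contain thousands of taps, so practical pipelines usually rely on FFT-based convolution instead of this direct double loop.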

2.2 Head Related Transfer Function (HRTF)

Paper	URL	Code/Dataset
HRTF personalization based on ear morphology		-
On HRTF Notch Frequency Prediction using Anthropometric Features and Neural Networks		-
Magnitude modeling of personalized HRTF based on ear images and anthropometric measurements		-
Global HRTF interpolation via learned affine transformation of hyper-conditioned features
HRTF recommendation based on the predicted binaural colouration model		-
Modeling individual head-related transfer functions from sparse measurements using a convolutional neural network		-
Head-related transfer function interpolation from spatially sparse measurements using autoencoder with source position conditioning
HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection
Spatial upsampling of head-related transfer functions using a physics-informed neural network
HRTF field: Unifying measured HRTF magnitude representation with neural fields
Head-related transfer function interpolation with a spherical CNN
HRTF interpolation using a spherical neural process meta-learner		-
NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

Table 3: The list of HRTF papers and their URLs
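
In the time domain, an HRTF pair corresponds to a pair of head-related impulse responses (HRIRs), and filtering a mono signal with the left and right HRIRs yields a binaural signal for that source direction. The sketch below illustrates the idea under simplifying assumptions: the HRIRs are hand-written toy values (not from any measured database), the source direction is fixed, and `binauralize` is a hypothetical helper name.

```python
def convolve(signal, kernel):
    """Full linear convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def binauralize(mono, hrir_left, hrir_right):
    """Return (left, right) channels for one fixed source direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Hypothetical toy HRIRs for a source on the listener's left:
# the right ear receives the sound two samples later and at half level,
# mimicking interaural time and level differences.
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.5]

left, right = binauralize([1.0, -1.0], hrir_left, hrir_right)
print(left, right)
```

Measured HRIRs additionally encode direction-dependent spectral cues from the pinna, which is why the interpolation and personalization methods listed above matter.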

3. Output Representations

Attribute	Channel-Based	Scene-Based	Object-Based
Freedom of Listening Position	Limited	High	Moderate
Playback System Dependency	Very high	High	Low
Scalability	Low	Moderate	Excellent
Playback-End Complexity	Low	High	Moderate
Common Formats	Stereo; 5.1/7.1 surround	Ambisonics; wave-field synthesis (WFS)	Dolby Atmos; DTS:X; MPEG-H 3D Audio

Table 4: Comparative analysis of spatial audio output representations
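
As a concrete instance of the scene-based column, a mono source can be encoded into first-order Ambisonics (FOA) by weighting it with the first-order spherical harmonics evaluated at the source direction. The sketch below assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization) and azimuth measured counter-clockwise from the front; the survey itself does not prescribe a convention, and `encode_foa` is a hypothetical helper name.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into FOA (assumed AmbiX: ACN order, SN3D)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    x = sample * math.cos(az) * math.cos(el)  # front-back
    return (w, y, z, x)

# A source directly to the listener's left on the horizontal plane.
w, y, z, x = encode_foa(1.0, azimuth_deg=90.0, elevation_deg=0.0)
print(w, y, z, x)
```

Because the four channels describe the sound field rather than loudspeaker feeds, a decoder is still needed at playback, which is consistent with the higher playback-end complexity listed for scene-based formats in Table 4.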

4. Spatial Audio Understanding Models

4.1 SELD Papers

Paper	URL	Code
Learning to localize sound source in visual scenes
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning
Classification of spatial audio location and content using convolutional neural networks		-
An improved event-independent network for polyphonic sound event localization and detection
Sound event localization and detection of overlapping sources using convolutional recurrent neural networks
Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection
Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training		-
Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation		-
Binaural source localization using deep learning and head rotation information		-
A sequence matching network for polyphonic sound event localization and detection		-
ACCDOA: Activity-coupled cartesian direction of arrival representation for sound event localization and detection		-
Binaural sound source distance estimation and localization for a moving listener
Audio-visual event localization in unconstrained videos		-
w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training
BAST: Binaural audio spectrogram transformer for binaural sound localization
A time-domain unsupervised learning based sound source localization method		-
Sslide: Sound source localization for indoors based on deep learning		-
Semi-supervised source localization with deep generative modeling		-
Semi-supervised source localization in reverberant environments with deep generative modeling		-
Joint measurement of localization and detection of sound events
3D localization of multiple sound sources with intensity vector estimates in single source zones		-
Self-supervised moving vehicle tracking with stereo sound		-
A probabilistic model for robust localization based on a binaural auditory front-end		-
Deepear: Sound localization with binaural microphones		-
Towards generating ambisonics using audio-visual cue for virtual reality
AD-YOLO: You look only once in training multiple sound event localization and detection		-
Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy
Sound event localization and detection using squeeze-excitation residual CNNs

Table 5: The list of SELD papers and their URLs

4.2 Spatial Audio Separation Papers

Paper	URL	Code
Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation		-
Beamforming techniques for multichannel audio signal separation		-
Deep learning based binaural speech separation in reverberant environments		-
Source separation based on binaural cues and source model constraints		-
Combining spectral and spatial features for deep learning based blind speaker separation		-
Real-time binaural speech separation with preserved spatial cues		-
2.5D visual sound
Lavss: Location-guided audio-visual spatial audio separation
The cocktail party robot: Sound source separation and localisation with an active binaural head		-
The sound of pixels
Self-supervised generation of spatial audio for 360 video
Multichannel audio source separation with deep neural networks		-
Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation		-
Multi-microphone neural speech separation for far-field multi-talker speech recognition		-
Unsupervised Bayesian Surprise Detection in Spatial Audio with Convolutional Variational Autoencoder and LSTM Model		-
Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation

Table 6: The list of Spatial Audio Separation papers and their URLs

4.3 Joint Learning Papers

Paper	URL	Code
Learning representations from audio-visual spatial alignment		-
Telling left from right: Learning spatial correspondence of sight and sound
Av-nerf: Learning neural fields for real-world audio-visual scene synthesis
Learning neural acoustic fields
Overview of geometrical room acoustic modeling techniques		-
Av-rir: Audio-visual room impulse response estimation
Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation		-
Multi-Channel Mosra: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and A Teacher Model		-
Few-shot audio-visual learning of environment acoustics
Blind room parameter estimation using multiple multichannel speech recordings
Visual-based spatial audio generation system for multi-speaker environments		-
Learning Spatially-Aware Language and Audio Embeddings		-
BAT: Learning to Reason about Spatial Sounds with Large Language Models

Table 7: The list of Joint Learning papers and their URLs

5. Spatial Audio Generation Models

Paper	URL	Code
A structural model for binaural sound synthesis		-
Neural synthesis of binaural speech from mono audio
2.5D visual sound
Cyclic Learning for Binaural Audio Generation and Localization		-
Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention		-
Geometry-aware multi-task learning for binaural audio generation from video		-
Multi-attention audio-visual fusion network for audio spatialization		-
Visually Guided Binaural Audio Generation with Cross-Modal Consistency		-
Interpretable binaural ratio for visually guided binaural audio generation		-
Cross-modal generative model for visual-guided binaural stereo generation		-
Binauralgrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect		-
Neural fourier shift for binaural speech rendering
Visually informed binaural audio generation without binaural audios
Localize to binauralize: Audio spatialization from visual sound source localization
Sep-stereo: Visually guided stereophonic audio generation by associating source separation
Exploiting audio-visual consistency with partial supervision for spatial audio generation		-
Binaural audio generation via multi-task learning
End-to-end binaural speech synthesis		-
Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content		-
ViSAGe: Video-to-Spatial Audio Generation
OmniAudio: Generating Spatial Audio from 360-Degree Video
Towards generating ambisonics using audio-visual cue for virtual reality
Av-nerf: Learning neural fields for real-world audio-visual scene synthesis
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model		-
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model		-
Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models		-
Ambisonizer: Neural upmixing as spherical harmonics generation
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Simple and controllable music generation
Moûsai: Text-to-music generation with long-context latent diffusion
Long-form music generation with latent diffusion
Listen2scene: Interactive material-aware binaural sound propagation for reconstructed 3d scenes
Immersive spatial audio reproduction for vr/ar using room acoustic modelling from 360 images		-
See-2-sound: Zero-shot spatial environment-to-spatial sound
Binauralgrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis
See-2-sound: Zero-shot spatial environment-to-spatial sound
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model		-
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model		-
SonicMotion: Dynamic Spatial Audio Soundscapes with Latent Diffusion Models		-
Simple and controllable music generation
Ambisonizer: Neural upmixing as spherical harmonics generation
ViSAGe: Video-to-Spatial Audio Generation
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
OmniAudio: Generating Spatial Audio from 360-Degree Video
Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models		-
Cross-modal generative model for visual-guided binaural stereo generation		-
2.5D visual sound
Sep-stereo: Visually guided stereophonic audio generation by associating source separation
Interpretable binaural ratio for visually guided binaural audio generation		-
Visually Guided Binaural Audio Generation with Cross-Modal Consistency		-
Geometry-aware multi-task learning for binaural audio generation from video		-
Binaural audio generation via multi-task learning
Enhancing spatial audio generation with source separation and channel panning loss		-
Immersive spatial audio reproduction for vr/ar using room acoustic modelling from 360 images		-
Towards generating ambisonics using audio-visual cue for virtual reality
Self-supervised generation of spatial audio for 360 video
Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration
AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Long-form music generation with latent diffusion
End-to-end binaural speech synthesis		-
Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content		-
Av-nerf: Learning neural fields for real-world audio-visual scene synthesis

Table 8: The list of Spatial Audio Generation papers and their URLs

6. Spatial Audio Datasets

Dataset	Format	Collection	Hours	Type	Labels
Sweet-Home	Multi	Recorded	47.3	Speech	Text
Voice-Home	Multi	Recorded	2.5	Speech	Text, Geometric
YT-ALL & REC-STREET	FOA	Crawled	116.5	Audio	Video, Text
FAIR-Play	Binaural	Recorded	5.2	Audio	Video
SECL-UMons	Multi	Recorded	5	Audio	Text, Geometric
YT-360	FOA	Crawled	246	Audio	Video
EasyCom	Multi	Recorded	5	Speech	Geometric, Text
Binaural_Dataset	Binaural	Recorded	2	Speech	Geometric
SimBinaural	Binaural	Sim/Crawl	143	Audio	Video, Geometric
Spatial LibriSpeech	FOA	Simulated	650	Speech	Text, Geometric
STARSS23	FOA	Recorded	7.5	Audio	Video, Geometric
YT-Ambigen	FOA	Crawled	142	Audio	Video
BEWO-1M	Binaural	Simulated	2.8k	Audio	Text/Image, Geometric
MRSDrama	Binaural	Recorded	98	Speech	Text, Video, Geometric

Table 9: The list of Spatial Audio Datasets and their URLs

Citations

If you find this code useful in your research, please cite our work:

@misc{zhu2025asaudiosurveyadvancedspatial,
      title={ASAudio: A Survey of Advanced Spatial Audio Research}, 
      author={Zhiyuan Zhu and Yu Zhang and Wenxiang Guo and Changhao Pan and Zhou Zhao},
      year={2025},
      eprint={2508.10924},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.10924}, 
}
