ASAudio: A Survey of Advanced Spatial Audio Research

Zhiyuan Zhu*, Yu Zhang*, Wenxiang Guo, Changhao Pan, Zhou Zhao | Zhejiang University

Resource list of ASAudio: A Survey of Advanced Spatial Audio Research.


We introduce ASAudio, a comprehensive survey covering the representations, understanding tasks, and generation tasks in spatial audio, as well as the relevant datasets and evaluation metrics.

In this repository, we provide links to related papers and their corresponding code.

We hope this helps you appreciate the fascinating world of spatial audio!

News

  • 2025.8.24: The survey is released on arXiv.

🚀Quick Start

Introduction

This is the official repository for ASAudio: A Survey of Advanced Spatial Audio Research.

Figure 1: The timeline of spatial audio models & datasets in recent years.

Abstract

With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this gap, we provide a comprehensive overview of spatial audio and systematically review recent literature in the area: we chronologically outline existing work, and categorize these studies by input-output representations as well as generation and understanding tasks, thereby summarizing the various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives.

Overall

Figure 2: Organization of this survey.

Representations of Spatial Audio

1. Input Representations

| Attribute | Natural Language | Spatial Position | Visual Information | Monaural Audio |
| --- | --- | --- | --- | --- |
| Primary Info | Semantic, relational, implicit spatial | Explicit spatial, dynamic | Semantic, spatial, dynamic | Acoustic (timbre, pitch, content) |
| Control Precision | Low | Very high | High | N/A |
| Abstraction Level | High | Low | High | Low |
| Interpretability | Indirect | Direct | Indirect | Indirect |
| Key Challenges | Ambiguity; semantic–signal gap | No semantics; tedious authoring | Ambiguity; occlusion; compute cost | Lack of spatial cues |

Table 1: Comparative analysis of spatial audio input representations

Figure 3: The input representations and their fundamental processing steps.

2. Spatial Cues and Physical Modeling

2.1 Room Impulse Response (RIR)
| Paper | URL | Code/Dataset |
| --- | --- | --- |
| Few-shot audio-visual learning of environment acoustics | arXiv Paper | GitHub Code |
| SpatialScaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms | arXiv Paper | GitHub Code |
| Novel-view acoustic synthesis from 3D reconstructed rooms | arXiv Paper | GitHub Code |
| A binaural room impulse response database for the evaluation of dereverberation algorithms | IEEE Paper | MathWorks Product |
| The Sweet-Home speech and multimodal corpus for home automation interaction | - | - |
| Dataset of Binaural Room Impulse Responses at Multiple Recording Positions, Source Positions, and Orientations in a Real Room | DAGA 2017 Paper | - |
| dEchorate: a calibrated room impulse response database for echo-aware signal processing | arXiv Paper | GitHub Code |
| BIRD: Big impulse response dataset | arXiv Paper | GitHub Code |
| Visually informed binaural audio generation without binaural audios | arXiv Paper | GitHub Code |
| MeshRIR: A dataset of room impulse responses on meshed grid points for evaluating sound field analysis and synthesis methods | arXiv Paper | GitHub Code |
| Mesh2IR: Neural acoustic impulse response generator for complex 3D scenes | arXiv Paper | GitHub Code |
| A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection | arXiv Paper | DCASE Dataset |
| Acoustic analysis and dataset of transitions between coupled rooms | IEEE Paper | - |
| Dataset of spatial room impulse responses in a variable acoustics room for six degrees-of-freedom rendering and analysis | arXiv Paper | Zenodo Dataset |
| On the authenticity of individual dynamic binaural synthesis | JASA Paper | - |

Table 2: The list of RIR papers and their URL
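The RIR datasets above are typically used by convolving a dry mono signal with a measured or simulated impulse response to place it in a room. A minimal sketch with NumPy/SciPy (the two-channel RIR below is synthetic, for illustration only; real pipelines load responses from the datasets listed above):

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_rir(mono, rir):
    """Convolve a dry mono signal with a multi-channel RIR.

    mono: (n,) dry signal; rir: (channels, taps) impulse responses.
    Returns (channels, n + taps - 1) reverberant audio.
    """
    return np.stack([fftconvolve(mono, h) for h in rir])

# Toy example: a click train through a synthetic 2-channel "room"
# (direct path plus one echo per ear).
sr = 16000
dry = np.zeros(sr)
dry[::4000] = 1.0                      # dry clicks
rir = np.zeros((2, 800))
rir[0, 0], rir[0, 400] = 1.0, 0.3      # left: direct path + echo at 25 ms
rir[1, 40], rir[1, 500] = 0.8, 0.25    # right: delayed, attenuated
wet = apply_rir(dry, rir)
```

FFT-based convolution keeps this cheap even for second-long RIRs; the same call spatializes speech or music once `rir` comes from a real measurement.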
2.2 Head-Related Transfer Function (HRTF)

| Paper | URL | Code/Dataset |
| --- | --- | --- |
| HRTF personalization based on ear morphology | Meta Publication | - |
| On HRTF Notch Frequency Prediction using Anthropometric Features and Neural Networks | arXiv Paper | - |
| Magnitude modeling of personalized HRTF based on ear images and anthropometric measurements | Open Access Paper | - |
| Global HRTF interpolation via learned affine transformation of hyper-conditioned features | arXiv Paper | GitHub Code |
| HRTF recommendation based on the predicted binaural colouration model | IEEE Paper | - |
| Modeling individual head-related transfer functions from sparse measurements using a convolutional neural network | ResearchGate Paper | - |
| Head-related transfer function interpolation from spatially sparse measurements using autoencoder with source position conditioning | arXiv Paper | GitHub Code |
| HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection | arXiv Paper | GitHub Code |
| Spatial upsampling of head-related transfer functions using a physics-informed neural network | arXiv Paper | GitHub Code |
| HRTF field: Unifying measured HRTF magnitude representation with neural fields | arXiv Paper | GitHub Code |
| Head-related transfer function interpolation with a spherical CNN | arXiv Paper | GitHub Code |
| HRTF interpolation using a spherical neural process meta-learner | arXiv Paper | - |
| NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization | arXiv Paper | GitHub Code |

Table 3: The list of HRTF papers and their URL
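At render time, an HRTF set is applied by filtering the source with the head-related impulse response (HRIR) pair measured for its direction. A nearest-neighbour sketch, where the interpolation and upsampling papers above replace the snap-to-grid step; the pure-delay HRIRs here are a synthetic stand-in for a measured set:

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrirs, grid_az_deg, target_az_deg):
    """Render mono audio binaurally using the nearest measured HRIR pair.

    hrirs: (n_directions, 2, taps); grid_az_deg: measured azimuth grid.
    """
    diff = (grid_az_deg - target_az_deg + 180) % 360 - 180  # wrap-aware
    i = int(np.argmin(np.abs(diff)))
    return np.stack([fftconvolve(mono, hrirs[i, 0]),
                     fftconvolve(mono, hrirs[i, 1])])

# Synthetic "HRTF set": pure interaural time delays, one pair per 30°.
grid = np.arange(0, 360, 30)
hrirs = np.zeros((len(grid), 2, 64))
for i, az in enumerate(grid):
    itd = int(round(20 * np.sin(np.deg2rad(az))))  # crude ITD in samples
    hrirs[i, 0, max(0, -itd)] = 1.0                # left ear
    hrirs[i, 1, max(0, itd)] = 1.0                 # right ear

rng = np.random.default_rng(0)
out = binauralize(rng.standard_normal(1024), hrirs, grid, target_az_deg=93.0)
```

For a source near 90° the right-ear channel simply lags the left by the modelled ITD; a real HRIR additionally shapes the spectrum per ear (pinna and head shadowing), which is exactly what the personalization papers above model.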

3. Output Representations

| Attribute | Channel-Based | Scene-Based | Object-Based |
| --- | --- | --- | --- |
| Freedom of Listening Position | Limited | High | Moderate |
| Playback System Dependency | Very high | High | Low |
| Scalability | Low | Moderate | Excellent |
| Playback-End Complexity | Low | High | Moderate |
| Common Formats | Stereo; 5.1/7.1 surround | Ambisonics; wave-field synthesis (WFS) | Dolby Atmos; DTS:X; MPEG-H 3D Audio |

Table 4: Comparative analysis of spatial audio output representations
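For the scene-based column, a mono source is encoded into first-order ambisonics (FOA) by weighting it with the real spherical harmonics of its direction. A sketch assuming the common AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization):

```python
import numpy as np

def foa_gains(az_deg, el_deg):
    """First-order ambisonic encoding gains (ACN/SN3D, i.e. AmbiX)."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([
        1.0,                      # W: omnidirectional
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])

def encode_foa(mono, az_deg, el_deg):
    """Encode a mono signal at (azimuth, elevation) into 4 FOA channels."""
    return foa_gains(az_deg, el_deg)[:, None] * mono[None, :]

foa = encode_foa(np.ones(8), az_deg=90.0, el_deg=0.0)  # source hard left
```

The scene stays decoupled from the playback system: the same 4-channel signal is later decoded to headphones (via HRTFs) or to any loudspeaker layout, which is what gives the scene-based format its high freedom of listening position.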

4. Spatial Audio Understanding Models

4.1 Sound Event Localization and Detection (SELD) Papers
| Paper | URL | Code |
| --- | --- | --- |
| Learning to localize sound source in visual scenes | arXiv Paper | GitHub Code |
| Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning | arXiv Paper | GitHub Code |
| Classification of spatial audio location and content using convolutional neural networks | ResearchGate Paper | - |
| An improved event-independent network for polyphonic sound event localization and detection | arXiv Paper | GitHub Code |
| Sound event localization and detection of overlapping sources using convolutional recurrent neural networks | arXiv Paper | GitHub Code |
| SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection | arXiv Paper | GitHub Code |
| Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training | arXiv Paper | - |
| Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation | arXiv Paper | - |
| Binaural source localization using deep learning and head rotation information | IEEE Paper | - |
| A sequence matching network for polyphonic sound event localization and detection | IEEE Paper | - |
| ACCDOA: Activity-coupled cartesian direction of arrival representation for sound event localization and detection | arXiv Paper | - |
| Binaural sound source distance estimation and localization for a moving listener | IEEE Paper | GitHub Code |
| Audio-visual event localization in unconstrained videos | arXiv Paper | - |
| w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training | arXiv Paper | GitHub Code |
| BAST: Binaural audio spectrogram transformer for binaural sound localization | arXiv Paper | GitHub Code |
| A time-domain unsupervised learning based sound source localization method | IEEE Paper | - |
| SSLIDE: Sound source localization for indoors based on deep learning | arXiv Paper | - |
| Semi-supervised source localization with deep generative modeling | arXiv Paper | - |
| Semi-supervised source localization in reverberant environments with deep generative modeling | IEEE Paper | - |
| Joint measurement of localization and detection of sound events | IEEE Paper | GitHub Code |
| 3D localization of multiple sound sources with intensity vector estimates in single source zones | IEEE Paper | - |
| Self-supervised moving vehicle tracking with stereo sound | arXiv Paper | - |
| A probabilistic model for robust localization based on a binaural auditory front-end | IEEE Paper | - |
| DeepEar: Sound localization with binaural microphones | IEEE Paper | - |
| Towards generating ambisonics using audio-visual cue for virtual reality | IEEE Paper | GitHub Code |
| AD-YOLO: You look only once in training multiple sound event localization and detection | arXiv Paper | - |
| Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy | arXiv Paper | GitHub Code |
| Sound event localization and detection using squeeze-excitation residual CNNs | arXiv Paper | GitHub Code |

Table 5: The list of SELD Papers and their URL
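Several SELD papers above (e.g. the intensity-vector methods) estimate direction of arrival from FOA signals via the acoustic intensity vector, which for a single active source points from the listener toward it. A single-source sketch, assuming AmbiX-style FOA channels:

```python
import numpy as np

def doa_from_foa(w, y, z, x):
    """Estimate (azimuth, elevation) in degrees from FOA signals using
    the pseudo-intensity vector I ~ E[w * (x, y, z)]."""
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    az = np.degrees(np.arctan2(iy, ix))
    el = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return az, el

# Encode noise at azimuth 45° with first-order SN3D gains, then recover it.
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
az_true = np.deg2rad(45.0)
w, y, z, x = s, np.sin(az_true) * s, 0.0 * s, np.cos(az_true) * s
az_est, el_est = doa_from_foa(w, y, z, x)
```

With overlapping sources this raw average breaks down, which is why the papers above refine it with denoising, single-source zone selection, or learned representations such as ACCDOA.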

4.2 Spatial Audio Separation Papers

| Paper | URL | Code |
| --- | --- | --- |
| Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation | IEEE Paper | - |
| Beamforming techniques for multichannel audio signal separation | arXiv Paper | - |
| Deep learning based binaural speech separation in reverberant environments | IEEE Paper | - |
| Source separation based on binaural cues and source model constraints | Columbia University Paper | - |
| Combining spectral and spatial features for deep learning based blind speaker separation | IEEE Paper | - |
| Real-time binaural speech separation with preserved spatial cues | arXiv Paper | - |
| 2.5D visual sound | arXiv Paper | GitHub Code |
| LAVSS: Location-guided audio-visual spatial audio separation | arXiv Paper | GitHub Code |
| The cocktail party robot: Sound source separation and localisation with an active binaural head | IEEE Paper | - |
| The sound of pixels | arXiv Paper | GitHub Code |
| Self-supervised generation of spatial audio for 360 video | arXiv Paper | GitHub Code |
| Multichannel audio source separation with deep neural networks | arXiv Paper | - |
| Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation | IEEE Paper | - |
| Multi-microphone neural speech separation for far-field multi-talker speech recognition | IEEE Paper | - |
| Unsupervised Bayesian Surprise Detection in Spatial Audio with Convolutional Variational Autoencoder and LSTM Model | ACM Paper | - |
| Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation | arXiv Paper | GitHub Code |

Table 6: The list of Spatial Audio Separation Papers and their URL
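The classical front end behind several of the multichannel systems above is the delay-and-sum beamformer: compensate each microphone for the target's propagation delay, then average, so the target adds coherently while interferers from other directions do not. A minimal integer-delay sketch (real arrays use fractional-delay filters and estimated, not known, delays):

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Steer toward a source by undoing each channel's integer-sample
    delay, then averaging across microphones.

    mics: (channels, n) array; delays: per-channel delays in samples.
    Uses circular shifts for simplicity; real implementations pad instead.
    """
    aligned = np.stack([np.roll(ch, -d) for ch, d in zip(mics, delays)])
    return aligned.mean(axis=0)

# The target reaches three mics with delays 0, 3, 6 samples; after
# alignment the average reconstructs it coherently.
rng = np.random.default_rng(0)
target = rng.standard_normal(512)
delays = [0, 3, 6]
mics = np.stack([np.roll(target, d) for d in delays])
out = delay_and_sum(mics, delays)
```

An uncorrelated interferer at a different delay pattern is attenuated by roughly the number of microphones, which is why beamforming remains a common preprocessing step for the neural separators listed above.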

4.3 Joint Learning Papers

| Paper | URL | Code |
| --- | --- | --- |
| Learning representations from audio-visual spatial alignment | arXiv Paper | - |
| Telling left from right: Learning spatial correspondence of sight and sound | arXiv Paper | GitHub Code |
| AV-NeRF: Learning neural fields for real-world audio-visual scene synthesis | arXiv Paper | GitHub Code |
| Learning neural acoustic fields | arXiv Paper | GitHub Code |
| Overview of geometrical room acoustic modeling techniques | JASA Paper | - |
| AV-RIR: Audio-visual room impulse response estimation | arXiv Paper | GitHub Code |
| Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation | IEEE Paper | - |
| Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and A Teacher Model | arXiv Paper | - |
| Few-shot audio-visual learning of environment acoustics | arXiv Paper | GitHub Code |
| Blind room parameter estimation using multiple multichannel speech recordings | arXiv Paper | GitHub Code |
| Visual-based spatial audio generation system for multi-speaker environments | arXiv Paper | - |
| Learning Spatially-Aware Language and Audio Embeddings | arXiv Paper | - |
| BAT: Learning to Reason about Spatial Sounds with Large Language Models | arXiv Paper | GitHub Code |

Table 7: The list of Joint Learning Papers and their URL

5. Spatial Audio Generation Models

| Paper | URL | Code |
| --- | --- | --- |
| A structural model for binaural sound synthesis | IEEE Paper | - |
| Neural synthesis of binaural speech from mono audio | OpenReview Paper | GitHub Code |
| 2.5D visual sound | arXiv Paper | GitHub Code |
| Cyclic Learning for Binaural Audio Generation and Localization | IEEE Paper | - |
| Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention | arXiv Paper | - |
| Geometry-aware multi-task learning for binaural audio generation from video | arXiv Paper | - |
| Multi-attention audio-visual fusion network for audio spatialization | ACM Paper | - |
| Visually Guided Binaural Audio Generation with Cross-Modal Consistency | IEEE Paper | - |
| Interpretable binaural ratio for visually guided binaural audio generation | IEEE Paper | - |
| Cross-modal generative model for visual-guided binaural stereo generation | arXiv Paper | - |
| BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis | arXiv Paper | GitHub Code |
| DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect | arXiv Paper | - |
| Neural Fourier shift for binaural speech rendering | arXiv Paper | GitHub Code |
| Visually informed binaural audio generation without binaural audios | arXiv Paper | GitHub Code |
| Localize to binauralize: Audio spatialization from visual sound source localization | IEEE Paper | GitHub Code |
| Sep-Stereo: Visually guided stereophonic audio generation by associating source separation | arXiv Paper | GitHub Code |
| Exploiting audio-visual consistency with partial supervision for spatial audio generation | arXiv Paper | - |
| Binaural audio generation via multi-task learning | arXiv Paper | GitHub Code |
| End-to-end binaural speech synthesis | arXiv Paper | - |
| Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content | arXiv Paper | - |
| ViSAGe: Video-to-Spatial Audio Generation | arXiv Paper | GitHub Code |
| OmniAudio: Generating Spatial Audio from 360-Degree Video | arXiv Paper | GitHub Code |
| Towards generating ambisonics using audio-visual cue for virtual reality | IEEE Paper | GitHub Code |
| AV-NeRF: Learning neural fields for real-world audio-visual scene synthesis | arXiv Paper | GitHub Code |
| ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model | arXiv Paper | - |
| Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | arXiv Paper | GitHub Code |
| DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model | arXiv Paper | - |
| Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models | arXiv Paper | - |
| Ambisonizer: Neural upmixing as spherical harmonics generation | arXiv Paper | GitHub Code |
| ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting | arXiv Paper | GitHub Code |
| Simple and controllable music generation | arXiv Paper | GitHub Code |
| Moûsai: Text-to-music generation with long-context latent diffusion | arXiv Paper | GitHub Code |
| Long-form music generation with latent diffusion | arXiv Paper | GitHub Code |
| Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes | arXiv Paper | Project Page |
| Immersive spatial audio reproduction for VR/AR using room acoustic modelling from 360 images | IEEE Paper | - |
| See-2-Sound: Zero-shot spatial environment-to-spatial sound | arXiv Paper | GitHub Code |
| SonicMotion: Dynamic Spatial Audio Soundscapes with Latent Diffusion Models | arXiv Paper | - |
| Enhancing spatial audio generation with source separation and channel panning loss | IEEE Paper | - |
| Self-supervised generation of spatial audio for 360 video | arXiv Paper | GitHub Code |
| Sonic4D: Spatial Audio Generation for Immersive 4D Scene Exploration | arXiv Paper | GitHub Code |
| AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting | OpenReview Paper | GitHub Code |

Table 8: The list of Spatial Audio Generation Papers and their URL

6. Spatial Audio Datasets

| Dataset | Format | Collect | Hours | Type | Labels | URL |
| --- | --- | --- | --- | --- | --- | --- |
| Sweet-Home | Multi | Recorded | 47.3 | Speech | Text | - |
| Voice-Home | Multi | Recorded | 2.5 | Speech | Text, Geometric | HAL Paper |
| YT-ALL & REC-STREET | FOA | Crawled | 116.5 | Audio | Video, Text | arXiv Paper |
| FAIR-Play | Binaural | Recorded | 5.2 | Audio | Video | arXiv Paper |
| SECL-UMons | Multi | Recorded | 5 | Audio | Text, Geometric | IEEE Paper |
| YT-360 | FOA | Crawled | 246 | Audio | Video | arXiv Paper |
| EasyCom | Multi | Recorded | 5 | Speech | Geometric, Text | arXiv Paper |
| Binaural_Dataset | Binaural | Recorded | 2 | Speech | Geometric | OpenReview Paper |
| SimBinaural | Binaural | Sim/Crawl | 143 | Audio | Video, Geometric | UT Austin Paper |
| Spatial LibriSpeech | FOA | Simulated | 650 | Speech | Text, Geometric | arXiv Paper |
| STARSS23 | FOA | Recorded | 7.5 | Audio | Video, Geometric | arXiv Paper |
| YT-Ambigen | FOA | Crawled | 142 | Audio | Video | arXiv Paper |
| BEWO-1M | Binaural | Simulated | 2.8k | Audio | Text/Image, Geo | arXiv Paper |
| MRSDrama | Binaural | Recorded | 98 | Speech | Text, Video, Geo | arXiv Paper |

Table 9: The list of Spatial Audio Datasets and their URL

Citations

If you find this code useful in your research, please cite our work:

@misc{zhu2025asaudiosurveyadvancedspatial,
      title={ASAudio: A Survey of Advanced Spatial Audio Research}, 
      author={Zhiyuan Zhu and Yu Zhang and Wenxiang Guo and Changhao Pan and Zhou Zhao},
      year={2025},
      eprint={2508.10924},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.10924}, 
}

