The L3DAS23 Challenge aims at encouraging and fostering research on machine learning for 3D audio signal processing.
3D audio applications in virtual environments have been gaining increasing interest in the machine learning community in recent years. This field of application is remarkably wide, ranging from virtual and real conferencing to game development, music production, augmented reality and immersive technologies.
This challenge, which extends the two tasks of the L3DAS22 Grand Challenge presented at ICASSP 2022, relies on first-order Ambisonics recordings in reverberant simulated environments, paying special attention to possible augmented reality applications. To this end, L3DAS23 presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection.
Each task is accompanied by a dataset containing recordings and pictures showing the frontal view from the microphone, which may be used to extract visual cues that can enhance the models' performance. Therefore, each task involves two separate tracks, audio-only and audio-visual, each providing two subtracks based on 1-mic and 2-mic recordings.
We expect higher accuracy and reconstruction quality on the audio-visual track, especially when taking advantage of the dual spatial perspective offered by the two microphones.
▪ Nov 28, 2022 - Registration Opening
▪ Dec 15, 2022 - Release of the Training and Development Sets and Documentation
▪ Jan 15, 2023 - Release of the Support Code
▪ Jan 15, 2023 - Release of the Baseline Models
▪ Feb 05, 2023 – Release of the Evaluation Test Set
▪ Feb 10, 2023 – Registration Closing
▪ Feb 19, 2023 – Deadline for Submitting Results (extended from Feb 15, 2023)
▪ Feb 20, 2023, 3:00 a.m. (AoE) – Notification of Top Ranked Teams
▪ Feb 20, 2023 – Deadline for 2-page Paper Submission (Top 5 Ranked Teams Only)
▪ Mar 7, 2023 – Grand Challenge Paper Acceptance Notification
▪ Mar 14, 2023 – Camera-Ready Grand Challenge Papers Deadline
The tasks we propose are:
Task 1 – 3D Speech Enhancement. The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant simulated environment. Here the models are expected to extract the monophonic voice signal from the 3D mixture containing various background noises. The evaluation metric for this task is a combination of short-time objective intelligibility (STOI) and word error rate (WER); a minimal sketch of such a combined score is given below.
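The following Python sketch illustrates one way to compute such a combined score, assuming the combination used in previous L3DAS editions, i.e. (STOI + (1 − WER)) / 2 with both terms bounded in [0, 1]; the function name and the use of the pystoi and jiwer packages are illustrative assumptions, not the official support code.

```python
# A minimal sketch of the Task 1 metric, assuming the combination used in
# previous L3DAS editions: M = (STOI + (1 - WER)) / 2, with both terms in [0, 1].
# The function name and the pystoi/jiwer packages are illustrative choices,
# not the official support code.
import numpy as np
from pystoi import stoi   # short-time objective intelligibility
import jiwer              # word error rate between transcripts


def task1_metric(clean_wav, enhanced_wav, sr, reference_text, predicted_text):
    """Combine STOI (waveform level) and WER (transcript level) into one score."""
    # STOI compares the enhanced waveform against the clean reference signal.
    stoi_score = stoi(clean_wav, enhanced_wav, sr, extended=False)
    # WER compares the transcript of the enhanced speech (obtained with any
    # pretrained ASR model) against the ground-truth transcript; clip to [0, 1]
    # so that the combined metric stays bounded.
    wer = min(jiwer.wer(reference_text, predicted_text), 1.0)
    return (stoi_score + (1.0 - wer)) / 2.0


# Example usage with placeholder signals (16 kHz, 4 s of audio).
if __name__ == "__main__":
    sr = 16000
    clean = np.random.randn(sr * 4)
    enhanced = clean + 0.05 * np.random.randn(sr * 4)
    print(task1_metric(clean, enhanced, sr, "hello world", "hello word"))
```

Higher values are better, with 1.0 corresponding to perfect intelligibility and an error-free transcription.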
Task 2 – 3D Sound Event Localization and Detection. The aim of this task is to detect the temporal activity of a known set of sound event classes and, in particular, to localize them in space. Here the models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated with a location-sensitive detection error, which combines the localization and detection error metrics; a sketch of such a criterion is given below.
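The sketch below shows one possible form of a location-sensitive criterion: within each 100 ms frame a prediction counts as a true positive only when its class matches a ground-truth event and its Cartesian position lies within a spatial threshold. The threshold value, the greedy matching and the F-score aggregation are illustrative assumptions rather than the official metric implementation.

```python
# A minimal sketch of a location-sensitive detection score. The 2.0 m distance
# threshold, the greedy matching and the F-score aggregation are placeholder
# assumptions, not the official L3DAS23 evaluation code.
import numpy as np


def location_sensitive_f1(pred_frames, gt_frames, dist_threshold=2.0):
    """pred_frames / gt_frames: one entry per 100 ms frame, each a list of
    (class_id, xyz) tuples, with xyz a length-3 numpy array in meters."""
    tp = fp = fn = 0
    for preds, gts in zip(pred_frames, gt_frames):
        matched = set()  # indices of ground-truth events already explained
        for cls, pos in preds:
            hit = None
            for i, (g_cls, g_pos) in enumerate(gts):
                if i not in matched and g_cls == cls \
                        and np.linalg.norm(pos - g_pos) <= dist_threshold:
                    hit = i
                    break
            if hit is None:
                fp += 1          # wrong class or too far from any event
            else:
                matched.add(hit)
                tp += 1          # correct class within the spatial threshold
        fn += len(gts) - len(matched)  # ground-truth events left undetected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```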
Each of the above two tasks is supported by an appropriate dataset. The L3DAS23 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of multiple simulated environments, placing two first-order Ambisonics microphones at random points of the rooms and capturing up to 737 room impulse responses in each one. The datasets also contain multiple RGB pictures showing the frontal view from the main microphone.
We aimed at creating plausible and varied 3D scenarios that reflect real-life situations in which sound sources and disparate types of background noise coexist in the same 3D reverberant environment.
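For orientation, the snippet below shows how a single multi-perspective data point of this kind could be loaded, assuming hypothetical file names for the two 4-channel B-format recordings (one per microphone) and the RGB picture; the actual naming and folder layout are defined by the official dataset release.

```python
# A hypothetical loading sketch: two 4-channel B-format WAV files (one per
# Ambisonics microphone) plus an optional RGB picture of the frontal view.
# File names and shapes are illustrative, not the official dataset layout.
import numpy as np
import soundfile as sf
from PIL import Image


def load_example(path_mic_a, path_mic_b=None, image_path=None):
    audio_a, sr = sf.read(path_mic_a)         # shape (samples, 4): W, X, Y, Z
    channels = [audio_a.T]                    # -> (4, samples)
    if path_mic_b is not None:                # optional 2-mic subtrack
        audio_b, _ = sf.read(path_mic_b)
        channels.append(audio_b.T)
    audio = np.concatenate(channels, axis=0)  # (4, samples) or (8, samples)
    image = np.asarray(Image.open(image_path)) if image_path else None
    return audio, sr, image


# Example usage (hypothetical paths):
# audio, sr, image = load_example("example_A.wav", "example_B.wav", "example.png")
```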
As baseline methods, we propose architectures similar to those used as baselines for L3DAS22, specifically adapted for each track. For both tasks, we used only the signals coming from one Ambisonics microphone (mic A), leaving room for experimentation with the dual-mic configuration.