The L3DAS22 Challenge aims at encouraging and fostering research on machine learning for 3D audio signal processing.
3D audio has been attracting growing interest in the machine learning community in recent years. The range of applications is incredibly wide, extending from virtual and real-world conferencing to autonomous driving, surveillance and many more. In these contexts, a fundamental step is to properly identify the nature of the events present in a soundscape, their spatial position and, where needed, to remove unwanted noise that can interfere with the useful signal. To this end, the L3DAS22 Challenge presents two tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on first-order Ambisonics recordings in reverberant office environments.
Each task involves two separate tracks: 1-mic and 2-mic, containing sounds acquired by a single first-order Ambisonics microphone and by an array of two such microphones, respectively. The use of two Ambisonics microphones represents one of the main novelties of the L3DAS22 Challenge. We expect higher accuracy/reconstruction quality when taking advantage of the dual spatial perspective of the two microphones. Moreover, we are very interested in identifying other possible advantages of this configuration over standard Ambisonics formats.
▪ Nov 22, 2021 – Release of the Training and Development Sets, Code, Baseline Methods and Documentation
▪ Jan 5, 2022 – Release of the Evaluation Test Set
▪ Jan 5, 2022 – Registration Closing
▪ Jan 10, 2022 – Deadline for Submitting Results for Both Tasks
▪ Jan 20, 2022 – Notification of Top Ranked Teams
▪ Jan 31, 2022 – Deadline for Paper Submission (Top 5 Ranked Teams Only)
▪ Feb 10, 2022 – Grand Challenge Paper Acceptance Notification
▪ Feb 16, 2022 – Camera-Ready Grand Challenge Papers Deadline
▪ May 7, 2022 – Virtual session at IEEE ICASSP 2022
▪ May 22, 2022 – Opening of the IEEE ICASSP 2022 and Winner Announcement
The tasks we propose are:
Task 1: 3D Speech Enhancement (SE). The objective of this task is the enhancement of speech signals immersed in the spatial sound field of a reverberant office environment. Models are expected to extract the monophonic voice signal from a 3D mixture containing various background noises. The evaluation metric for this task is a combination of short-time objective intelligibility (STOI) and word error rate (WER).
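As a concrete reference, here is a minimal sketch of such a combined metric. It assumes the combination is the mean of STOI and (1 − WER) and uses the pystoi and jiwer packages; the official challenge code defines the exact computation.

```python
# Minimal sketch of a combined STOI/WER metric for Task 1, assuming the
# score is the mean of STOI and (1 - WER); the official challenge code
# defines the exact form.
import jiwer                 # pip install jiwer
from pystoi import stoi      # pip install pystoi

def task1_metric(clean, enhanced, fs, reference_text, transcribed_text):
    """Combine intelligibility (STOI) and transcription accuracy (1 - WER)."""
    stoi_score = stoi(clean, enhanced, fs, extended=False)            # in [0, 1]
    wer_score = min(jiwer.wer(reference_text, transcribed_text), 1.0)  # clip to 1
    return (stoi_score + (1.0 - wer_score)) / 2.0
```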
Task 2: 3D Sound Event Localization and Detection (SELD). The aim of this task is to detect the temporal activity of a known set of sound event classes and, in particular, to localize them in space. Models must predict a list of the active sound events and their respective locations at regular intervals of 100 milliseconds. Performance on this task is evaluated according to a location-sensitive detection error, which combines the localization and detection error metrics.
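The sketch below illustrates one plausible form of such a location-sensitive score: a prediction counts as a true positive only when its class matches a ground-truth event and its predicted position falls within a spatial threshold. The 2 m threshold and the per-frame data layout are illustrative assumptions, not the official evaluation code.

```python
# Sketch of a location-sensitive detection F-score: a prediction is a
# true positive only if its class matches AND its predicted position is
# within `threshold` meters of the ground truth.
# Threshold and data layout are illustrative assumptions.
import numpy as np

def location_sensitive_f1(pred_frames, gt_frames, threshold=2.0):
    """pred_frames, gt_frames: lists (one entry per 100 ms frame) of
    [(class_id, np.array([x, y, z])), ...] event lists."""
    tp = fp = fn = 0
    for preds, gts in zip(pred_frames, gt_frames):
        matched = [False] * len(gts)
        for cls, pos in preds:
            hit = False
            for i, (g_cls, g_pos) in enumerate(gts):
                if (not matched[i] and g_cls == cls
                        and np.linalg.norm(g_pos - pos) <= threshold):
                    matched[i] = hit = True
                    break
            tp += hit
            fp += not hit
        fn += matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```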
The L3DAS22 dataset contains multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a loudspeaker reproducing an analytic test signal across 252 fixed spatial positions.
We aimed at creating plausible and varied 3D scenarios that reflect real-life situations in which target sounds and disparate types of background noise coexist in the same 3D reverberant environment.
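For the 2-mic track, a natural input representation is the concatenation of the B-format channels of both microphones into an 8-channel signal, as in the sketch below. It assumes each recording is distributed as one 4-channel WAV per microphone; file paths and channel ordering are assumptions to be checked against the dataset documentation.

```python
# Sketch: build an 8-channel input for the 2-mic track by stacking the
# B-format (4-channel) recordings of the two Ambisonics microphones.
# File naming and channel ordering are assumptions; check the dataset docs.
import numpy as np
import soundfile as sf   # pip install soundfile

def load_dual_mic(path_mic_a, path_mic_b):
    wav_a, fs_a = sf.read(path_mic_a)      # shape: (samples, 4)
    wav_b, fs_b = sf.read(path_mic_b)
    assert fs_a == fs_b, "both microphones should share one sample rate"
    n = min(len(wav_a), len(wav_b))        # guard against length mismatches
    return np.concatenate([wav_a[:n], wav_b[:n]], axis=1), fs_a  # (samples, 8)
```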