The aim of this task is to detect the temporal activity, spatial position, and class of a known set of sound events immersed in a set of simulated 3D acoustic environments. We consider up to 3 simultaneously active sounds, which may belong to the same class. Models are expected to predict the list of active sound events and their respective locations at regular intervals of 100 milliseconds. We use a joint metric for localization and detection: an F-score based on location-sensitive detection [1].
This metric counts a prediction as a true positive only if the sound class is correctly predicted in a temporal frame and its predicted location lies within a Cartesian distance of at most 1.75 m from the true position.
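The per-frame matching rule above can be sketched as follows. This is only an illustrative sketch, not the official evaluation code: the function names and the `(class_id, position)` tuple layout are assumptions for the example.

```python
import numpy as np

DIST_THRESHOLD = 1.75  # metres, per the task description

def frame_tp_fp_fn(preds, refs):
    """Match predictions to references within one 100 ms frame.

    preds / refs: lists of (class_id, np.array([x, y, z])) tuples.
    A prediction is a true positive only if its class matches an
    as-yet-unmatched reference whose Cartesian distance is <= 1.75 m.
    """
    matched = [False] * len(refs)
    tp = 0
    for p_cls, p_pos in preds:
        for i, (r_cls, r_pos) in enumerate(refs):
            if (not matched[i] and p_cls == r_cls
                    and np.linalg.norm(p_pos - r_pos) <= DIST_THRESHOLD):
                matched[i] = True
                tp += 1
                break
    fp = len(preds) - tp  # predictions with no valid match
    fn = len(refs) - tp   # references left unmatched
    return tp, fp, fn

def f_score(tp, fp, fn):
    # Standard F1 from the accumulated frame-level counts.
    return 2 * tp / (2 * tp + fp + fn) if (tp or fp or fn) else 1.0
```

In the full metric these counts would be accumulated over all frames of the test set before computing the F-score.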
To generate the spatial sound scenes, the measured room impulse responses (IRs) are convolved with clean sound samples belonging to distinct sound classes. The sound event database used for Task 2 is the well-known FSD50K dataset. In particular, we selected 14 classes representative of the sounds that can be heard in an office: computer keyboard, drawer open/close, cupboard open/close, finger snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, female speech, male speech.
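The scene-generation step described above amounts to a multichannel convolution. A minimal sketch, assuming a mono dry sample and a multichannel measured IR (this is not the official generation pipeline, and the function name is invented for illustration):

```python
import numpy as np

def spatialize(dry, rir):
    """Convolve a mono clean sound event with a multichannel room IR.

    dry: shape (n_samples,), the clean sound sample
    rir: shape (n_channels, ir_len), the measured room impulse response
    returns: shape (n_channels, n_samples + ir_len - 1), the event as it
             would be captured at the microphone's channels in that room
    """
    return np.stack([np.convolve(dry, rir[ch]) for ch in range(rir.shape[0])])
```

A full scene would then be built by summing several spatialized events at their scheduled onset times, keeping at most three of them active at once.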
The main characteristics of the L3DAS23 SELD dataset are:
The Task 2 dataset is organized as follows:
where X is in the range [0, 3] in the train set and X = 4 in the dev set; Y is an incremental number; and ov1, ov2, and ov3 stand for a maximum of one, two, or three overlapping sounds, respectively.
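The split, overlap level, and index encoded in each filename can be recovered with a small parser. The exact filename template is not shown above, so the pattern below (`split{X}_ov{N}_{Y}`) is purely an assumption for illustration; adapt the regular expression to the actual naming scheme in the dataset.

```python
import re

# Hypothetical filename template, assumed for this example only:
# "split{X}_ov{N}_{Y}", e.g. "split4_ov3_12.wav".
NAME_RE = re.compile(r"split(?P<X>\d+)_ov(?P<ov>[123])_(?P<Y>\d+)")

def parse_name(name):
    """Extract split, max-overlap level, and incremental index from a filename."""
    m = NAME_RE.search(name)
    if m is None:
        return None
    x, ov, y = int(m["X"]), int(m["ov"]), int(m["Y"])
    # X in [0, 3] -> train set, X = 4 -> dev set
    split = "dev" if x == 4 else "train"
    return {"split": split, "max_overlaps": ov, "index": y}
```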
The Task 2 dataset can be downloaded from Kaggle, either via the Kaggle API or directly from the dataset's web page.
[1] A. Mesaros, S. Adavanne, A. Politis, T. Heittola and T. Virtanen, "Joint Measurement of Localization and Detection of Sound Events," 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 333-337, doi: 10.1109/WASPAA.2019.8937220.