The aim of this task is to detect the temporal activity, spatial position and typology of a known set of sound events immersed in a synthetic 3D acoustic environment. We consider up to 3 simultaneously active sounds, which may belong to the same class. Here the models are expected to predict a list of the active sound events and their respective location at regular intervals of 100 milliseconds. We use a joint metric for localization and detection: F-score based on the Location-sensitive detection.
This metric considers a true positive only if a sound class is correctly predicted in a temporal frame and if it its predicted location lies within a fixed cartesian distance from the true position.
To generate the spatial sound scenes the measured room IRs are convolved with clean sound samples belonging to distinct sound classes. The sound event database we used for task 2 is the well-known FSD50K dataset. In particular, we have selected 14 classes, representative of the sounds that can be heard in an office: computer keyboard, drawer open/close, cupboard open/close, finger snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, female speech, male speech.
The main characteristics of the L3DAS21 SELD dataset are: