For Task 1 (SE), we use a beamforming U-Net architecture [1], which provided the best metrics on the SE task of the L3DAS21 Challenge. This network uses a convolutional U-Net to estimate B-format beamforming filters and contains three main modules: an encoder that gradually extracts high-level features, a decoder that reconstructs the input feature size from the encoder output, and skip connections that concatenate each encoder layer with its corresponding decoder layer. The enhancement process follows traditional signal beamforming.
We multiply the complex spectrogram of the noisy B-format signal element-wise with the filters estimated by the U-Net, then sum the result over the channel axis to obtain a single-channel enhanced complex spectrogram. Finally, the ISTFT is applied to recover the enhanced time-domain signal.
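As an illustration, a minimal PyTorch-style sketch of this filter-and-sum step is given below; tensor names, shapes, and STFT parameters are our own assumptions, not the exact implementation of [1].

```python
import torch

def apply_beamforming(noisy_stft: torch.Tensor, filters: torch.Tensor) -> torch.Tensor:
    """Filter-and-sum beamforming in the STFT domain (illustrative sketch).

    noisy_stft: complex STFT of the 4-channel B-format signal,
                shape (batch, channels=4, freq, time).
    filters:    complex beamforming filters estimated by the U-Net,
                same shape as noisy_stft.
    Returns a single-channel enhanced complex spectrogram of shape (batch, freq, time).
    """
    # Element-wise multiplication of each channel with its filter,
    # followed by a sum over the channel axis.
    enhanced_stft = (noisy_stft * filters).sum(dim=1)
    return enhanced_stft

# The enhanced waveform is then recovered with an inverse STFT, e.g.:
# enhanced_wav = torch.istft(enhanced_stft, n_fft=512, hop_length=128)
```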
Figure 1: Representation of the architecture used for Task 1. Source: [1]
With this model we obtained a baseline Task 1 metric of 0.557 on the test set, with a word error rate of 0.57 and a STOI of 0.68.
We adapted this model to the audiovisual task by adding a CNN-based extension whose output features are concatenated along the filters dimension with those generated by the encoder part of the U-Net, as sketched below. Since we did not obtain significantly improved results, for the purposes of the challenge we consider the same baseline values produced in the audio-only case. The visual features, however, allow a slight decrease in the number of epochs required to reach metrics comparable to those of the audio-only track.
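A minimal sketch of this fusion is shown below, assuming a small CNN that maps the image to a feature map matching the spatial size of the encoder output; layer sizes and names are illustrative, not the exact baseline code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExtension(nn.Module):
    """Toy CNN that turns an RGB image into a feature map to be fused
    with the U-Net encoder output (shapes are illustrative)."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, image: torch.Tensor, audio_features: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W); audio_features: (batch, C, F', T')
        visual = self.cnn(image)
        # Resize the visual map to the spatial size of the encoder output
        visual = F.interpolate(visual, size=audio_features.shape[-2:],
                               mode="bilinear", align_corners=False)
        # Concatenate along the filters (channel) dimension
        return torch.cat([audio_features, visual], dim=1)
```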
We are particularly curious to see the methods adopted by participants to obtain highly informative visual features, especially given the imbalance between the number of images and data samples available.
For Task 2, instead, we used a variant of the SELDnet architecture [2], with small changes w.r.t. the version used in the L3DAS22 Challenge. We ported the original Keras implementation to PyTorch and modified its structure to make it compatible with the L3DAS23 dataset.
The objective of this network is to output a continuous estimation (within a fixed temporal grid) of the sounds present in the environment and their respective locations. The original SELDnet architecture is designed to process sound spectrograms (including both magnitude and phase information) and uses a convolutional-recurrent feature extractor based on 3 convolutional layers followed by a bidirectional GRU layer. Finally, the network splits into two separate branches that predict the detection (which classes are active) and location (where the sounds are) information for each target time step.
We augmented the capacity of the network by increasing the number of channels and layers, while maintaining the original data flow. Moreover, we discarded the phase information and performed max-pooling on both the time and frequency dimensions, as opposed to the original implementation, where only frequency-wise max-pooling is performed. In addition, we added the ability to detect multiple sound sources of the same class that may be active at the same time (3 at most in our case). To obtain this behavior we tripled the size of the network's output matrix, so as to predict separate location and detection information for all possible simultaneous sounds of the same class, as in the sketch below.
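For clarity, here is a hedged sketch of the two output branches with the tripled output size; the class count and feature dimensions are placeholders, not the exact baseline hyperparameters.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 14      # placeholder number of sound classes
MAX_OVERLAPS = 3      # up to 3 simultaneous sources of the same class

class SELDHeads(nn.Module):
    """Detection (SED) and localization (DOA) branches applied to each
    time step of the GRU output (feature size is illustrative)."""

    def __init__(self, gru_features: int = 256):
        super().__init__()
        # Detection: one activity value per (class, overlap) pair
        self.sed = nn.Sequential(
            nn.Linear(gru_features, NUM_CLASSES * MAX_OVERLAPS),
            nn.Sigmoid(),
        )
        # Localization: (x, y, z) coordinates per (class, overlap) pair
        self.doa = nn.Linear(gru_features, NUM_CLASSES * MAX_OVERLAPS * 3)

    def forward(self, gru_out: torch.Tensor):
        # gru_out: (batch, time_steps, gru_features)
        return self.sed(gru_out), self.doa(gru_out)
```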
This network obtains a baseline test F-score of 0.147, with a precision of 0.176 and a recall of 0.126.
We adapted this model to the audiovisual task by adding a CNN-based extension whose output features are concatenated with those of our augmented SELDnet just before passing them to the two separate branches, as sketched below. This simple change resulted in an F-score of 0.158, with a precision of 0.182 and a recall of 0.140.
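A minimal sketch of this fusion point, again with illustrative shapes and assuming the visual features have already been projected to one vector per target time step:

```python
import torch

def fuse(audio_features: torch.Tensor, visual_features: torch.Tensor) -> torch.Tensor:
    """Concatenate audio and visual features along the feature dimension.

    audio_features:  (batch, time_steps, audio_dim) from the recurrent block.
    visual_features: (batch, time_steps, visual_dim) from the CNN extension.
    The fused tensor is what the SED and DOA branches receive as input.
    """
    return torch.cat([audio_features, visual_features], dim=-1)
```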
[1] Xinlei Ren, Lianwu Chen, Xiguang Zheng, Chenglin Xu, Xu Zhang, Chen Zhang, Liang Guo, and Bing Yu, "A neural beamforming network for B-format 3D speech enhancement and recognition," in 2021 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), L3DAS21 Challenge track, Oct. 25–28, 2021.
[2] Sharath Adavanne, Archontis Politis, Joonas Nikunen, and Tuomas Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE J. Sel. Top. Signal Process., vol. 13, no. 1, pp. 34–48, 2019.