Metric: 0.8651
WER: 0.1558
STOI: 0.8860
In this work, the authors propose a new neural beamforming network for B-format 3D multi-channel speech enhancement and recognition. It combines a traditional beamforming structure with a deep neural network designed specifically for B-format (first-order Ambisonics) channels.
More details are available in the official paper “A Neural Beamforming Network for B-Format 3D Speech Enhancement and Recognition”, presented at the MLSP21 conference.
Metric: 0.7563
WER: 0.3132
STOI: 0.8257
In this work, the authors propose a novel approach to 3D speech enhancement that operates directly in the time domain, using Fully Convolutional Networks (FCN). The networks are trained with a custom loss function that combines a perceptual loss, built on top of the wav2vec model, with a soft version of the short-time objective intelligibility (STOI) metric.
More details are available in the official paper “Optimizing Time Domain Fully Convolutional Networks for 3D Speech Enhancement in a Reverberant Environment Using Perceptual Losses”, presented at the MLSP21 conference.
Note: The challenge metric is computed as (STOI+(1-WER))/2
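As a quick check, the formula in the note can be applied to the STOI and WER values listed above. This is a minimal sketch, not the official evaluation code: the function name `challenge_metric` and the entry labels are hypothetical, and it assumes both STOI and WER are already given as fractions between 0 and 1.

```python
def challenge_metric(stoi: float, wer: float) -> float:
    """Combine STOI and WER into one score: (STOI + (1 - WER)) / 2.

    Hypothetical helper for illustration; assumes both inputs lie in [0, 1].
    """
    return (stoi + (1.0 - wer)) / 2.0


# Scores reported above, stored as (STOI, WER) pairs.
# The labels are illustrative, not official entry names.
reported = {
    "neural beamforming network": (0.8860, 0.1558),
    "time-domain FCN": (0.8257, 0.3132),
}

for name, (stoi, wer) in reported.items():
    print(f"{name}: {challenge_metric(stoi, wer):.4f}")
```

The computed values should agree with the Metric figures listed for each entry, up to rounding in the fourth decimal place.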