BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Paper Title

BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization

Authors

Sheng Kuang, Jie Shi, Kiki van der Heijden, Siamak Mehrkanoon

Abstract

Accurate sound localization in reverberant environments is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been used to model the binaural human auditory pathway. However, CNNs have difficulty capturing global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the sound azimuth in both anechoic and reverberant environments. Two implementation modes are explored: BAST-SP and BAST-NSP, which correspond to the BAST model with shared and non-shared parameters, respectively. Our model with subtraction interaural integration and hybrid loss achieves an angular distance of 1.29 degrees and a Mean Square Error of 1e-3 at all azimuths, significantly surpassing the CNN-based model. An exploratory analysis of BAST's performance in the left and right hemifields and in anechoic and reverberant environments shows its generalization ability as well as the feasibility of binaural Transformers for sound localization. Furthermore, an analysis of the attention maps provides additional insight into the localization process in a natural reverberant environment.
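
The two reported metrics, angular distance and Mean Square Error, can be illustrated with a minimal sketch. It assumes that an azimuth is represented as a point (x, y) on the unit circle, and that angular distance is the angle between the predicted and true direction vectors. The abstract does not state either of these, and the function names below are hypothetical, so this is an illustration of the metrics rather than the authors' implementation.

```python
import math


def azimuth_to_xy(azimuth_deg):
    """Map an azimuth in degrees to a point on the unit circle (assumed representation)."""
    rad = math.radians(azimuth_deg)
    return (math.cos(rad), math.sin(rad))


def angular_distance_deg(pred_xy, true_xy):
    """Angle in degrees between two non-zero 2-D direction vectors."""
    px, py = pred_xy
    tx, ty = true_xy
    norm = math.hypot(px, py) * math.hypot(tx, ty)
    if norm == 0.0:
        raise ValueError("direction vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (px * tx + py * ty) / norm))
    return math.degrees(math.acos(cos_angle))


def mean_square_error(pred_xy, true_xy):
    """Mean of the squared coordinate differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred_xy, true_xy)) / len(true_xy)


if __name__ == "__main__":
    pred = azimuth_to_xy(350.0)
    true = azimuth_to_xy(10.0)
    print(angular_distance_deg(pred, true))
    print(mean_square_error(pred, true))
```

Measuring the error as an angle between vectors handles wrap-around naturally: azimuths of 350 and 10 degrees are 20 degrees apart, whereas a naive difference of the raw angles would give 340.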
