Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Title

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Authors

Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki Onishi, Yasuhiro Oikawa

Abstract

Detecting sound source objects within visual observations is important for autonomous robots to comprehend their surrounding environments. Because sounding objects in our living environments vary widely in appearance, labeling all of them is impossible in practice. This calls for self-supervised learning, which does not require manual labeling. Most conventional self-supervised learning methods use monaural audio signals and images, and cannot distinguish sound source objects with similar appearances because these audio signals carry little spatial information. To solve this problem, this paper presents a self-supervised training method using 360° images and multichannel audio signals. By incorporating the spatial information in multichannel audio signals, our method trains deep neural networks (DNNs) to distinguish multiple sound source objects. Our system for localizing sound source objects in the image is composed of audio and visual DNNs. The visual DNN is trained to localize sound source candidates within an input image. The audio DNN verifies whether each candidate actually produces sound. These DNNs are jointly trained in a self-supervised manner based on a probabilistic spatial audio model. Experimental results with simulated data showed that the DNNs trained by our method localized multiple speakers. We also demonstrate that the visual DNN detected objects including talking visitors and specific exhibits in real data recorded in a science museum.
