ConceptBeam: Concept Driven Target Speech Extraction

Paper title

ConceptBeam: Concept Driven Target Speech Extraction

Authors

Yasunori Ohishi, Marc Delcroix, Tsubasa Ochiai, Shoko Araki, Daiki Takeuchi, Daisuke Niizumi, Akisato Kimura, Noboru Harada, Kunio Kashino

Abstract

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of our scheme, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compare ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
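
The segment-identification step described in the abstract can be illustrated with a minimal sketch. It assumes the concept specifier and each speech segment have already been mapped into the shared embedding space by trained encoders, which are not shown here. The abstract does not say how segments are matched to the concept, so the cosine similarity measure, the fixed threshold, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def select_segments(concept_embedding, segment_embeddings, threshold=0.5):
    """Return indices of segments whose embedding is close to the concept.

    Hypothetical helper: the threshold rule is an assumption, not taken
    from the paper.
    """
    return [
        i
        for i, segment in enumerate(segment_embeddings)
        if cosine_similarity(concept_embedding, segment) >= threshold
    ]


# Toy embeddings, made up for illustration only.
concept = [1.0, 0.0, 0.0]
segments = [
    [0.9, 0.1, 0.0],  # nearly parallel to the concept
    [0.0, 1.0, 0.0],  # orthogonal to the concept
    [0.7, 0.0, 0.7],  # partially related to the concept
]

selected = select_segments(concept, segments, threshold=0.8)
print(selected)
```

In the framework described above, the acoustic characteristics of the selected segments would then guide extraction of the target speech. That stage is outside the scope of this sketch.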
