Paper Title
An Analysis of Semantically-Aligned Speech-Text Embeddings
Authors
Abstract
Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly those of text, their cross-modal counterparts are less well understood. In this work, we study some intrinsic properties of a joint speech-text embedding space that are informative for several prominent use cases. The space is constructed by minimising the distance between paired utterance and transcription inputs in a teacher-student model setup. We find that incorporating automatic speech recognition through both pretraining and multitask scenarios aids semantic alignment significantly, resulting in more tightly coupled embeddings. To analyse cross-modal embeddings, we utilise a quantitative retrieval accuracy metric for semantic alignment, zero-shot classification for generalisability, and probing of the encoders to observe the extent of knowledge transfer from one modality to another.
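As an illustration of the retrieval accuracy idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that each utterance embedding is matched against all transcription embeddings by cosine similarity and counted as correct when its own paired transcription ranks first. The function names `cosine` and `retrieval_accuracy`, the choice of cosine similarity, and the top-1 criterion are all assumptions made for illustration; the abstract does not specify the exact metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(speech_emb, text_emb):
    """Top-1 speech-to-text retrieval accuracy (hypothetical metric).

    speech_emb[i] and text_emb[i] are assumed to be the embeddings of a
    paired utterance and transcription.
    """
    hits = 0
    for i, s in enumerate(speech_emb):
        # Similarity of utterance i to every transcription embedding.
        sims = [cosine(s, t) for t in text_emb]
        # Index of the most similar transcription.
        nearest = max(range(len(sims)), key=sims.__getitem__)
        if nearest == i:
            hits += 1
    return hits / len(speech_emb)


# Toy data: three paired embeddings in a 3-dimensional space.
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
speech = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.7]]
accuracy = retrieval_accuracy(speech, text)
```

Under this sketch, a more tightly coupled embedding space yields a higher score, since each utterance lies closer to its own transcription than to any other.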