Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

Paper title

Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

Authors

Villatoro-Tello, Esaú; Madikeri, Srikanth; Zuluaga-Gomez, Juan; Sharma, Bidisha; Sarfjoo, Seyyed Saeed; Nigmatulina, Iuliia; Motlicek, Petr; Ivanov, Alexei V.; Ganapathiraju, Aravind

Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve in comparison to the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, which makes them a recommended alternative to overcome the limitations of working with automatically generated transcripts.
