Paper Title
End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection
Paper Authors
Paper Abstract
This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and the transcription of very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension to hybrid CTC/attention architectures. Unlike an attention-based architecture, input-synchronous label prediction can be performed with a greedy search over the CTC (pre-)softmax output. This prediction includes long runs of consecutive blank labels, which can be regarded as non-speech regions. We use these labels as a cue for detecting speech segments with simple thresholding. The threshold value corresponds directly to the length of a non-speech region, which is more intuitive and easier to control than conventional VAD hyperparameters. Experimental results on unsegmented data show that the proposed method outperforms baselines using conventional energy-based and neural-network-based VAD methods and achieves a real-time factor (RTF) of less than 0.2. The proposed method is publicly available.
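The core idea described in the abstract — greedy-decoding the CTC output frame by frame and treating a sufficiently long run of consecutive blank labels as a non-speech boundary — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the blank index, the function name `segment_by_blanks`, and the dummy input are assumptions for the example.

```python
import numpy as np

BLANK = 0  # assumed CTC blank index

def segment_by_blanks(scores: np.ndarray, blank_threshold: int):
    """Return (start, end) frame-index pairs of detected speech segments.

    scores: (T, V) per-frame CTC (pre-)softmax scores.
    blank_threshold: minimum run of blank frames regarded as non-speech;
        this single value directly controls the tolerated pause length.
    """
    # Input-synchronous greedy label prediction: argmax per frame.
    labels = scores.argmax(axis=1)
    segments = []
    start = None     # start frame of the segment currently being built
    blank_run = 0    # length of the current run of consecutive blanks
    for t, lab in enumerate(labels):
        if lab == BLANK:
            blank_run += 1
            # Close the open segment once the blank run is long enough.
            if start is not None and blank_run >= blank_threshold:
                segments.append((start, t - blank_run + 1))
                start = None
        else:
            if start is None:
                start = t
            blank_run = 0
    if start is not None:  # segment still open at the end of the input
        segments.append((start, len(labels)))
    return segments

# Toy frame labels: speech, 4 blanks, speech, trailing blank.
frame_labels = [1, 1, 0, 0, 0, 0, 2, 2, 0]
toy_scores = np.eye(3)[frame_labels]  # one-hot scores, argmax == labels
print(segment_by_blanks(toy_scores, blank_threshold=3))  # → [(0, 2), (6, 9)]
```

The single hyperparameter `blank_threshold` is what the abstract argues for: it maps directly to a pause duration (threshold × frame shift), unlike energy thresholds or neural VAD scores.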