Paper Title

On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

Paper Authors

Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Paper Abstract

Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence useful for the upper self-attention encoder layers in Transformers? To investigate this, we train models with lower self-attention/upper feed-forward layers encoders on Wall Street Journal and Switchboard. Compared to baseline Transformers, no performance drop but minor gains are observed. We further developed a novel metric of the diagonality of attention matrices and found the learned diagonality indeed increases from the lower to upper encoder self-attention layers. We conclude the global view is unnecessary in training upper encoder layers.
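
The abstract describes an encoder whose lower layers use self-attention while its upper layers are feed-forward only. The sketch below is a minimal PyTorch illustration of that layout, not the authors' implementation: the layer counts, model dimensions, and the exact design of the feed-forward block are assumptions made for the example.

```python
import torch
import torch.nn as nn


class FeedForwardLayer(nn.Module):
    """Position-wise feed-forward block with a residual connection and layer norm.
    Stands in for the upper encoder layers that have no self-attention (assumed design)."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.dropout(self.ff(x)))


class LowerAttnUpperFFEncoder(nn.Module):
    """Encoder with self-attention restricted to the lower layers."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024,
                 n_attn_layers: int = 6, n_ff_layers: int = 6):
        super().__init__()
        self.lower = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
            for _ in range(n_attn_layers)
        ])
        self.upper = nn.ModuleList([
            FeedForwardLayer(d_model, d_ff) for _ in range(n_ff_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the lower layers mix information across time steps;
        # the upper layers transform each frame independently.
        for layer in self.lower:
            x = layer(x)
        for layer in self.upper:
            x = layer(x)
        return x


# Toy usage: a batch of 2 utterances, 100 frames, 256-dimensional acoustic features.
frames = torch.randn(2, 100, 256)
encoder = LowerAttnUpperFFEncoder()
print(encoder(frames).shape)  # torch.Size([2, 100, 256])
```

In this layout, only the lower self-attention layers integrate context across the sequence, which matches the abstract's question of whether a global view is still needed in the upper layers.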

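The abstract also mentions a novel metric for the diagonality of attention matrices, but does not define it. The function below is one plausible way to score how concentrated attention is around the diagonal, included purely as an illustration; the paper's actual metric may be defined differently.

```python
import numpy as np


def attention_diagonality(attn: np.ndarray) -> float:
    """attn: (T_out, T_in) matrix whose rows are attention distributions (each sums to 1).

    Returns a value in [0, 1]: 1 means every query attends only to its own position,
    while values near 0 mean the attention mass sits far from the diagonal.
    (Illustrative definition, not the paper's.)
    """
    t_out, t_in = attn.shape
    queries = np.arange(t_out)[:, None]  # query (output) positions
    keys = np.arange(t_in)[None, :]      # key (input) positions
    # Expected absolute offset from the diagonal under each row's attention weights,
    # normalised by the largest possible offset.
    expected_offset = (attn * np.abs(queries - keys)).sum(axis=1)
    max_offset = max(t_out, t_in) - 1
    return float(1.0 - expected_offset.mean() / max_offset)


# Quick check: an identity-like attention matrix scores 1.0, a uniform one scores lower.
eye = np.eye(50)
uniform = np.full((50, 50), 1.0 / 50)
print(attention_diagonality(eye), attention_diagonality(uniform))
```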