Paper Title

When Can Self-Attention Be Replaced by Feed Forward Layers?

Paper Authors

Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Paper Abstract

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability to capture temporal relationships without being limited by the distance between two related events. However, we note that the range of the learned context progressively increases from the lower to upper self-attention layers, whilst acoustic events often happen within short time spans in a left-to-right order. This leads to a question: for speech recognition, is a global view of the entire sequence still important for the upper self-attention layers in the encoder of Transformers? To investigate this, we replace these self-attention layers with feed forward layers. In our speech recognition experiments (Wall Street Journal and Switchboard), we indeed observe an interesting result: replacing the upper self-attention layers in the encoder with feed forward layers leads to no performance drop, and even minor gains. Our experiments offer insights into how self-attention layers process the speech signal, leading to the conclusion that the lower self-attention layers of the encoder encode a sufficiently wide range of inputs, hence learning further contextual information in the upper layers is unnecessary.
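
The architectural change described in the abstract is simple to prototype. Below is a minimal PyTorch sketch, not the authors' exact implementation: the lower encoder blocks keep multi-head self-attention, while the upper blocks are replaced by position-wise feed-forward blocks with residual connections and layer normalization. The layer counts, model dimension, and module names (HybridEncoder, FeedForwardBlock) are illustrative assumptions.

```python
# Sketch of an encoder whose upper self-attention layers are replaced by
# feed-forward-only blocks. Hyperparameters here are illustrative, not the
# configuration used in the paper.
import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Position-wise feed-forward block (no self-attention) with a
    residual connection and layer normalization."""

    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.ff(x))


class HybridEncoder(nn.Module):
    """Lower layers use self-attention; upper layers are feed-forward only."""

    def __init__(self, d_model=256, n_heads=4, d_ff=1024,
                 n_attn_layers=6, n_ff_layers=6):
        super().__init__()
        self.attn_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                       batch_first=True)
            for _ in range(n_attn_layers)
        ])
        self.ff_layers = nn.ModuleList([
            FeedForwardBlock(d_model, d_ff) for _ in range(n_ff_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Lower layers: self-attention learns the (already wide) context.
        for layer in self.attn_layers:
            x = layer(x)
        # Upper layers: position-wise feed-forward transformations only.
        for layer in self.ff_layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # Dummy batch of acoustic feature frames: (batch, time, d_model).
    feats = torch.randn(2, 100, 256)
    out = HybridEncoder()(feats)
    print(out.shape)  # torch.Size([2, 100, 256])
```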
