Paper Title
Multi-Pass Transformer for Machine Translation
Paper Authors
Paper Abstract
In contrast with previous approaches, where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers. To maintain a directed acyclic graph structure, the encoder stack of a transformer is repeated along a new multi-pass dimension, keeping the parameters tied, and information is allowed to proceed unidirectionally both towards deeper layers within an encoder stack and towards any layer of subsequent stacks. We consider both soft (i.e., continuous) and hard (i.e., discrete) connections between parallel encoder stacks, relying on a neural architecture search to find the best connection pattern in the hard case. We perform an extensive ablation study of the proposed MPT architecture and compare it with other state-of-the-art transformer architectures. Surprisingly, a Base Transformer equipped with MPT can surpass the performance of a Large Transformer on the challenging En-De and En-Fr machine translation datasets. In the hard connection case, the optimal connection pattern found for En-De also leads to improved performance for En-Fr.
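The multi-pass idea described above can be sketched in a few lines of NumPy. This is a hypothetical simplification, not the authors' implementation: each encoder layer is stubbed as a residual nonlinear map, parameters are tied across passes, and in later passes each layer's input is a softmax-weighted (soft) combination of the current-pass signal and the outputs of all layers from the previous pass. In the real model the stub would be a full self-attention + feed-forward block, and the connection logits would be learned (or discretized via architecture search in the hard case).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers, n_passes = 8, 3, 2

# Tied parameters: one weight matrix per layer, reused in every pass.
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_layers)]

def layer(x, i):
    """Stub encoder layer (attention + FFN collapsed into a residual map)."""
    return x + np.tanh(x @ W[i])

# Hypothetical connection logits: for each layer of a later pass, weights over
# [input from current pass] + [outputs of all layers in the previous pass].
# In the soft-connection MPT these would be learned; here they are fixed.
logits = rng.standard_normal((n_layers, 1 + n_layers))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def multi_pass_encode(x):
    prev_outputs = None
    for _ in range(n_passes):
        outputs, h = [], x
        for i in range(n_layers):
            if prev_outputs is not None:
                # Soft connection: convex mix of the current-pass input and
                # the previous pass's layer outputs (keeps the graph acyclic,
                # since information only flows into *subsequent* passes).
                w = softmax(logits[i])
                h = w[0] * h + sum(w[j + 1] * prev_outputs[j]
                                   for j in range(n_layers))
            h = layer(h, i)  # same tied parameters in every pass
            outputs.append(h)
        prev_outputs = outputs
    return h

x = rng.standard_normal((5, d))  # 5 "tokens" of dimension d
y = multi_pass_encode(x)
print(y.shape)  # (5, 8)
```

A hard connection would replace the softmax mixture with a single discrete choice of source layer per connection, which is the pattern the paper searches over with NAS.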