Paper Title


Unveiling Transformers with LEGO: a synthetic reasoning task

Paper Authors

Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner

Paper Abstract


We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to understand some of the attention heads as well as how the information flows in the network. In particular, we have identified a novel association pattern that globally attends only to identical tokens. Based on these observations we propose a hypothesis that here pretraining helps for LEGO tasks due to certain structured attention patterns, and we experimentally verify this hypothesis. We also observe that in some data regime the trained transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's robustness, and moreover we propose ways to prevent it. Motivated by our findings on structured attention patterns, we propose the LEGO attention module, a drop-in replacement for vanilla attention heads. This architectural change significantly reduces FLOPs and maintains or even improves the model's performance at large-scale pretraining.
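To make the task format concrete, below is a minimal Python sketch of how a LEGO-style sample might be generated, assuming the Z_2 ({+1, -1}) formulation where each variable is assigned plus or minus the previous variable in the chain. The function name make_lego_chain and the exact clause syntax are illustrative assumptions, not the authors' data-generation code.

import random
import string

def make_lego_chain(length=6, seed=None):
    """Generate one LEGO-style sample over the group Z_2 ({+1, -1}).

    The first variable is set to +1 or -1; each subsequent variable equals
    + or - the previous one. Resolving every variable's value requires
    following the chain of reasoning. Illustrative sketch only.
    """
    rng = random.Random(seed)
    names = rng.sample(string.ascii_lowercase, length)

    values, clauses = {}, []
    root = rng.choice([1, -1])
    values[names[0]] = root
    clauses.append(f"{names[0]} = {'+' if root == 1 else '-'}1")

    for prev, cur in zip(names, names[1:]):
        sign = rng.choice([1, -1])          # group operation: multiply by +1 or -1
        values[cur] = sign * values[prev]
        clauses.append(f"{cur} = {'+' if sign == 1 else '-'}{prev}")

    rng.shuffle(clauses)                     # clauses appear in scrambled order
    return "; ".join(clauses), values        # input text and ground-truth labels

if __name__ == "__main__":
    text, labels = make_lego_chain(length=5, seed=0)
    print(text)    # e.g. "c = -b; a = +1; b = -a; ..."
    print(labels)  # ground-truth value of each variable

Presenting the clauses in scrambled order is what forces a model to follow the chain rather than read values off left to right, which is the behavior the abstract's "shortcut" discussion refers to.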
