Transformers with Learnable Activation Functions

Paper Title

Transformers with Learnable Activation Functions

Authors

Haishuo Fang, Ji-Ung Lee, Nafise Sadat Moosavi, Iryna Gurevych

Abstract

Activation functions can have a significant impact on reducing the topological complexity of input data and can therefore improve the performance of the model. Selecting a suitable activation function is an essential step in neural model design. However, the choice of activation function is seldom discussed or explored in Transformer-based language models. Their activation functions are chosen beforehand and then remain fixed from pre-training to fine-tuning. As a result, the inductive biases they impose on models cannot be adjusted during this long life cycle. Moreover, subsequently developed models (e.g., RoBERTa, BART, and GPT-3) often follow prior work (e.g., BERT) and use the same activation function without justification. In this paper, we investigate the effectiveness of using the Rational Activation Function (RAF), a learnable activation function, in the Transformer architecture. In contrast to conventional, predefined activation functions, RAFs can adaptively learn optimal activation functions during training according to the input data. Our experiments show that the RAF-based Transformer (RAFT) achieves a lower validation perplexity than a vanilla BERT with the GELU function. We further evaluate RAFT on downstream tasks in low- and full-data settings. Our results show that RAFT outperforms the counterpart model across the majority of tasks and settings. For instance, RAFT outperforms vanilla BERT on the GLUE benchmark by 5.71 points on average in the low-data scenario (where 100 training examples are available) and by 2.05 points on SQuAD in the full-data setting. Analysis of the shapes of the learned RAFs further reveals that they vary substantially between different layers of the pre-trained model and mostly look very different from conventional activation functions. RAFT opens a new research direction for analyzing and interpreting pre-trained models according to their learned activation functions.
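
To make the idea of a rational activation function concrete, the following is a minimal sketch, not the authors' implementation. It assumes the "safe" form P(x) / (1 + |Q(x)|) known from earlier work on Padé activation units. The abstract does not state the exact form, the polynomial degrees, or the initialization, so the function name `rational_activation` and the coefficient lists `a` and `b` are hypothetical. In a trained model the coefficients would be learnable parameters updated by gradient descent; here they are plain numbers.

```python
from typing import Sequence


def rational_activation(x: float, a: Sequence[float], b: Sequence[float]) -> float:
    """Evaluate P(x) / (1 + |Q(x)|) for a scalar input x.

    Assumed form (not taken from the abstract):
      P(x) = a[0] + a[1]*x + a[2]*x**2 + ...
      Q(x) = b[0]*x + b[1]*x**2 + ...   (no constant term)
    The absolute value keeps the denominator >= 1, so it never hits zero.
    """
    # Numerator via Horner's rule, highest-order coefficient first.
    p = 0.0
    for coef in reversed(a):
        p = p * x + coef

    # Denominator polynomial without a constant term, also Horner-style.
    q = 0.0
    for coef in reversed(b):
        q = (q + coef) * x

    return p / (1.0 + abs(q))


# With a = [0, 1] and no denominator terms the function is the identity.
identity_value = rational_activation(2.0, [0.0, 1.0], [])

# Adding one denominator coefficient bends the curve: x / (1 + |x|).
squashed_value = rational_activation(2.0, [0.0, 1.0], [1.0])
```

Changing the coefficients changes the shape of the function, which is what allows different layers to end up with different learned activations.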
