Paper Title
Variance-reduced Language Pretraining via a Mask Proposal Network
Paper Authors
Paper Abstract
Self-supervised learning, a.k.a. pretraining, is important in natural language processing. Most pretraining methods first randomly mask some positions in a sentence and then train a model to recover the tokens at the masked positions. In this way, the model can be trained without human labeling, and massive amounts of data can be used to train models with billions of parameters. Therefore, optimization efficiency becomes critical. In this paper, we tackle the problem from the viewpoint of gradient variance reduction. In particular, we first propose a principled gradient variance decomposition theorem, which shows that the variance of the stochastic gradient in language pretraining can be naturally decomposed into two terms: the variance that arises from sampling the data in a batch, and the variance that arises from sampling the mask. The second term is the key difference between self-supervised learning and supervised learning, and it is what makes pretraining slower. To reduce the variance of the second part, we leverage an importance sampling strategy, which samples masks according to a proposal distribution instead of the uniform distribution. It can be shown that if the proposal distribution is proportional to the gradient norm, the variance of the sampling is reduced. To improve efficiency, we introduce a MAsk Proposal Network (MAPNet), which approximates the optimal mask proposal distribution and is trained end-to-end along with the model. According to the experimental results, our model converges much faster and achieves better performance than the baseline BERT model.
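To make the decomposition and the importance-sampling claim concrete, here is a minimal sketch in notation of my own (not taken from the paper): write g(x, m) for the stochastic gradient computed on a sampled sentence x and mask m. By the law of total variance,

\mathrm{Var}_{x,m}\big[g(x,m)\big] = \underbrace{\mathrm{Var}_{x}\big[\mathbb{E}_{m}[\,g(x,m)\mid x\,]\big]}_{\text{data sampling}} + \underbrace{\mathbb{E}_{x}\big[\mathrm{Var}_{m}[\,g(x,m)\mid x\,]\big]}_{\text{mask sampling}},

where the first term also appears in supervised training and the second is specific to masked pretraining. For the second term, replacing the uniform mask distribution p(m) with a proposal q(m) and reweighting, \hat{g} = \frac{p(m)}{q(m)}\, g(x,m) with m \sim q, keeps the estimator unbiased, and the standard importance-sampling result is that its variance is minimized when q(m) \propto \|g(x,m)\|, which is the optimum the proposal network is meant to approximate.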
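Below is a minimal, hypothetical PyTorch sketch of proposal-based masking with importance weights. The function and variable names (sample_masks, proposal_logits) are my own, and the actual MAPNet architecture and training procedure are not reproduced here; this only illustrates sampling masks from a learned proposal instead of uniformly.

# Hypothetical sketch: sample mask positions from a proposal distribution
# (e.g. the output of a mask proposal network) instead of uniformly, and
# attach importance weights so the masked-LM gradient stays unbiased.
import torch

def sample_masks(proposal_logits: torch.Tensor, num_masks: int):
    # proposal_logits: (batch, seq_len) unnormalized per-position scores.
    batch, seq_len = proposal_logits.shape
    q = torch.softmax(proposal_logits, dim=-1)                      # proposal distribution q(m)
    positions = torch.multinomial(q, num_masks, replacement=False)  # sampled mask positions
    q_sel = q.gather(1, positions)                                  # q at the sampled positions
    p_uniform = 1.0 / seq_len                                       # uniform masking probability
    # Per-position importance ratio p/q (ignoring without-replacement
    # corrections for brevity); multiply each masked-token loss by this.
    weights = p_uniform / q_sel
    return positions, weights

# Usage: weight each masked position's cross-entropy loss by `weights`
# before averaging, so positions favored by the proposal are not over-counted.
logits = torch.randn(2, 16)   # stand-in for proposal-network outputs (batch of 2, length 16)
positions, weights = sample_masks(logits, num_masks=3)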