Paper Title

Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model

Paper Authors

Mingzhi Zheng, Dinghan Shen, Yelong Shen, Weizhu Chen, Lin Xiao

Paper Abstract

Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM would lead to undesirably large gradient variance. Thus, we theoretically quantify the gradient variance via correlating the gradient covariance with the Hamming distance between two different masks (given a certain text sequence). To reduce the variance due to the sampling of masks, we propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments. Thereafter, the tokens within one segment are masked for training. We prove, from a theoretical perspective, that the gradients derived from this new masking schema have a smaller variance and can lead to more efficient self-supervised training. We conduct extensive experiments on both continual pre-training and general pre-training from scratch. Empirical results confirm that this new masking strategy can consistently outperform standard random masking. Detailed efficiency analysis and ablation studies further validate the advantages of our fully-explored masking strategy under the MLM framework.
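
Below is a minimal Python sketch (not the authors' code) of the non-overlapping segment idea described in the abstract: the positions of a token sequence are partitioned into a fixed number of segments, and each segment yields one training copy in which only that segment's tokens are masked, so every position is masked exactly once across the copies. The helper name `fully_explored_masks`, the `num_segments` parameter, and the permutation-based equal-size split are assumptions for illustration; the paper's exact segmentation procedure may differ.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol; the actual token depends on the tokenizer


def fully_explored_masks(tokens, num_segments):
    """Sketch of fully-explored masking: split token positions into
    `num_segments` non-overlapping segments and return one masked copy
    of the sequence per segment (assumed segmentation for illustration)."""
    positions = list(range(len(tokens)))
    random.shuffle(positions)  # randomize which positions fall into which segment
    segment_size = max(1, len(positions) // num_segments)
    segments = [positions[i:i + segment_size]
                for i in range(0, len(positions), segment_size)]

    masked_copies = []
    for segment in segments:
        masked = list(tokens)
        for pos in segment:
            masked[pos] = MASK_TOKEN  # mask only this segment's tokens
        masked_copies.append(masked)
    return masked_copies


# Example: the masks are non-overlapping and jointly cover the whole sequence.
print(fully_explored_masks(["the", "cat", "sat", "on", "the", "mat"], num_segments=3))
```

Because the per-segment masks never overlap, the gradients computed from the different masked copies of the same sequence are, according to the abstract's argument, less correlated than those from independently sampled random masks, which is the source of the claimed variance reduction.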
