Paper Title

Byte Pair Encoding is Suboptimal for Language Model Pretraining

Authors

Kaj Bostrom, Greg Durrett

Abstract

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994), the WordPiece method (Schuster and Nakajima, 2012), and unigram language modeling (Kudo, 2018), to segment text. However, to the best of our knowledge, the literature does not contain a direct evaluation of the impact of tokenization on language model pretraining. We analyze differences between BPE and unigram LM tokenization, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE's greedy construction procedure. We then compare the fine-tuned task performance of identical transformer masked language models pretrained with these tokenizations. Across downstream tasks and two languages (English and Japanese), we find that the unigram LM tokenization method matches or outperforms BPE. We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
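The contrast the abstract draws between BPE's greedy merge procedure and the unigram LM's probabilistic segmentation can be inspected directly with the SentencePiece library, which implements both schemes. The sketch below is illustrative only: the corpus path, vocabulary size, and probe words are assumptions, not the paper's experimental configuration.

```python
# Minimal sketch: train a BPE and a unigram LM tokenizer on the same corpus
# with SentencePiece, then compare their segmentations of a few words.
# Assumes a plain-text corpus at "corpus.txt" (hypothetical path); the
# vocabulary size and probe words are chosen for illustration.
import sentencepiece as spm

# Train one tokenizer of each type on the same corpus.
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",            # hypothetical corpus path
        model_prefix=f"tok_{model_type}",
        model_type=model_type,
        vocab_size=8000,               # illustrative vocabulary size
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

# Compare how each scheme splits morphologically complex words;
# the paper's claim is that unigram LM splits align more closely with morphology.
for word in ["unhappiness", "pretraining", "tokenization"]:
    print(word)
    print("  BPE:       ", bpe.encode(word, out_type=str))
    print("  Unigram LM:", uni.encode(word, out_type=str))
```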
