Paper Title

Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

Paper Authors

Zhangyin Feng, Duyu Tang, Cong Zhou, Junwei Liao, Shuangzhi Wu, Xiaocheng Feng, Bing Qin, Yunbo Cao, Shuming Shi

Paper Abstract

Standard BERT adopts subword-based tokenization, which may break a word into two or more wordpieces (e.g., converting "lossless" to "loss" and "less"). This brings inconvenience in the following situations: (1) what is the best way to obtain the contextual vector of a word that is divided into multiple wordpieces? (2) how do we predict a word via a cloze test without knowing the number of wordpieces in advance? In this work, we explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces. We refer to such a word-level BERT model as WordBERT. We train models with different vocabulary sizes, initialization configurations, and languages. Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on the cloze test and machine reading comprehension. On many other natural language understanding tasks, including POS tagging, chunking, and NER, WordBERT consistently performs better than BERT. Model analysis indicates that the major advantage of WordBERT over BERT lies in its understanding of low-frequency and rare words. Furthermore, since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets. Lastly, an analysis of inference speed shows that WordBERT has a time cost comparable to BERT's on natural language understanding tasks.
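To make the wordpiece issue concrete, here is a minimal sketch (not taken from the paper) that tokenizes a sentence with a standard wordpiece-based BERT and then mean-pools the subword vectors into a single word vector. The Hugging Face transformers library and the bert-base-uncased checkpoint are assumptions of this illustration, not part of WordBERT.

```python
# Minimal sketch (assumptions: Hugging Face `transformers` and the public
# bert-base-uncased checkpoint; illustration only, not the paper's code).
# It shows the inconvenience the abstract mentions: a word such as "lossless"
# is split into several wordpieces, so its contextual vector has to be pooled
# from multiple subword vectors.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

encoding = tokenizer("lossless compression", return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
print(tokens)  # typically ['[CLS]', 'loss', '##less', 'compression', '[SEP]']

with torch.no_grad():
    hidden = model(**encoding).last_hidden_state[0]  # (seq_len, hidden_size)

# A common workaround: mean-pool the wordpiece vectors that belong to one word.
word_ids = encoding.word_ids(0)  # token index -> source-word index (None for special tokens)
piece_positions = [i for i, w in enumerate(word_ids) if w == 0]  # pieces of "lossless"
word_vector = hidden[piece_positions].mean(dim=0)
print(word_vector.shape)  # torch.Size([768]) for bert-base-uncased
```

WordBERT keeps whole words in its vocabulary, so this pooling step, and guessing the number of wordpieces when filling a cloze blank, is unnecessary.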
