Paper Title
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Paper Authors
Paper Abstract
Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models because it provides multiple benefits. However, this process is based solely on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, pure character-level models, though robust to misspellings, often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models such as BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement for the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We further integrate it with BERT through pre-training while keeping the BERT transformer parameters fixed, thus providing a practical method. Finally, we show that incorporating our module into mBERT significantly improves performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
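The abstract describes char2subword as a character-level encoder whose output vectors stand in for rows of the subword embedding table while the pre-trained transformer stays frozen. The sketch below illustrates one way such a module could look; the class name, the Transformer character encoder, the mean pooling, and all sizes are assumptions made here for illustration, not the authors' actual configuration.

```python
import torch
import torch.nn as nn


class Char2Subword(nn.Module):
    """Hypothetical sketch of a character-to-subword embedding module.

    Builds one embedding vector per subword from that subword's character
    sequence, so it can stand in for a row of a frozen model's subword
    embedding table. Encoder choice and sizes are illustrative assumptions.
    """

    def __init__(self, num_chars, char_dim=64, subword_dim=768,
                 nhead=4, num_layers=2, max_chars=16):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.pos_embed = nn.Embedding(max_chars, char_dim)
        layer = nn.TransformerEncoderLayer(d_model=char_dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(char_dim, subword_dim)

    def forward(self, char_ids):
        # char_ids: (num_subwords, max_chars); id 0 is padding.
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_embed(char_ids) + self.pos_embed(positions)
        pad_mask = char_ids.eq(0)
        h = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padding characters to get one vector per subword.
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.proj(h.sum(dim=1) / lengths)  # (num_subwords, subword_dim)


# Toy usage: embed 3 subwords, each given as 16 made-up character ids.
module = Char2Subword(num_chars=1000)
char_ids = torch.randint(1, 1000, (3, 16))
subword_vectors = module(char_ids)  # shape (3, 768), same width as BERT-base
```

In this reading, the frozen transformer would consume the vectors produced above in place of its embedding-table lookups (for example via an `inputs_embeds`-style input), and only the char2subword parameters would be updated during the integration pre-training described in the abstract.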