Paper Title
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Paper Authors
Paper Abstract
Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models because it provides multiple benefits. However, this process is based solely on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, pure character-level models, though robust to misspellings, often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models such as BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement for the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We further integrate it with BERT through pre-training while keeping the BERT transformer parameters fixed, thus providing a practical method. Finally, we show that incorporating our module into mBERT significantly improves performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
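The abstract describes char2subword as a character-level encoder whose output vectors stand in for rows of the subword embedding table while the pre-trained transformer stays frozen. The sketch below illustrates one way such a module could look; the class name, the Transformer character encoder, the mean pooling, and all sizes are assumptions made here for illustration, not the authors' actual configuration.

```python
import torch
import torch.nn as nn


class Char2Subword(nn.Module):
    """Hypothetical sketch of a character-to-subword embedding module.

    Builds one embedding vector per subword from that subword's character
    sequence, so it can stand in for a row of a frozen model's subword
    embedding table. Encoder choice and sizes are illustrative assumptions.
    """

    def __init__(self, num_chars, char_dim=64, subword_dim=768,
                 nhead=4, num_layers=2, max_chars=16):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.pos_embed = nn.Embedding(max_chars, char_dim)
        layer = nn.TransformerEncoderLayer(d_model=char_dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(char_dim, subword_dim)

    def forward(self, char_ids):
        # char_ids: (num_subwords, max_chars); id 0 is padding.
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_embed(char_ids) + self.pos_embed(positions)
        pad_mask = char_ids.eq(0)
        h = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padding characters to get one vector per subword.
        h = h.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        lengths = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.proj(h.sum(dim=1) / lengths)  # (num_subwords, subword_dim)


# Toy usage: embed 3 subwords, each given as 16 made-up character ids.
module = Char2Subword(num_chars=1000)
char_ids = torch.randint(1, 1000, (3, 16))
subword_vectors = module(char_ids)  # shape (3, 768), same width as BERT-base
```

In this reading, the frozen transformer would consume the vectors produced above in place of its embedding-table lookups (for example via an `inputs_embeds`-style input), and only the char2subword parameters would be updated during the integration pre-training described in the abstract.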