Paper Title

What do tokens know about their characters and how do they know it?

Paper Authors

Ayush Kaushal, Kyle Mahowald

Paper Abstract

Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT- J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for "cat" encodes that it contains the character "a"). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
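
The probing setup described in the abstract can be sketched in a few lines. The following Python snippet is a minimal illustration, not the authors' exact pipeline: it fits a logistic-regression probe on a PLM's static input-embedding table to predict whether each vocabulary token contains a given letter. The model name ("bert-base-uncased"), the probed character ("a"), and the choice of a linear probe are illustrative assumptions.

```python
# Minimal sketch of a character-presence probe over a PLM's token embeddings.
# Assumptions: `transformers` and `scikit-learn` are installed; the model name,
# probed character, and linear probe are illustrative, not the paper's exact setup.
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "bert-base-uncased"  # any PLM exposing a token embedding table
PROBED_CHAR = "a"                 # the paper trains one binary probe per character

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Static (input) embedding table: one row per vocabulary item.
embeddings = model.get_input_embeddings().weight.detach().numpy()

# Keep purely alphabetical word pieces; strip BERT's "##" continuation prefix.
X, y = [], []
for token, idx in tokenizer.get_vocab().items():
    stripped = token[2:] if token.startswith("##") else token
    if stripped.isalpha() and stripped.isascii():
        X.append(embeddings[idx])
        y.append(int(PROBED_CHAR in stripped.lower()))

# Held-out split over vocabulary items, then fit the linear probe.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy for '{PROBED_CHAR}': {probe.score(X_test, y_test):.3f}")
```

Accuracy well above the majority-class baseline on the held-out tokens would indicate, as the paper argues, that the embeddings encode character-level information despite the tokenizer never exposing characters directly.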
