Paper Title

What do tokens know about their characters and how do they know it?

Paper Authors

Ayush Kaushal, Kyle Mahowald

Paper Abstract

Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT- J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in a token, based on its embedding (e.g., probing whether the model embedding for "cat" encodes that it contains the character "a"). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. We show that these results generalize to characters from non-Latin alphabets (Arabic, Devanagari, and Cyrillic). Then, through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire English-language character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.
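
The probing setup described in the abstract can be sketched in a few lines. The following Python snippet is a minimal illustration, not the authors' exact pipeline: it fits a logistic-regression probe on a PLM's static input-embedding table to predict whether each vocabulary token contains a given letter. The model name ("bert-base-uncased"), the probed character ("a"), and the choice of a linear probe are illustrative assumptions.

```python
# Minimal sketch of a character-presence probe over a PLM's token embeddings.
# Assumptions: `transformers` and `scikit-learn` are installed; the model name,
# probed character, and linear probe are illustrative, not the paper's exact setup.
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "bert-base-uncased"  # any PLM exposing a token embedding table
PROBED_CHAR = "a"                 # the paper trains one binary probe per character

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

# Static (input) embedding table: one row per vocabulary item.
embeddings = model.get_input_embeddings().weight.detach().numpy()

# Keep purely alphabetical word pieces; strip BERT's "##" continuation prefix.
X, y = [], []
for token, idx in tokenizer.get_vocab().items():
    stripped = token[2:] if token.startswith("##") else token
    if stripped.isalpha() and stripped.isascii():
        X.append(embeddings[idx])
        y.append(int(PROBED_CHAR in stripped.lower()))

# Held-out split over vocabulary items, then fit the linear probe.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy for '{PROBED_CHAR}': {probe.score(X_test, y_test):.3f}")
```

Accuracy well above the majority-class baseline on the held-out tokens would indicate, as the paper argues, that the embeddings encode character-level information despite the tokenizer never exposing characters directly.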
