Paper Title

Word Embeddings: Stability and Semantic Change

Paper Author

Rettenmeier, Lucas

Paper Abstract

Word embeddings are computed by a class of techniques within natural language processing (NLP) that create continuous vector representations of words in a language from a large text corpus. The stochastic nature of the training process of most embedding techniques can lead to surprisingly strong instability, i.e. applying the same technique to the same data twice can produce entirely different results. In this work, we present an experimental study on the instability of the training process of three of the most influential embedding techniques of the last decade: word2vec, GloVe, and fastText. Based on the experimental results, we propose a statistical model to describe the instability of embedding techniques and introduce a novel metric to measure the instability of the representation of an individual word. Finally, we propose a method to minimize the instability, by computing a modified average over multiple runs, and apply it to a specific linguistic problem: the detection and quantification of semantic change, i.e. measuring changes in the meaning and usage of words over time.
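The abstract does not spell out how the "modified average over multiple runs" is computed. A minimal NumPy sketch of one plausible ingredient is shown below: because independently trained embedding spaces can differ by an arbitrary rotation, the runs are first aligned to a reference run via orthogonal Procrustes before averaging. All function names and the toy data here are illustrative assumptions, not the paper's actual procedure.

```python
import numpy as np

def procrustes_align(A, B):
    """Rotate matrix B onto A with the least-squares-optimal orthogonal map."""
    # SVD of the cross-covariance B^T A yields the optimal rotation U V^T
    U, _, Vt = np.linalg.svd(B.T @ A)
    return B @ (U @ Vt)

def average_runs(runs):
    """Align every embedding matrix to the first run, then average them."""
    ref = runs[0]
    aligned = [ref] + [procrustes_align(ref, R) for R in runs[1:]]
    return np.mean(aligned, axis=0)

# Toy demonstration: simulate 5 "training runs" as random rotations of a
# common base matrix plus small noise (1000 words, 50 dimensions).
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 50))
runs = []
for seed in range(5):
    r = np.random.default_rng(seed + 1)
    Q, _ = np.linalg.qr(r.normal(size=(50, 50)))  # random orthogonal matrix
    runs.append(base @ Q + 0.01 * r.normal(size=base.shape))

avg = average_runs(runs)
# After alignment, the average stays close to the reference run's space
print(np.linalg.norm(avg - runs[0]) / np.linalg.norm(runs[0]))
```

Without the alignment step, naively averaging differently rotated runs would cancel the vectors out, which is why a "modified" rather than plain average is needed.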
