Paper Title

Word Rotator's Distance

Authors

Sho Yokoi, Ryo Takahashi, Reina Akama, Jun Suzuki, Kentaro Inui

Abstract

A key principle in assessing textual similarity is measuring the degree of semantic overlap between two texts by considering word alignment. Such alignment-based approaches are intuitive and interpretable; however, they are empirically inferior to the simple cosine similarity between general-purpose sentence vectors. To address this issue, we focus on and demonstrate the fact that the norm of a word vector is a good proxy for word importance, and the angle between word vectors is a good proxy for word similarity. Alignment-based approaches do not distinguish the two, whereas sentence-vector approaches automatically use the norm as the word importance. Accordingly, we propose a method that first decouples word vectors into their norm and direction, and then computes alignment-based similarity using earth mover's distance (i.e., optimal transport cost), which we refer to as word rotator's distance. In addition, we find how to "grow" the norm and direction of word vectors (vector converter), which is a new systematic approach derived from sentence-vector estimation methods. On several textual similarity datasets, the combination of these simple proposed methods outperformed not only alignment-based approaches but also strong baselines. The source code is available at https://github.com/eumesy/wrd.
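
The decoupling step described in the abstract can be illustrated with a small sketch. It assumes that each word's normalized norm is used as its transport mass and that the cosine distance between word directions is used as the transport cost; this is one reading of the abstract, not code taken from the repository above. The function names (`decouple`, `transport_cost`, `word_rotators_distance`) are hypothetical, and the toy solver (successive shortest augmenting paths) is a stand-in for a proper optimal transport library, suitable only for short inputs.

```python
import math


def decouple(vec):
    """Split a word vector into its norm (importance) and unit direction (meaning)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return norm, [x / norm for x in vec]


def transport_cost(a, b, cost, eps=1e-12):
    """Minimum cost of moving mass a (rows) onto mass b (columns).

    Toy solver: successive shortest augmenting paths with Bellman-Ford.
    Assumes sum(a) == sum(b) and non-negative costs.
    """
    n, m = len(a), len(b)
    supply, demand = list(a), list(b)
    flow = [[0.0] * m for _ in range(n)]
    total = 0.0
    while True:
        # Shortest residual distances, starting from rows with spare supply.
        dl = [0.0 if supply[i] > eps else math.inf for i in range(n)]
        dr = [math.inf] * m
        pl, pr = [-1] * n, [-1] * m  # predecessors for path recovery
        for _ in range(n + m):
            changed = False
            for i in range(n):
                for j in range(m):
                    # Forward edge: send more mass from row i to column j.
                    if dl[i] + cost[i][j] < dr[j] - eps:
                        dr[j], pr[j], changed = dl[i] + cost[i][j], i, True
                    # Backward edge: undo mass already sent from i to j.
                    if flow[i][j] > eps and dr[j] - cost[i][j] < dl[i] - eps:
                        dl[i], pl[i], changed = dr[j] - cost[i][j], j, True
            if not changed:
                break
        targets = [j for j in range(m) if demand[j] > eps and dr[j] < math.inf]
        if not targets:
            return total
        t = min(targets, key=lambda k: dr[k])
        # Walk back from the target column to a row with spare supply.
        path, amount, j = [], demand[t], t
        while True:
            i = pr[j]
            path.append((i, j, 1.0))
            if pl[i] == -1:
                break
            j = pl[i]
            path.append((i, j, -1.0))
            amount = min(amount, flow[i][j])
        amount = min(amount, supply[i])
        for r, c, sign in path:
            flow[r][c] += sign * amount
            total += sign * amount * cost[r][c]
        supply[i] -= amount
        demand[t] -= amount


def word_rotators_distance(sent_a, sent_b):
    """Alignment-based distance between two sentences given as lists of word vectors."""
    norms_a, dirs_a = zip(*(decouple(v) for v in sent_a))
    norms_b, dirs_b = zip(*(decouple(v) for v in sent_b))
    # Assumption: mass of each word = its norm, normalized to sum to 1.
    mass_a = [x / sum(norms_a) for x in norms_a]
    mass_b = [x / sum(norms_b) for x in norms_b]
    # Assumption: cost = cosine distance between unit directions (clamped at 0).
    cost = [[max(0.0, 1.0 - sum(p * q for p, q in zip(u, v))) for v in dirs_b]
            for u in dirs_a]
    return transport_cost(mass_a, mass_b, cost)


if __name__ == "__main__":
    # Made-up 2-d "word vectors": the long vector dominates the alignment.
    print(word_rotators_distance([[3.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]))
```

In this toy setting, rescaling a word vector changes only how much mass the word carries, while rotating it changes only how costly it is to align with other words, which is the separation of importance and similarity that the abstract describes.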
