Paper title
Are All Good Word Vector Spaces Isomorphic?
Paper authors
Paper abstract
Existing algorithms for aligning cross-lingual word vector spaces assume that vector spaces are approximately isomorphic. As a result, they perform poorly or fail completely on non-isomorphic spaces. Such non-isomorphism has been hypothesised to result from typological differences between languages. In this work, we ask whether non-isomorphism is also crucially a sign of degenerate word vector spaces. We present a series of experiments across diverse languages which show that variance in performance across language pairs is not only due to typological differences, but can mostly be attributed to the size of the monolingual resources available, and to the properties and duration of monolingual training (e.g. "under-training").
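To make "approximately isomorphic" concrete, one simple proxy is to check whether translation pairs keep the same pairwise similarity structure in both spaces. The sketch below is illustrative only and is not taken from the paper: the function names (`cosine`, `pearson`, `relational_similarity`, `rotate`) and the toy 2-D vectors are assumptions, and the measure shown is just one of several ways such a check could be set up. It assumes `src_vecs[i]` and `tgt_vecs[i]` are the vectors of a word and its translation.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def relational_similarity(src_vecs, tgt_vecs):
    """Correlate within-space pairwise cosine similarities across two spaces.

    A value near 1.0 means the two spaces share the same similarity
    structure over the given translation pairs (hypothetical helper).
    """
    index_pairs = list(combinations(range(len(src_vecs)), 2))
    src_sims = [cosine(src_vecs[i], src_vecs[j]) for i, j in index_pairs]
    tgt_sims = [cosine(tgt_vecs[i], tgt_vecs[j]) for i, j in index_pairs]
    return pearson(src_sims, tgt_sims)


def rotate(v, angle):
    """Rotate a 2-D vector by `angle` radians (an orthogonal map)."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


# Toy "source language" space with four words.
src = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (-0.6, 0.8)]

# An isomorphic target space: the same geometry, only rotated.
tgt_rotated = [rotate(v, 0.7) for v in src]

# A non-isomorphic target space: two translations have swapped positions,
# so the neighbourhood structure no longer matches.
tgt_distorted = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.6), (-0.6, 0.8)]

score_rotated = relational_similarity(src, tgt_rotated)
score_distorted = relational_similarity(src, tgt_distorted)
print(score_rotated, score_distorted)
```

A rotation preserves every pairwise cosine, so the rotated space scores essentially 1.0 and a linear alignment could recover it; the distorted space scores much lower, which is the situation in which alignment methods that assume isomorphism are expected to struggle.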