Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings

Paper title

Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings

Authors

Sosuke Nishikawa, Ryokan Ri, Yoshimasa Tsuruoka

Abstract

Unsupervised cross-lingual word embedding (CLWE) methods learn a linear transformation matrix that maps two monolingual embedding spaces that are separately trained with monolingual corpora. This method relies on the assumption that the two embedding spaces are structurally similar, which does not necessarily hold true in general. In this paper, we argue that using a pseudo-parallel corpus generated by an unsupervised machine translation model facilitates the structural similarity of the two embedding spaces and improves the quality of CLWEs in the unsupervised mapping method. We show that our approach outperforms other alternative approaches given the same amount of data, and, through detailed analysis, we show that data augmentation with the pseudo data from unsupervised machine translation is especially effective for mapping-based CLWEs because (1) the pseudo data makes the source and target corpora (partially) parallel; (2) the pseudo data contains information on the original language that helps to learn similar embedding spaces between the source and target languages.
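
To make the idea of a linear map between two embedding spaces concrete, here is a minimal sketch, not the authors' method. It assumes the row-wise word correspondence between the two spaces is already known and solves the orthogonal Procrustes problem in closed form with NumPy; unsupervised mapping methods must additionally induce that correspondence without supervision. The helper name `procrustes_map`, the toy matrices, and the noise level are illustrative assumptions.

```python
import numpy as np

def procrustes_map(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    Assumption for this sketch: row i of X and row i of Y are embeddings
    of a translation pair (a known seed dictionary).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)

# Toy "source" embeddings: 200 words, 16 dimensions.
X = rng.normal(size=(200, 16))

# Toy "target" space that is an exact rotation of the source space,
# i.e. the two spaces are perfectly structurally similar.
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
Y = X @ Q

W = procrustes_map(X, Y)
residual = np.linalg.norm(X @ W - Y)

# Perturb the target space so that it is no longer an exact rotation of
# the source; the best orthogonal map now leaves a larger residual.
Y_noisy = Y + 0.5 * rng.normal(size=Y.shape)
W_noisy = procrustes_map(X, Y_noisy)
residual_noisy = np.linalg.norm(X @ W_noisy - Y_noisy)
```

In this toy setting the map is recovered almost exactly when the two spaces differ only by a rotation, and the fit degrades once that structure is perturbed. This is the sense in which mapping-based CLWEs depend on the structural similarity assumption described in the abstract.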
