Paper Title

The Challenge of Diacritics in Yoruba Embeddings

Authors

Adewumi, Tosin P.; Liwicki, Foteini; Liwicki, Marcus

Abstract

The major contributions of this work are the empirical establishment of better performance for Yoruba embeddings trained on an undiacritized (normalized) dataset and the provision of new analogy sets for evaluation. Yoruba, being a tonal language, uses diacritics (tonal marks) in its written form. We show that this affects embedding performance by creating embeddings from exactly the same Wikipedia dataset, with the second copy normalized to be undiacritized. We further compare average intrinsic performance with two other works (using an analogy test set and WordSim), and we obtain the best performance on WordSim and the corresponding Spearman correlation.
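The abstract does not say how the dataset was normalized. As a minimal sketch of what undiacritizing text can look like, the snippet below decomposes each character and drops all Unicode combining marks. The function name `strip_diacritics` is hypothetical, and removing the under-dots (as in ẹ, ọ, ṣ) together with the tone marks is an assumption, not a detail confirmed by the authors.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with all combining marks removed.

    Assumption: both tone marks and under-dots are dropped; the
    paper's actual normalization procedure may differ.
    """
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("Yorùbá"))  # prints "Yoruba"
```

Applying such a function to every line of a corpus before training would yield the undiacritized copy, while the untouched corpus serves as the diacritized baseline.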
