Paper Title
DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings
Paper Authors
Paper Abstract
Word embeddings are a core component of modern natural language processing systems, making the ability to thoroughly evaluate them a vital task. We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embeddings. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntactic and semantic relations, namely male to female, singular to dual, singular to plural, antonym, comparative, and genitive to past tense. DiaLex thus consists of a collection of word pairs representing each of the six relations in each of the five dialects. To demonstrate the utility of DiaLex, we use it to evaluate a set of existing Arabic word embeddings as well as new ones that we developed. Our benchmark, evaluation code, and new word embedding models will be publicly available.
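The abstract does not say how the word pairs are scored against an embedding model. One common way to use such a testbank for intrinsic evaluation is the vector-offset analogy test, and the following is a minimal sketch under that assumption, not the DiaLex evaluation code. The function names, the toy two-dimensional vectors, and the English placeholder words are all hypothetical and stand in for real dialectal word pairs and trained embeddings.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_analogies(pairs):
    """Combine every ordered pair of word pairs into a question a:b :: c:d."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


def answer(emb, a, b, c):
    """Return the vocabulary word closest to b - a + c, excluding a, b, and c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def accuracy(emb, pairs):
    """Fraction of analogy questions answered correctly for one relation."""
    # Skip questions containing out-of-vocabulary words (an assumed policy).
    questions = [q for q in build_analogies(pairs) if all(w in emb for w in q)]
    if not questions:
        return 0.0
    correct = sum(answer(emb, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Hypothetical toy data: one "male to female" style relation with two pairs.
toy_emb = {
    "king": [1.0, 0.0],
    "queen": [1.0, 1.0],
    "man": [2.0, 0.0],
    "woman": [2.0, 1.0],
    "apple": [-1.0, -1.0],
}
toy_pairs = [("king", "queen"), ("man", "woman")]
score = accuracy(toy_emb, toy_pairs)
```

In this sketch, accuracy would be computed once per relation and per dialect, which matches the six-by-five layout of word pairs the abstract describes. How out-of-vocabulary words are handled is a design choice that affects comparability across embedding models; here such questions are simply skipped.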