Paper title
Arabic Dialect Identification in the Wild
Paper authors
Paper abstract
We present QADI, an automatically collected dataset of tweets belonging to a wide range of country-level Arabic dialects, covering 18 different countries in the Middle East and North Africa region. Our method for building this dataset relies on applying multiple filters to identify users who belong to different countries based on their account descriptions and to eliminate tweets that are either written in Modern Standard Arabic or contain inappropriate language. The resulting dataset contains 540k tweets from 2,525 users who are evenly distributed across 18 Arab countries. Using intrinsic evaluation, we show that the labels of a set of randomly selected tweets are 91.5% accurate. For extrinsic evaluation, we are able to build an effective country-level dialect identification system on tweets, with a macro-averaged F1-score of 60.6% across 18 classes.
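As a rough illustration of the extrinsic metric named in the abstract, the sketch below computes a macro-averaged F1-score: one F1 per class, then an unweighted mean, so every country counts equally regardless of how many tweets it has. It is not the authors' evaluation code. The function name `macro_f1`, the country codes, and the toy predictions are all hypothetical.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes that appear in `gold`.

    Assumption: a label that occurs only in `pred` is not treated as a class
    of its own; it only adds a false negative to the gold class it replaced.
    """
    labels = sorted(set(gold))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for three of the 18 country classes (illustrative only).
gold = ["EG", "EG", "SA", "SA", "MA", "MA"]
pred = ["EG", "SA", "SA", "SA", "MA", "EG"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {score:.3f}")
```

Because users are evenly distributed across countries, macro-averaging is a natural fit here: a classifier cannot reach a high score by doing well on only a few dominant dialects.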