Paper Title

Multilingual Stance Detection: The Catalonia Independence Corpus

Authors

Elena Zotova, Rodrigo Agerri, Manuel Nuñez, German Rigau

Abstract

Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in recent years, most of the work has focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 Referendum Dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish. Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper addresses these issues by presenting a new multilingual dataset for stance detection on Twitter for the Catalan and Spanish languages, with the aim of facilitating research on stance detection in multilingual and cross-lingual settings. The dataset is annotated with stance towards one topic, namely, the independence of Catalonia. We also provide a semi-automatic method to annotate the dataset based on a categorization of Twitter users. We experiment on the new corpus with a number of supervised approaches, including linear classifiers and deep learning methods. Comparison of our new corpus with the TW-10 dataset shows both the benefits and the potential of a well-balanced corpus for multilingual and cross-lingual research on stance detection. Finally, we establish new state-of-the-art results on the TW-10 dataset, both for Catalan and Spanish.
