TamilEmo: Finegrained Emotion Detection Dataset for Tamil

Paper title

TamilEmo: Finegrained Emotion Detection Dataset for Tamil

Authors

Vasantharajan, Charangan; Benhur, Sean; Kumarasen, Prasanna Kumar; Ponnusamy, Rahul; Thangasamy, Sathiyaraj; Priyadharshini, Ruba; Durairaj, Thenmozhi; Sivanraju, Kanchana; Sampath, Anbukkarasi; Chakravarthi, Bharathi Raja; McCrae, John Phillip

Abstract

Emotion analysis of textual input is considered both a challenging and an interesting task in Natural Language Processing. However, the lack of datasets for low-resource languages such as Tamil makes it difficult to conduct high-standard research in this area. We therefore introduce this labelled dataset for emotion recognition: the largest manually annotated dataset of more than 42k Tamil YouTube comments, labelled with 31 emotions including neutral. The goal of this dataset is to improve emotion detection in multiple downstream tasks in Tamil. We have also created three different groupings of our emotions (3-class, 7-class and 31-class) and evaluated model performance on each grouping. Our MuRIL-base model achieved a macro-average F1-score of 0.60 on the 3-class grouping. On the 7-class and 31-class groupings, the Random Forest model performed well, with macro-average F1-scores of 0.42 and 0.29 respectively.
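The abstract reports macro-average F1-scores at three label granularities. As a rough illustration of what that metric measures, the sketch below computes macro-averaged F1 (the unweighted mean of per-class F1) on toy labels, then recomputes it after collapsing fine labels into coarser groups. The label names, the `COARSE` mapping, and the helper functions are hypothetical and chosen only for illustration; they are not the dataset's actual emotion groupings or the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical fine-to-coarse mapping (NOT the TamilEmo grouping).
COARSE = {
    "joy": "positive",
    "love": "positive",
    "anger": "negative",
    "sadness": "negative",
    "neutral": "neutral",
}


def regroup(labels, mapping):
    """Map each fine-grained label to its coarse group."""
    return [mapping[label] for label in labels]


# Toy gold labels and predictions, invented for this example.
y_true = ["joy", "love", "anger", "sadness", "neutral", "joy"]
y_pred = ["love", "love", "anger", "anger", "neutral", "joy"]

fine_score = macro_f1(y_true, y_pred)
coarse_score = macro_f1(regroup(y_true, COARSE), regroup(y_pred, COARSE))
print(f"fine-grained macro F1: {fine_score:.2f}")
print(f"coarse macro F1:       {coarse_score:.2f}")
```

In this toy data the two errors confuse labels that fall in the same coarse group, so the coarse score is higher than the fine-grained one. Because macro averaging weights every class equally, rare classes that a model misses entirely pull the score down as much as frequent ones.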
