Paper Title

A Twitter BERT Approach for Offensive Language Detection in Marathi

Authors

Tanmay Chavan, Shantanu Patankar, Aditya Kane, Omkar Gokhale, Raviraj Joshi

Abstract

Automated offensive language detection is essential in combating the spread of hate speech, particularly on social media. This paper describes our work on offensive language identification in Marathi, a low-resource Indic language. The problem is formulated as a text classification task that labels a tweet as offensive or non-offensive. We evaluate different monolingual and multilingual BERT models on this classification task, focusing on BERT models pre-trained on social media datasets. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore external data augmentation from other existing Marathi hate speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all the other models when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), with an F1 score of 98.43 on the HASOC 2022 test set. With this, we also provide a new state-of-the-art result on the HASOC 2022 / MOLD v2 test set.
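To make the described setup concrete, below is a minimal sketch of fine-tuning a pre-trained BERT model for binary offensive/non-offensive tweet classification with Hugging Face Transformers. This is not the authors' released code: the hub ID for MahaTweetBERT, the helper names (TweetDataset, fine_tune), and the hyperparameters are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed Hugging Face hub ID for MahaTweetBERT (L3Cube publishes Marathi
# models under the "l3cube-pune/" namespace); swap in the checkpoint under
# evaluation, e.g. MuRIL ("google/muril-base-cased") or MahaBERT.
MODEL_ID = "l3cube-pune/marathi-tweets-bert"


class TweetDataset(Dataset):
    """Tokenized tweets with binary labels: 1 = offensive, 0 = non-offensive."""

    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, max_length=max_len,
                             padding="max_length", return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}


def fine_tune(texts, labels, epochs=3, lr=2e-5, batch_size=16):
    """Standard fine-tuning loop for binary sequence classification."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID,
                                                               num_labels=2)
    loader = DataLoader(TweetDataset(texts, labels, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch).loss  # cross-entropy over the two classes
            loss.backward()
            optimizer.step()
    return model, tokenizer


# The external data augmentation described above amounts to pooling the
# corpora before fine-tuning (dataset loading not shown here):
# model, tokenizer = fine_tune(
#     hasoc21_texts + hasoc22_texts + mahahate_texts,
#     hasoc21_labels + hasoc22_labels + mahahate_labels)
```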
