Paper Title
UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Paper Authors
Paper Abstract
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training on the masked language modeling (MLM) task. Using in-domain data that resembles the actual classification target dataset for this unsupervised MLM step allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks, with and without MLM fine-tuning, on their performance for offensive language detection. Our MLM fine-tuned RoBERTa-based classifier officially ranks 1st in SemEval-2020 Shared Task 12 for the English language. Further experiments with the ALBERT model even surpass this result.
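Below is a minimal sketch of the two-stage approach described in the abstract: unsupervised MLM fine-tuning on in-domain text, followed by supervised fine-tuning of the adapted encoder as an offensive-language classifier. It is not the authors' released code; the Hugging Face Transformers API, the `roberta-base` checkpoint, the toy datasets, and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): MLM domain adaptation, then
# supervised classification fine-tuning, using Hugging Face Transformers.
import torch
from torch.utils.data import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

class TextDataset(Dataset):
    """Wraps tokenized texts (and optional labels) for the Trainer."""
    def __init__(self, texts, labels=None):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=64)
        self.labels = labels
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy placeholder data; in practice this would be unlabeled in-domain tweets
# and the labeled OffensEval training set.
unlabeled = ["example in-domain tweet one", "example in-domain tweet two"]
labeled = ["you are great", "you are an idiot"]
labels = [0, 1]  # 0 = not offensive, 1 = offensive

# Stage 1: further pre-training with masked language modeling (domain adaptation).
mlm_model = RobertaForMaskedLM.from_pretrained("roberta-base")
mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-adapted", num_train_epochs=1),
    train_dataset=TextDataset(unlabeled),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("mlm-adapted")
tokenizer.save_pretrained("mlm-adapted")

# Stage 2: supervised fine-tuning of the domain-adapted encoder as a classifier.
clf_model = RobertaForSequenceClassification.from_pretrained("mlm-adapted", num_labels=2)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="offense-clf", num_train_epochs=3),
    train_dataset=TextDataset(labeled, labels),
)
clf_trainer.train()
```

The key design point the abstract highlights is that Stage 1 uses only unlabeled in-domain text, so the encoder can adapt to the target domain (e.g., social media language) before any labeled offensive-language data is used in Stage 2.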