Paper Title
UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection
Paper Authors
Paper Abstract
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training on the masked language modeling (MLM) task. Using in-domain data that resembles the actual classification target dataset for this unsupervised MLM step allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks, with and without MLM fine-tuning, on their performance for offensive language detection. Our MLM fine-tuned RoBERTa-based classifier officially ranks 1st in SemEval-2020 Shared Task 12 for the English language. Further experiments with the ALBERT model even surpass this result.
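Below is a minimal sketch of the two-stage approach described in the abstract: unsupervised MLM fine-tuning on in-domain text, followed by supervised fine-tuning of the adapted encoder as an offensive-language classifier. It is not the authors' released code; the Hugging Face Transformers API, the `roberta-base` checkpoint, the toy datasets, and all hyperparameters are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): MLM domain adaptation, then
# supervised classification fine-tuning, using Hugging Face Transformers.
import torch
from torch.utils.data import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

class TextDataset(Dataset):
    """Wraps tokenized texts (and optional labels) for the Trainer."""
    def __init__(self, texts, labels=None):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=64)
        self.labels = labels
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[i])
        return item

# Toy placeholder data; in practice this would be unlabeled in-domain tweets
# and the labeled OffensEval training set.
unlabeled = ["example in-domain tweet one", "example in-domain tweet two"]
labeled = ["you are great", "you are an idiot"]
labels = [0, 1]  # 0 = not offensive, 1 = offensive

# Stage 1: further pre-training with masked language modeling (domain adaptation).
mlm_model = RobertaForMaskedLM.from_pretrained("roberta-base")
mlm_trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm-adapted", num_train_epochs=1),
    train_dataset=TextDataset(unlabeled),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
mlm_trainer.train()
mlm_trainer.save_model("mlm-adapted")
tokenizer.save_pretrained("mlm-adapted")

# Stage 2: supervised fine-tuning of the domain-adapted encoder as a classifier.
clf_model = RobertaForSequenceClassification.from_pretrained("mlm-adapted", num_labels=2)
clf_trainer = Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="offense-clf", num_train_epochs=3),
    train_dataset=TextDataset(labeled, labels),
)
clf_trainer.train()
```

The key design point the abstract highlights is that Stage 1 uses only unlabeled in-domain text, so the encoder can adapt to the target domain (e.g., social media language) before any labeled offensive-language data is used in Stage 2.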