Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data

Paper title

Cost-Sensitive BERT for Generalisable Sentence Classification with Imbalanced Data

Authors

Harish Tayyar Madabushi, Elena Kochkina, Michael Castelle

Abstract

The automatic identification of propaganda has gained significance in recent years due to technological and social changes in the way news is generated and consumed. That this task can be addressed effectively using BERT, a powerful new architecture which can be fine-tuned for text classification tasks, is not surprising. However, propaganda detection, like other tasks that deal with news documents and other forms of decontextualized social communication (e.g. sentiment analysis), inherently deals with data whose categories are simultaneously imbalanced and dissimilar. We show that BERT, while capable of handling imbalanced classes with no additional data augmentation, does not generalise well when the training and test data are sufficiently dissimilar (as is often the case with news sources, whose topics evolve over time). We show how to address this problem by providing a statistical measure of similarity between datasets and a method of incorporating cost-weighting into BERT when the training and test sets are dissimilar. We test these methods on the Propaganda Techniques Corpus (PTC) and achieve the second-highest score on sentence-level propaganda classification.
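The abstract does not spell out how cost-weighting is incorporated into BERT, so the following is only a minimal sketch of the general idea, assuming the common formulation in which each class carries a weight in the cross-entropy loss computed on the classifier's output logits. The function names, the example logits, and the weight values are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits_batch, labels, class_weights):
    """Class-weighted cross-entropy, averaged over the batch.

    Each example's negative log-likelihood is scaled by the weight of its
    true class, and the sum is normalised by the total weight in the batch.
    This is an illustrative assumption, not the paper's stated method.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits)
        weight = class_weights[label]
        weighted_sum += -weight * math.log(probs[label])
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical two-class batch: class 1 plays the role of the rare
# "propaganda" class, and the classifier favours class 0 on both examples.
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
batch_labels = [0, 1]

uniform_loss = weighted_cross_entropy(batch_logits, batch_labels, [1.0, 1.0])
cost_sensitive_loss = weighted_cross_entropy(batch_logits, batch_labels, [1.0, 4.0])
```

With a larger weight on the minority class, the misclassified minority example dominates the loss, so fine-tuning is pushed harder towards getting that class right. In practice the same weights would be applied to the loss on top of a fine-tuned BERT classifier rather than to hand-written logits.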
