Paper Title
Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning
Paper Authors
Paper Abstract
Annotating abusive language is expensive, logistically complex and creates a risk of psychological harm. However, most machine learning research has prioritized maximizing effectiveness (i.e., F1 or accuracy score) rather than data efficiency (i.e., minimizing the amount of data that is annotated). In this paper, we use simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach to substantially raise efficiency whilst still maintaining high effectiveness, especially when abusive content is a smaller percentage of the dataset. This approach requires a fraction of labeled data to reach performance equivalent to training over the full dataset.
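To make the setup concrete, below is a minimal sketch of pool-based active learning with a transformer classifier, in the spirit of the simulated experiments the abstract describes: the full pool is already labeled, and each query "reveals" labels, so annotation cost can be measured as the number of revealed labels. Everything here is an illustrative assumption rather than the authors' exact configuration: the checkpoint (`distilbert-base-uncased`), the entropy acquisition function, and all hyperparameters (seed size, query size, rounds, learning rate) are placeholders.

```python
"""Hedged sketch: transformers-based active learning via uncertainty sampling.
Model name and hyperparameters are illustrative assumptions, not the paper's."""
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-uncased"  # assumed checkpoint, not from the paper
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


def train(texts, labels, epochs=2, lr=2e-5, batch_size=16):
    """Fine-tune a fresh binary classifier on the current labeled set."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2
    ).to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(
                texts[i : i + batch_size],
                padding=True, truncation=True, return_tensors="pt",
            ).to(DEVICE)
            batch["labels"] = torch.tensor(labels[i : i + batch_size]).to(DEVICE)
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return tokenizer, model


@torch.no_grad()
def predictive_entropy(tokenizer, model, texts, batch_size=64):
    """Score unlabeled examples by the entropy of the predicted class distribution."""
    model.eval()
    scores = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(
            texts[i : i + batch_size],
            padding=True, truncation=True, return_tensors="pt",
        ).to(DEVICE)
        probs = model(**batch).logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        scores.extend(entropy.tolist())
    return scores


def active_learning(pool_texts, pool_labels, seed_size=32, query_size=32, rounds=5):
    """Simulated loop: labels of queried items are revealed from the fully
    labeled pool, so the number of revealed labels measures annotation cost."""
    labeled = list(range(seed_size))
    unlabeled = list(range(seed_size, len(pool_texts)))
    for _ in range(rounds):
        tok, model = train(
            [pool_texts[i] for i in labeled], [pool_labels[i] for i in labeled]
        )
        scores = predictive_entropy(tok, model, [pool_texts[i] for i in unlabeled])
        ranked = sorted(zip(scores, unlabeled), reverse=True)  # most uncertain first
        queried = {idx for _, idx in ranked[:query_size]}
        labeled += sorted(queried)
        unlabeled = [i for i in unlabeled if i not in queried]
    return labeled  # indices whose labels were "annotated"
```

Entropy is only one common acquisition function; least-confidence or margin sampling would be a one-line change to `predictive_entropy`. Tracking F1 on a held-out test set after each round is what would produce efficiency-versus-effectiveness curves of the kind the abstract reports.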