Paper Title


A little goes a long way: Improving toxic language classification despite data scarcity

Paper Authors

Mika Juuti, Tommi Gröndahl, Adrian Flanagan, N. Asokan

Paper Abstract


Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation - generating new synthetic data from a labeled seed dataset - can help. The efficacy of data augmentation on toxic language classification has not been fully explored. We present the first systematic study on how data augmentation techniques impact performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT - a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed the best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.
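The core idea of data augmentation as described in the abstract is generating new synthetic examples from a labeled seed dataset while preserving each seed example's label. The sketch below illustrates that idea with a simple word-swap/word-deletion scheme (in the spirit of "easy data augmentation" methods); it is not one of the paper's eight techniques, and the function and dataset names are hypothetical.

```python
import random

def augment(sentence, n_new=3, seed=0):
    """Generate synthetic variants of a seed sentence by randomly
    swapping two words and optionally deleting one word.
    Illustrative only; not a technique evaluated in the paper."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = []
    for _ in range(n_new):
        w = words[:]
        # Swap two randomly chosen positions (may coincide).
        i, j = rng.randrange(len(w)), rng.randrange(len(w))
        w[i], w[j] = w[j], w[i]
        # Delete one word with 50% probability, keeping at least one.
        if len(w) > 1 and rng.random() < 0.5:
            del w[rng.randrange(len(w))]
        variants.append(" ".join(w))
    return variants

# Hypothetical scarce seed dataset: (text, toxicity label) pairs.
seed_data = [("you are awful", 1), ("have a nice day", 0)]

# Each synthetic variant inherits the label of its seed example.
augmented = [(variant, label)
             for text, label in seed_data
             for variant in augment(text)]
```

The augmented pairs would then be appended to the seed set before training a classifier, which is how augmentation compensates for label scarcity.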
