Paper Title

A Comparison of Synthetic Oversampling Methods for Multi-class Text Classification

Authors

Glazkova, Anna

Abstract

The authors compare oversampling methods for a multi-class topic classification problem. SMOTE is one of the most popular oversampling algorithms: it selects two examples of a minority class and generates a new example by interpolating between them. In the paper, the authors compare the basic SMOTE method with two of its modifications (Borderline SMOTE and ADASYN) and with random oversampling, using a text classification task as an example. The paper considers the k-nearest neighbors algorithm, the support vector machine algorithm, and three types of neural networks (a feedforward network, long short-term memory (LSTM), and bidirectional LSTM). The authors combine these machine learning algorithms with different text representations and compare the synthetic oversampling methods. In most cases, the use of oversampling techniques significantly improves classification quality. The authors conclude that, for this task, the quality of the KNN and SVM algorithms is more affected by class imbalance than that of the neural networks.
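The SMOTE step described above (pick a minority-class example, pick one of its nearest minority-class neighbors, and interpolate between them) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `smote_sample`, the distance helper, and the toy `minority_class` data are all assumptions made for the example.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smote_sample(minority, k=2, rng=random):
    """Generate one synthetic example SMOTE-style: choose a minority
    example, one of its k nearest minority neighbors, and a random
    point on the line segment between them."""
    base = rng.choice(minority)
    neighbors = sorted((p for p in minority if p is not base),
                       key=lambda p: euclidean(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random interpolation factor in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]

# Toy minority-class data (illustrative only).
minority_class = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
new_point = smote_sample(minority_class)
```

Because the synthetic point is a convex combination of two real minority examples, it always lies between them in feature space; Borderline SMOTE and ADASYN refine the choice of which base examples to interpolate from.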
