论文标题
对不同采样技术对分类不平衡分类的疗效的经验分析
An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification
论文作者
论文摘要
从不平衡数据中学习是一项具有挑战性的任务。在进行不平衡数据训练时,标准分类算法的性能往往差。需要通过修改数据分布或重新设计基础分类算法以实现理想的性能来采用一些特殊的策略。现实世界中数据集中不平衡的流行率导致为班级不平衡问题创造了多种策略。但是,并非所有策略在不同的不平衡情况下都有用或提供良好的性能。处理不平衡的数据有许多方法,但是尚未进行此类技术的功效或实验比较。在这项研究中,我们对26种流行抽样技术进行了全面分析,以了解它们在处理不平衡数据时的有效性。在50个数据集上进行了严格的实验,这些数据集具有不同程度的失衡,以彻底研究这些技术的性能。已经提出了对技术的优势和局限性的详细讨论,以及如何克服此类局限性。我们确定了影响采样策略的一些关键因素,并提供有关如何为特定应用选择合适的抽样技术的建议。
Learning from imbalanced data is a challenging task. Standard classification algorithms tend to perform poorly when trained on imbalanced data. Some special strategies need to be adopted, either by modifying the data distribution or by redesigning the underlying classification algorithm to achieve desirable performance. The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue. However, not all the strategies are useful or provide good performance in different imbalance scenarios. There are numerous approaches to dealing with imbalanced data, but the efficacy of such techniques or an experimental comparison among those techniques has not been conducted. In this study, we present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data. Rigorous experiments have been conducted on 50 datasets with different degrees of imbalance to thoroughly investigate the performance of these techniques. A detailed discussion of the advantages and limitations of the techniques, as well as how to overcome such limitations, has been presented. We identify some critical factors that affect the sampling strategies and provide recommendations on how to choose an appropriate sampling technique for a particular application.