Paper title
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties
Authors
Abstract
In many application domains, such as medicine, information retrieval, cybersecurity, and social media, the datasets used to induce classification models often have an unequal distribution of instances across classes. This situation, known as imbalanced data classification, causes low predictive performance for minority class examples. As a result, the prediction model is unreliable even though its overall accuracy may be acceptable. Oversampling and undersampling techniques are well-known strategies for dealing with this problem by balancing the number of examples in each class. However, their effectiveness depends on several factors, mainly related to intrinsic characteristics of the data, such as imbalance ratio, dataset size and dimensionality, overlap between classes, and borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models that automatically select the best resampling strategy for any dataset based on its characteristics. Because these models are induced from highly varied datasets covering a broad spectrum of conditions, they allow several factors to be examined simultaneously over a wide range of values. This differs from most studies, which analyze the characteristics individually or cover only a small range of values. In addition, the study encompasses both basic and advanced resampling strategies, which are evaluated using eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the most appropriate method to be chosen regardless of the domain, avoiding the search for special-purpose techniques that could be valid for the target data.
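To make the terms in the abstract concrete, the following is a minimal sketch of the imbalance ratio and of the two most basic resampling strategies, random oversampling and random undersampling. It is an illustration only and is not taken from the paper: the function names, the toy data, and the definition of imbalance ratio as majority count divided by minority count are assumptions, and the advanced strategies and selection models studied in the paper are not shown.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority class count divided by minority class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy dataset: 8 majority examples (class 0), 2 minority examples (class 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(imbalance_ratio(y))        # 4.0
print(Counter(y_over))           # both classes now have 8 examples
print(Counter(y_under))          # both classes now have 2 examples
```

Oversampling keeps all of the original data but repeats minority examples, whereas undersampling discards majority examples. Which trade-off works better depends on dataset properties such as those listed in the abstract.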