Paper Title
Missing Data Imputation for Classification Problems
Paper Authors
Paper Abstract
Imputation of missing data is a common task in classification problems where the training feature matrix has missing entries. A widely used solution to this imputation problem is the $k$-nearest neighbor (kNN) approach, a lazy learning technique. However, most previous work on missing data does not take into account the presence of the class label in the classification problem. In addition, existing kNN imputation methods use variants of the Minkowski distance as the distance measure, which does not work well with heterogeneous data. In this paper, we propose a novel iterative kNN imputation technique based on a class-weighted grey distance between the missing datum and all the training data. Grey distance works well on heterogeneous data with missing instances. The distance is weighted by mutual information (MI), a measure of the relevance of each feature to the class label. This ensures that the imputation of the training data is directed towards improving classification performance. The class-weighted grey kNN imputation algorithm demonstrates improved performance in both imputation and classification when compared to other kNN imputation algorithms, as well as to standard imputation algorithms such as MICE and missForest. The evaluation is based on simulated scenarios and UCI datasets with various rates of missingness.
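To make the idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It rests on several assumptions that the abstract does not confirm: features are numeric and min-max scaled, MI is estimated with simple equal-width binning, the grey distance is taken to be the standard grey relational coefficient with distinguishing coefficient `rho = 0.5`, neighbors donate their mean value, and a single pass is made instead of the iterative procedure the paper describes. All function and parameter names (`mutual_information`, `min_max_scale`, `grey_knn_impute`, `bins`, `rho`) are hypothetical.

```python
import math
from collections import Counter

def mutual_information(values, labels, bins=4):
    """MI (nats) between an equal-width-binned feature and the class label,
    computed only over rows where the feature is observed (assumed estimator)."""
    pairs = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not pairs:
        return 0.0
    lo = min(v for v, _ in pairs)
    hi = max(v for v, _ in pairs)
    width = (hi - lo) / bins or 1.0  # constant feature: avoid zero width
    binned = [(min(int((v - lo) / width), bins - 1), y) for v, y in pairs]
    n = len(binned)
    joint = Counter(binned)
    p_bin = Counter(b for b, _ in binned)
    p_cls = Counter(y for _, y in binned)
    return sum(c / n * math.log(c * n / (p_bin[b] * p_cls[y]))
               for (b, y), c in joint.items())

def min_max_scale(rows):
    """Scale each column to [0, 1] over its observed values; None stays None.
    Assumes every column has at least one observed value."""
    scaled_cols = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        lo, hi = min(seen), max(seen)
        span = (hi - lo) or 1.0
        scaled_cols.append([None if v is None else (v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def grey_knn_impute(rows, labels, k=3, rho=0.5):
    """Single-pass kNN imputation ranked by an MI-weighted grey relational grade."""
    n_features = len(rows[0])
    # Tiny offset keeps the weights usable when every MI estimate is zero.
    weights = [mutual_information([r[j] for r in rows], labels) + 1e-9
               for j in range(n_features)]
    scaled = min_max_scale(rows)
    completed = [list(r) for r in rows]
    for i, target in enumerate(scaled):
        missing = [j for j, v in enumerate(target) if v is None]
        if not missing:
            continue
        # Absolute differences on features observed in both rows.
        deltas = {}
        for m, other in enumerate(scaled):
            shared = {j: abs(target[j] - other[j]) for j in range(n_features)
                      if target[j] is not None and other[j] is not None}
            if m != i and shared:
                deltas[m] = shared
        all_d = [d for shared in deltas.values() for d in shared.values()]
        if not all_d:
            continue
        d_min, d_max = min(all_d), max(all_d)

        def grade(shared):
            # Weighted mean of grey relational coefficients; higher means closer.
            total = sum(weights[j] for j in shared)
            score = 0.0
            for j, d in shared.items():
                coeff = 1.0 if d_max == 0 else (d_min + rho * d_max) / (d + rho * d_max)
                score += weights[j] * coeff
            return score / total

        ranked = sorted(deltas, key=lambda m: grade(deltas[m]), reverse=True)
        for j in missing:
            # Take the k best-ranked rows that actually observe feature j.
            donors = [rows[m][j] for m in ranked if rows[m][j] is not None][:k]
            if donors:
                completed[i][j] = sum(donors) / len(donors)
    return completed

# Toy data: two well-separated groups, with the second feature missing in two rows.
rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.0],
        [5.0, 50.0], [5.2, 52.0], [4.8, 48.0],
        [1.1, None], [5.1, None]]
labels = [0, 0, 0, 1, 1, 1, 0, 1]
completed = grey_knn_impute(rows, labels, k=3)
print(completed[6], completed[7])
```

In this toy example each incomplete row is filled from the three complete rows of its own group, because those rows have the highest grey relational grade on the shared first feature. A faithful reproduction of the paper would repeat the imputation until the filled values stabilize and would handle categorical features, neither of which this sketch attempts.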