Paper title
The classification for High-dimension low-sample size data
Paper authors
Abstract
Many applications in fields such as gene expression analysis and computer vision involve high-dimensional, low-sample-size (HDLSS) data sets, which pose great challenges for standard statistical and modern machine learning methods. In this paper, we propose a novel classification criterion for HDLSS data, tolerance similarity, which emphasizes maximizing within-class variance on the premise of class separability. Based on this criterion, we design a novel linear binary classifier, denoted the No-separated Data Maximum Dispersion classifier (NPDMD). The objective of NPDMD is to find a projection direction w along which all training samples scatter over as large an interval as possible. NPDMD has several distinguishing characteristics compared with state-of-the-art classification methods. First, it works well on HDLSS data. Second, it combines sample statistical information and local structural information (support vectors) in the objective function to find the projection direction in the whole feature space. Third, it computes the inverse of a high-dimensional matrix in a low-dimensional space. Fourth, it is relatively simple to implement, being based on quadratic programming. Fifth, it is robust to model specification across various real applications. We derive the theoretical properties of NPDMD. We conduct a series of evaluations on one simulated and six real-world benchmark data sets, including face classification and mRNA classification. In most cases, NPDMD outperforms widely used approaches, or at least obtains comparable results.
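The abstract's third point, inverting a high-dimensional matrix in a low-dimensional space, can be illustrated with a standard linear-algebra identity. The sketch below is not the NPDMD objective, which the abstract does not spell out. It assumes a ridge-regularized least-squares direction as a stand-in and uses the identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹, so that a d × d solve is replaced by an n × n solve when n is much smaller than d. The function names, the regularizer `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np


def ridge_direction_primal(X, y, lam):
    """Solve the d x d system (X^T X + lam * I_d) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def ridge_direction_dual(X, y, lam):
    """Same direction via an n x n system: w = X^T (X X^T + lam * I_n)^{-1} y."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


# Synthetic HDLSS-style data (assumed for illustration): n samples, d features, n << d.
rng = np.random.default_rng(0)
n, d = 20, 500
X = rng.standard_normal((n, d))
y = np.where(rng.standard_normal(n) > 0, 1.0, -1.0)  # labels in {-1, +1}

lam = 1.0
w_primal = ridge_direction_primal(X, y, lam)  # works with a 500 x 500 matrix
w_dual = ridge_direction_dual(X, y, lam)      # works with a 20 x 20 matrix

# Largest elementwise disagreement between the two solutions.
max_abs_diff = float(np.max(np.abs(w_primal - w_dual)))
```

Both functions return the same direction up to floating-point error, but the dual form only factorizes an n × n matrix, which is the kind of saving that matters when d is in the thousands and n is in the tens.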