Paper Title
Sparse Centroid-Encoder: A Nonlinear Model for Feature Selection
Authors
Abstract
Autoencoders have been widely used as a nonlinear tool for data dimensionality reduction. While autoencoders do not use label information, Centroid-Encoders (CE)\cite{ghosh2022supervised} use the class label in their learning process. In this study, we propose a sparse optimization using the Centroid-Encoder architecture to determine a minimal set of features that discriminate between two or more classes. The resulting algorithm, Sparse Centroid-Encoder (SCE), extracts discriminatory features in groups using a sparsity-inducing $\ell_1$-norm while mapping each point to its class centroid. One key attribute of SCE is that it can extract informative features from multi-modal data sets, i.e., data sets whose classes appear to have multiple clusters. The algorithm is applied to a wide variety of real-world data sets, including single-cell data, high-dimensional biological data, image data, speech data, and accelerometer sensor data. We compared our method to various state-of-the-art feature selection techniques, including supervised Concrete Autoencoders (SCAE), Feature Selection Network (FsNet), Deep Feature Selection (DFS), Stochastic Gates (STG), and LassoNet. We empirically showed that SCE features often produced better classification accuracy than those of the other methods on sequestered test sets.
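To make the idea in the abstract concrete, the following is a minimal NumPy sketch of a centroid-mapping loss with an $\ell_1$ penalty on per-feature weights. It is an illustration under stated assumptions, not the formulation used in the paper: the per-feature weight vector `w`, the `decoder` callable, the penalty strength `lam`, the threshold `tol`, and the use of a single centroid per class are all hypothetical choices made for this example.

```python
import numpy as np


def class_centroids(X, y):
    """Return a dict mapping each class label to the mean of its samples."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}


def sce_style_loss(X, y, w, decoder, lam):
    """Centroid-reconstruction error plus an l1 penalty on feature weights.

    w       : per-feature weights, shape (d,) (hypothetical sparsity layer)
    decoder : callable mapping weighted inputs back to input space (assumed)
    lam     : strength of the l1 penalty (assumed hyperparameter)
    """
    centroids = class_centroids(X, y)
    # Each sample's target is the centroid of its own class.
    targets = np.stack([centroids[label] for label in y])
    outputs = decoder(X * w)
    reconstruction = 0.5 * np.sum((targets - outputs) ** 2)
    return reconstruction + lam * np.sum(np.abs(w))


def selected_features(w, tol=1e-6):
    """Indices of features whose weight magnitude exceeds tol."""
    return np.flatnonzero(np.abs(w) > tol)
```

In this sketch, features whose weights are driven to (near) zero by the penalty would be discarded, and the surviving indices would form the selected feature set. In practice `w` and the decoder would be trained jointly by gradient descent; that step is omitted here.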