Paper Title
Learning Self-Expression Metrics for Scalable and Inductive Subspace Clustering
Authors
Abstract
Subspace clustering has established itself as a state-of-the-art approach to clustering high-dimensional data. In particular, methods relying on the self-expressiveness property have recently proved especially successful. However, they suffer from two major shortcomings: First, a quadratic-size coefficient matrix is learned directly, preventing these methods from scaling beyond small datasets. Secondly, the trained models are transductive and thus cannot be used to cluster out-of-sample data unseen during training. Instead of learning self-expression coefficients directly, we propose a novel metric learning approach that learns a subspace affinity function using a siamese neural network architecture. Consequently, our model benefits from a constant number of parameters and a constant-size memory footprint, allowing it to scale to considerably larger datasets. In addition, we can formally show that our model is still able to exactly recover subspace clusters given an independence assumption. The siamese architecture in combination with a novel geometric classifier further makes our model inductive, allowing it to cluster out-of-sample data. Additionally, non-linear clusters can be detected by simply adding an auto-encoder module to the architecture. The whole model can then be trained end-to-end in a self-supervised manner. This work in progress reports promising preliminary results on the MNIST dataset. In the spirit of reproducible research, we make all code publicly available. In future work we plan to investigate several extensions of our model and to expand the experimental evaluation.
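To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a siamese affinity function: both inputs pass through the same shared encoder, and a subspace affinity is read off from the angle between the resulting embeddings, so the model size is independent of the number of data points. The encoder here is a single random ReLU layer purely for illustration; in the paper's setting it would be a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared encoder weights: both siamese branches use the SAME parameters,
# so the number of parameters is constant in the dataset size.
# (One random linear-ReLU layer, for illustration only.)
W = rng.standard_normal((16, 32))

def encode(x):
    """Shared branch of the siamese network."""
    return np.maximum(W @ x, 0.0)

def affinity(x1, x2, eps=1e-12):
    """Subspace affinity: absolute cosine similarity of the embeddings.
    Points lying on the same line through the origin (the same 1-D
    subspace) receive affinity 1, regardless of their scale."""
    z1, z2 = encode(x1), encode(x2)
    return abs(z1 @ z2) / (np.linalg.norm(z1) * np.linalg.norm(z2) + eps)

x = rng.standard_normal(32)
same = affinity(x, 2.0 * x)              # same 1-D subspace
other = affinity(x, rng.standard_normal(32))  # unrelated point
print(same, other)
```

Because the ReLU encoder is positively homogeneous, scaling an input by a positive constant scales its embedding by the same constant, so `same` evaluates to 1 up to floating-point precision, while `other` lies somewhere in [0, 1]. Clustering would then proceed on the resulting pairwise affinities (e.g. via spectral clustering), which is an assumption of this sketch rather than a detail stated in the abstract.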