Paper title
Supervised Convex Clustering
Paper authors
Paper abstract
Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to its unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named Supervised Convex Clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's Disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's Disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.