Paper Title
D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multi-view High-dimensional Data
Paper Authors
Paper Abstract
Modern biomedical studies often collect multi-view data, that is, multiple types of data measured on the same set of objects. A popular model in high-dimensional multi-view data analysis is to decompose each view's data matrix into a low-rank common-source matrix generated by latent factors common across all data views, a low-rank distinctive-source matrix corresponding to each view, and an additive noise matrix. We propose a novel decomposition method for this model, called decomposition-based generalized canonical correlation analysis (D-GCCA). The D-GCCA rigorously defines the decomposition on the L2 space of random variables in contrast to the Euclidean dot product space used by most existing methods, thereby being able to provide the estimation consistency for the low-rank matrix recovery. Moreover, to well calibrate common latent factors, we impose a desirable orthogonality constraint on distinctive latent factors. Existing methods, however, inadequately consider such orthogonality and may thus suffer from substantial loss of undetected common-source variation. Our D-GCCA takes one step further than generalized canonical correlation analysis by separating common and distinctive components among canonical variables, while enjoying an appealing interpretation from the perspective of principal component analysis. Furthermore, we propose to use the variable-level proportion of signal variance explained by common or distinctive latent factors for selecting the variables most influenced. Consistent estimators of our D-GCCA method are established with good finite-sample numerical performance, and have closed-form expressions leading to efficient computation especially for large-scale data. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
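To make the data model in the abstract concrete, the following is a minimal NumPy sketch that simulates multi-view data in which each view is the sum of a low-rank common-source matrix, a low-rank distinctive-source matrix, and additive noise. It is not the D-GCCA estimator. The function names (`simulate_views`, `common_variance_proportion`), the Gaussian loadings, the unit-variance factors, and the mutual independence of all latent factors are assumptions made purely for illustration. Under those assumptions, the variable-level proportion of signal variance explained by the common factors has a simple population form, which the second function computes from the true loadings.

```python
import numpy as np


def simulate_views(n=200, dims=(30, 40, 50), r_common=2, r_distinct=3,
                   noise_sd=0.5, seed=0):
    """Draw toy multi-view data X_k = C_k + D_k + E_k (variables x samples).

    Illustrative assumption: all latent factors are independent with unit
    variance, and loadings are standard Gaussian.
    """
    rng = np.random.default_rng(seed)
    # Common latent factors shared by every view.
    f_common = rng.standard_normal((r_common, n))
    views = []
    for p in dims:
        # View-specific (distinctive) latent factors, independent of f_common.
        f_distinct = rng.standard_normal((r_distinct, n))
        a = rng.standard_normal((p, r_common))    # hypothetical common loadings
        b = rng.standard_normal((p, r_distinct))  # hypothetical distinctive loadings
        c = a @ f_common                          # low-rank common-source matrix
        d = b @ f_distinct                        # low-rank distinctive-source matrix
        e = noise_sd * rng.standard_normal((p, n))  # additive noise matrix
        views.append({"X": c + d + e, "C": c, "D": d, "A": a, "B": b})
    return views


def common_variance_proportion(a, b):
    """Population share of each variable's signal variance due to common factors.

    Valid only under the toy assumption of independent unit-variance factors,
    where Var(C_j) = sum(a_j**2) and Var(D_j) = sum(b_j**2).
    """
    var_common = np.sum(a ** 2, axis=1)
    var_distinct = np.sum(b ** 2, axis=1)
    return var_common / (var_common + var_distinct)


if __name__ == "__main__":
    for k, v in enumerate(simulate_views()):
        prop = common_variance_proportion(v["A"], v["B"])
        # Variables with the largest proportion are the ones most influenced
        # by the common latent factors in this toy model.
        top = np.argsort(prop)[::-1][:3]
        print(f"view {k}: X shape {v['X'].shape}, top common-driven variables {top}")
```

In practice the loadings and the split into common and distinctive parts are unknown and must be estimated from the observed matrices alone, which is the problem the abstract says D-GCCA addresses.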