Paper Title
On unsupervised projections and second order signals
Paper Authors
Paper Abstract
Linear projections are widely used in the analysis of high-dimensional data. In unsupervised settings where the data harbour latent classes/clusters, the question of whether class-discriminatory signals are retained under projection is crucial. In the case of mean differences between classes, this question has been well studied. However, in many contemporary applications, notably in biomedicine, group differences at the level of covariance or graphical model structure are important. Motivated by such applications, in this paper we ask whether linear projections can preserve differences in second order structure between latent groups. We focus on unsupervised projections, which can be computed without knowledge of class labels. We discuss a simple theoretical framework to study the behaviour of such projections, which we use to inform an analysis via quasi-exhaustive enumeration. This allows us to consider the performance, over more than a hundred thousand sets of data-generating population parameters, of two popular projections, namely random projections (RP) and Principal Component Analysis (PCA). Across this broad range of regimes, PCA turns out to be more effective at retaining second order signals than RP and is often even competitive with supervised projection. We complement these results with fully empirical experiments showing 0-1 loss using simulated and real data. We also study the effect of projection dimension, drawing attention to a bias-variance trade-off in this respect. Our results show that PCA can indeed be a suitable first step for unsupervised analysis, including in cases where differential covariance or graphical model structure is of interest.
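
To make the setting concrete, the following is a minimal illustrative sketch, not the paper's code: two latent classes share the same mean but differ only in covariance, and we compare how well a Gaussian random projection (RP) and PCA, both computed without class labels, retain this second-order signal, using the test 0-1 loss of a quadratic discriminant classifier on the projected data as a proxy. The dimensions, the specific covariance construction, and the use of scikit-learn are assumptions made for illustration.

```python
# Illustrative sketch only (assumed setup, not the paper's experiments):
# two classes with equal means but different covariances; compare RP vs PCA
# at retaining the second-order signal, scored by QDA 0-1 loss after projection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
p, d, n = 100, 10, 2000          # ambient dimension, projection dimension, samples per class

# Class covariances: identity vs. a block with inflated variance (no mean difference).
Sigma0 = np.eye(p)
Sigma1 = np.eye(p)
Sigma1[:20, :20] += 0.9          # second-order difference confined to 20 coordinates

X0 = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
X1 = rng.multivariate_normal(np.zeros(p), Sigma1, size=n)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

def zero_one_loss(Z, y):
    """Split the projected data, fit QDA on the training half, return test 0-1 loss."""
    Ztr, Zte, ytr, yte = train_test_split(Z, y, test_size=0.5, random_state=0, stratify=y)
    qda = QuadraticDiscriminantAnalysis().fit(Ztr, ytr)
    return 1.0 - qda.score(Zte, yte)

# Unsupervised projections: Gaussian RP and PCA, both computed without labels.
R = rng.normal(size=(p, d)) / np.sqrt(d)       # random projection matrix
Z_rp = X @ R
Z_pca = PCA(n_components=d).fit_transform(X)

print(f"0-1 loss after RP : {zero_one_loss(Z_rp, y):.3f}")
print(f"0-1 loss after PCA: {zero_one_loss(Z_pca, y):.3f}")
```

Varying the projection dimension d in this sketch also gives a rough, informal sense of the bias-variance trade-off mentioned in the abstract: very small d may discard discriminatory directions, while large d leaves more parameters for the downstream classifier to estimate.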