Paper title
Matrix Completion with Quantified Uncertainty through Low Rank Gaussian Copula
Paper authors
Paper abstract
Modern large-scale datasets are often plagued with missing entries. For tabular data with missing values, a flurry of imputation algorithms solve for a complete matrix that minimizes some penalized reconstruction error. However, almost none of them can estimate the uncertainty of their imputations. This paper proposes a probabilistic and scalable framework for missing value imputation with quantified uncertainty. Our model, the Low Rank Gaussian Copula, augments a standard probabilistic model, Probabilistic Principal Component Analysis, with marginal transformations for each column that allow the model to better match the distribution of the data. It naturally handles Boolean, ordinal, and real-valued observations and quantifies the uncertainty in each imputation. The time required to fit the model scales linearly with the number of rows and the number of columns in the dataset. Empirical results show the method yields state-of-the-art imputation accuracy across a wide range of data types, including those with high rank. Our uncertainty measure predicts imputation error well: entries with lower uncertainty do have lower imputation error (on average). Moreover, for real-valued data, the resulting confidence intervals are well-calibrated.
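To make the model structure in the abstract concrete, here is a minimal sketch in Python (NumPy) of a low rank Gaussian copula used as a generative model, followed by the standard Gaussian conditioning step that yields an imputation together with its uncertainty. This is an illustration under stated assumptions, not the authors' implementation. The function names (`sample_lrgc`, `latent_conditional`), the particular marginal transformations (exponential, threshold at zero, three fixed cut points), and the parameter values are all hypothetical. The conditioning step also assumes the latent values of the observed entries are known exactly, which is only reasonable for real-valued columns with known marginals; fitting the model and handling ordinal observations are not shown.

```python
import numpy as np


def sample_lrgc(n, p, rank, noise_var=0.2, seed=0):
    """Sample n rows from a toy low rank Gaussian copula with p columns.

    Latent model (PPCA form): z = W t + eps, with t ~ N(0, I_rank) and
    eps ~ N(0, noise_var * I_p). Rows of W are rescaled so that every
    latent coordinate has unit variance, as a copula requires.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, rank))
    # Scale each row so that ||w_j||^2 + noise_var = 1.
    W *= np.sqrt(1.0 - noise_var) / np.linalg.norm(W, axis=1, keepdims=True)

    t = rng.standard_normal((n, rank))
    eps = np.sqrt(noise_var) * rng.standard_normal((n, p))
    Z = t @ W.T + eps  # latent Gaussian matrix, shape (n, p)

    # Hypothetical monotone marginal transformations, one per column.
    X = np.empty_like(Z)
    for j in range(p):
        kind = j % 3
        if kind == 0:
            X[:, j] = np.exp(Z[:, j])  # real-valued, skewed
        elif kind == 1:
            X[:, j] = Z[:, j] > 0.0  # Boolean, stored as 0.0 / 1.0
        else:
            # Ordinal with four levels 0..3 from three cut points.
            X[:, j] = np.digitize(Z[:, j], [-1.0, 0.0, 1.0])
    return X, Z, W


def latent_conditional(W, noise_var, z_obs, obs_idx, mis_idx):
    """Conditional mean and covariance of missing latent entries.

    Uses the low rank plus diagonal covariance Sigma = W W^T + noise_var * I
    and the standard formula for conditioning a multivariate Gaussian.
    Assumes the observed latent values z_obs are known exactly.
    """
    Sigma = W @ W.T + noise_var * np.eye(W.shape[0])
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    # K = S_mo @ inv(S_oo), computed with a linear solve (S_oo is symmetric).
    K = np.linalg.solve(S_oo, S_mo.T).T
    mean = K @ z_obs
    cov = S_mm - K @ S_mo.T
    return mean, cov


X, Z, W = sample_lrgc(n=2000, p=6, rank=2)
obs_idx, mis_idx = [0, 1, 2, 3], [4, 5]
mean, cov = latent_conditional(W, 0.2, Z[0, obs_idx], obs_idx, mis_idx)
print("conditional mean:", mean)
print("conditional std: ", np.sqrt(np.diag(cov)))
```

In this sketch the conditional standard deviation plays the role of the per-entry uncertainty: a small value means the observed entries pin down the missing latent value tightly. For a real-valued column, applying the monotone marginal transformation to the endpoints of a latent interval would map it to an interval on the original scale. For simplicity the code forms the full p-by-p covariance; it does not attempt the linear-time computation mentioned in the abstract.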