Paper Title
Approximate Cross-Validation for Structured Models
Paper Authors
Abstract
Many modern data analyses benefit from explicitly modeling dependence structure in data -- such as measurements across time or space, ordered words in a sentence, or genes in a genome. A gold standard evaluation technique is structured cross-validation (CV), which leaves out some data subset (such as data within a time interval or data in a geographic region) in each fold. But CV here can be prohibitively slow due to the need to re-run already-expensive learning algorithms many times. Previous work has shown approximate cross-validation (ACV) methods provide a fast and provably accurate alternative in the setting of empirical risk minimization. But this existing ACV work is restricted to simpler models by the assumptions that (i) data across CV folds are independent and (ii) an exact initial model fit is available. In structured data analyses, both these assumptions are often untrue. In the present work, we address (i) by extending ACV to CV schemes with dependence structure between the folds. To address (ii), we verify -- both theoretically and empirically -- that ACV quality deteriorates smoothly with noise in the initial fit. We demonstrate the accuracy and computational benefits of our proposed methods on a diverse set of real-world applications.
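To make the idea of ACV in empirical risk minimization concrete, the following is a minimal sketch for the simple, unstructured setting that the abstract attributes to prior work: independent data points, leave-one-out folds, and an exact initial fit. It is not the structured method proposed in the paper. The choice of ridge regression and the names `fit_ridge` and `acv_leave_one_out` are assumptions made purely for illustration. The sketch computes two common approximations to each leave-one-out fit from a single full-data fit: an infinitesimal-jackknife step that reuses the full-data Hessian, and a Newton step that uses the leave-one-out Hessian.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Exact minimizer of 0.5*||X @ theta - y||^2 + 0.5*lam*||theta||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def acv_leave_one_out(X, y, lam):
    """Approximate every leave-one-out fit from one full-data fit.

    Returns (ij, ns), each of shape (n, d). Row i approximates the
    parameters that refitting without data point i would produce.
    """
    n, d = X.shape
    theta = fit_ridge(X, y, lam)            # single (expensive) full fit
    H = X.T @ X + lam * np.eye(d)           # Hessian of the full objective
    resid = X @ theta - y                   # per-point residuals
    G = X * resid[:, None]                  # row i: gradient of point i's loss
    HinvG = np.linalg.solve(H, G.T).T       # row i: H^{-1} g_i

    # Infinitesimal jackknife: one step using the full-data Hessian.
    ij = theta + HinvG

    # Newton step: uses the leave-one-out Hessian H - x_i x_i^T. By the
    # Sherman-Morrison identity this rescales H^{-1} g_i by 1 / (1 - h_i),
    # where h_i = x_i^T H^{-1} x_i is the leverage of point i.
    lev = np.einsum("ij,ij->i", X, np.linalg.solve(H, X.T).T)
    ns = theta + HinvG / (1.0 - lev)[:, None]
    return ij, ns
```

Because the ridge objective is quadratic, the Newton-step variant reproduces the exact leave-one-out fits in this toy case, while the infinitesimal-jackknife variant is only approximate. For non-quadratic losses both are approximations, and for the dependent folds and inexact initial fits that the abstract discusses, this simple recipe does not apply directly.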