QUIP: Query-driven Missing Value Imputation

Paper Title

QUIP: Query-driven Missing Value Imputation

Authors

Yiming Lin, Sharad Mehrotra

Abstract

Missing values are widespread in real-world data sets, and failing to clean them can degrade the quality of query answers. Traditionally, missing value imputation has been studied as an offline process, performed as part of preparing data for analysis. This paper studies query-time missing value imputation and proposes QUIP, which imputes only the minimal set of missing values needed to answer the query. Specifically, taking a reasonably good query plan as input, QUIP tries to minimize both the imputation cost and the query processing overhead. QUIP proposes a new outer join implementation that preserves missing values during query processing, and a Bloom-filter-based index structure that reduces space and runtime overhead. QUIP also designs a cost-based decision function that automatically guides each operator to either impute missing values immediately or delay imputation. Efficient optimizations are proposed to speed up aggregate operations in QUIP, such as the MAX/MIN operators. Extensive experiments on both real and synthetic data sets demonstrate the effectiveness and efficiency of QUIP, which outperforms the state-of-the-art ImputeDB by 2 to 10 times across different query sets and data sets, and achieves orders-of-magnitude improvement over the offline approach.
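The contrast between offline and query-time imputation can be illustrated with a small sketch. This is not QUIP's algorithm: it uses a simple column-mean imputer as a stand-in, a single conjunctive selection instead of a full query plan, and hypothetical names throughout (`ROWS`, `make_mean_imputer`, `eager_query`, `lazy_query`). It only shows the general idea that delaying imputation until a predicate actually needs a value can avoid imputing cells that cannot affect the answer.

```python
from statistics import mean

# Toy table (hypothetical data); None marks a missing value.
ROWS = [
    {"id": 1, "age": 34,   "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 29,   "income": None},
    {"id": 4, "age": None, "income": 20000},
    {"id": 5, "age": 61,   "income": 30000},
]

# Query: SELECT id WHERE age > 30 AND income > 50000
PREDICATES = [
    ("age", lambda v: v > 30),
    ("income", lambda v: v > 50000),
]


def make_mean_imputer(rows):
    """Return (impute, calls): a column-mean imputer and a call counter.

    Mean imputation is only a placeholder for a real (costly) imputer.
    """
    means = {
        attr: mean(r[attr] for r in rows if r[attr] is not None)
        for attr in ("age", "income")
    }
    calls = {"n": 0}

    def impute(attr):
        calls["n"] += 1
        return means[attr]

    return impute, calls


def eager_query(rows, impute):
    """Offline style: fill every missing cell first, then filter."""
    cleaned = [
        {k: (impute(k) if v is None else v) for k, v in r.items()}
        for r in rows
    ]
    return [r["id"] for r in cleaned if all(t(r[a]) for a, t in PREDICATES)]


def lazy_query(rows, impute):
    """Query-time style: impute a cell only if a predicate still needs it."""
    result = []
    for r in rows:
        # Check predicates on known values first, so a row can be
        # rejected before any imputation is spent on it.
        ordered = sorted(PREDICATES, key=lambda p: r[p[0]] is None)
        passed = True
        for attr, test in ordered:
            value = r[attr] if r[attr] is not None else impute(attr)
            if not test(value):
                passed = False
                break
        if passed:
            result.append(r["id"])
    return result


eager_impute, eager_calls = make_mean_imputer(ROWS)
lazy_impute, lazy_calls = make_mean_imputer(ROWS)
print(eager_query(ROWS, eager_impute), eager_calls["n"])
print(lazy_query(ROWS, lazy_impute), lazy_calls["n"])
```

On this toy table both strategies return the same ids, but the lazy version calls the imputer once while the eager version calls it three times, because rows 3 and 4 are rejected on their known attribute before their missing cell is ever needed.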
