Paper Title

Eris: Measuring discord among multidimensional data sources

Authors

Abello, Alberto; Cheney, James

Abstract

Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching and data fusion. To solve the latter, it is mostly assumed that ground truth can be determined. However, in general, the data gathering processes in the different sources are imperfect and cannot provide an accurate merging of values. Thus, in the absence of ways to determine ground truth, it is important to at least quantify how far from being internally consistent a dataset is. Hence, we propose definitions of concordant data and define a discordance metric as a way of measuring disagreement to improve decision making based on trustworthiness. We define the discord measurement problem of numerical attributes in which given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different conceptualizations of the same reality (e.g., granularities or units), we wish to assess whether the different sources are concordant, or if not, use the discordance metric to quantify how discordant they are. We also define a set of algebraic operators to describe the alignments of different data sources with correctness guarantees, together with two alternative relational database implementations that reduce the problem to linear or quadratic programming. These are evaluated against both COVID-19 and synthetic data, and our experimental results show that discordance measurement can be performed efficiently in realistic situations.
