Paper Title

Dataset Discovery in Data Lakes

Paper Authors

Alex Bogatu, Alvaro A. A. Fernandes, Norman W. Paton, Nikolaos Konstantinou

Paper Abstract

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
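To make the idea of hashing value features into a space where distances indicate relatedness more concrete, the following is a minimal sketch. It assumes MinHash signatures over column value sets as the hash-based feature, which is an illustrative choice: the abstract does not say which hashing scheme or which features the authors use. All names here (`minhash_signature`, `estimated_distance`, `rank_tables`, `NUM_HASHES`) are hypothetical, and the linear scan over the lake stands in for a real index.

```python
import hashlib

NUM_HASHES = 64  # assumed signature length; not taken from the paper


def _hash(value: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{value}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(values, num_hashes=NUM_HASHES):
    """MinHash signature of the set of normalised values in a column."""
    tokens = {str(v).strip().lower() for v in values}
    if not tokens:
        raise ValueError("cannot build a signature for an empty column")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_distance(sig_a, sig_b):
    """Estimated Jaccard distance: share of signature positions that differ."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def rank_tables(target_column, lake, k=2):
    """Return the k table names whose closest column is nearest to the target.

    `lake` maps table name -> {column name -> list of values}.
    This scans every column; a real system would query an index instead.
    """
    target_sig = minhash_signature(target_column)
    scored = []
    for name, columns in lake.items():
        best = min(
            estimated_distance(target_sig, minhash_signature(col))
            for col in columns.values()
        )
        scored.append((best, name))
    scored.sort()
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    lake = {
        "cities": {"name": ["Manchester", "Leeds", "York"]},
        "fruits": {"name": ["apple", "pear", "plum"]},
    }
    print(rank_tables(["york", "leeds", "manchester"], lake, k=1))
```

Because every signature has the same length, distances computed for different columns are directly comparable, which is the property the abstract attributes to a uniform distance space. Under these assumptions, a smaller distance would suggest a column is a better candidate for union or join with the target.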
