Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Paper Title

Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach

Authors

Dong, Yuyang; Takeoka, Kunihiro; Xiao, Chuan; Oyamada, Masafumi

Abstract

Finding joinable tables in data lakes is a key procedure in many applications, such as data integration, data augmentation, data analysis, and data markets. Traditional approaches that find equi-joinable tables cannot deal with misspellings or differing formats, nor do they capture semantic joins. In this paper, we propose PEXESO, a framework for joinable table discovery in data lakes. We embed textual values as high-dimensional vectors and join columns under similarity predicates on those vectors, thereby addressing the limitations of equi-join approaches and identifying more meaningful results. To efficiently find joinable tables under similarity, we propose a block-and-verify method that utilizes pivot-based filtering. A partitioning technique is developed to cope with the case where the data lake is large and the index cannot fit in main memory. An experimental evaluation on real datasets shows that our solution identifies substantially more tables than equi-joins and outperforms other similarity-based options, and that the join results are useful for data enrichment in machine learning tasks. The experiments also demonstrate the efficiency of the proposed method.
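
The abstract names two ideas that a small example may make more concrete: joining columns under a similarity predicate on vectors, and pivot-based filtering. The following is a minimal Python sketch, not the PEXESO algorithm itself. It assumes Euclidean distance, a distance threshold `tau`, and a joinability score defined as the fraction of query vectors that have at least one target vector within `tau`; these choices, the toy 2-D "embeddings", and all function names are illustrative assumptions that the abstract does not specify. The filter relies only on the triangle inequality: for any pivot p, d(q, v) >= |d(q, p) - d(v, p)|, so a candidate whose pivot distances differ from the query's by more than `tau` can be skipped without computing d(q, v).

```python
import math

def build_pivot_table(vectors, pivots):
    """Precompute the distance from every vector to every pivot."""
    return [[math.dist(v, p) for p in pivots] for v in vectors]

def has_match(q, vectors, pivot_table, pivots, tau):
    """Return (found, n_verified): whether some vector lies within tau of q,
    and how many exact distance computations were needed."""
    q_to_pivots = [math.dist(q, p) for p in pivots]
    n_verified = 0
    for v, v_to_pivots in zip(vectors, pivot_table):
        # Filter step: the triangle inequality gives a lower bound on d(q, v).
        # If the bound exceeds tau for any pivot, v cannot match q.
        if any(abs(dq - dv) > tau for dq, dv in zip(q_to_pivots, v_to_pivots)):
            continue
        # Verify step: compute the exact distance for surviving candidates.
        n_verified += 1
        if math.dist(q, v) <= tau:
            return True, n_verified
    return False, n_verified

def joinability(query_col, target_col, pivots, tau):
    """Assumed score: fraction of query vectors with a match in target_col."""
    table = build_pivot_table(target_col, pivots)
    hits = sum(has_match(q, target_col, table, pivots, tau)[0] for q in query_col)
    return hits / len(query_col)

# Toy 2-D vectors standing in for embeddings of textual cell values.
query_col = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
target_col = [(0.1, 0.0), (1.0, 1.1), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]  # hypothetical pivot choice
score = joinability(query_col, target_col, pivots, tau=0.2)
```

In this toy setting the first two query vectors each have a near neighbor in the target column and the third does not, so a table would be reported as joinable only if the caller's threshold on `score` is at most two thirds. How pivots are chosen, how candidates are grouped into blocks, and how the index is partitioned when it exceeds main memory are the paper's contributions and are not modeled here.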
