Paper Title

SANTOS: Relationship-based Semantic Table Union Search

Authors

Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns. The first uses an existing knowledge base (KB); the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
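
The difference between column-based and relationship-based unionability can be illustrated with a small sketch. This is not the SANTOS algorithm, only a toy under stated assumptions: the knowledge base, the table contents, and every function name below are hypothetical and invented for illustration. Two candidate tables both contain a person column and a city column, so a column-only score cannot separate them, while a score that also compares the relationship between the two columns can.

```python
from itertools import permutations

# Hypothetical toy knowledge base (invented data, not from the paper).
TOY_KB_TYPES = {
    "Ana": "person", "Ben": "person", "Cara": "person",
    "Lima": "city", "Oslo": "city", "Rome": "city",
}
TOY_KB_RELATIONS = {
    ("Ana", "Lima"): "born_in", ("Ben", "Oslo"): "born_in", ("Cara", "Rome"): "born_in",
    ("Ana", "Oslo"): "works_in", ("Ben", "Rome"): "works_in", ("Cara", "Lima"): "works_in",
}


def most_common(labels):
    """Return the most frequent non-None label, or None if there is none."""
    labels = [label for label in labels if label is not None]
    return max(set(labels), key=labels.count) if labels else None


def column_types(table):
    """Map each column name to the dominant KB type of its values."""
    types = {}
    for col in table[0]:
        label = most_common([TOY_KB_TYPES.get(row[col]) for row in table])
        if label is not None:
            types[col] = label
    return types


def semantic_triples(table):
    """Collect (type, relationship, type) triples over ordered column pairs."""
    types = column_types(table)
    triples = set()
    for a, b in permutations(table[0], 2):
        rel = most_common([TOY_KB_RELATIONS.get((row[a], row[b])) for row in table])
        if rel is not None and a in types and b in types:
            triples.add((types[a], rel, types[b]))
    return triples


def column_score(query, candidate):
    """Fraction of the query's column types that the candidate also has."""
    q = set(column_types(query).values())
    c = set(column_types(candidate).values())
    return len(q & c) / len(q) if q else 0.0


def relationship_score(query, candidate):
    """Fraction of the query's relationship triples that the candidate also has."""
    q, c = semantic_triples(query), semantic_triples(candidate)
    return len(q & c) / len(q) if q else 0.0


QUERY = [{"name": "Ana", "birthplace": "Lima"}, {"name": "Ben", "birthplace": "Oslo"}]
BIRTHS = [{"person": "Cara", "city": "Rome"}, {"person": "Ben", "city": "Oslo"}]
JOBS = [{"person": "Ana", "city": "Oslo"}, {"person": "Ben", "city": "Rome"}]

for label, table in [("births", BIRTHS), ("jobs", JOBS)]:
    print(label, column_score(QUERY, table), relationship_score(QUERY, table))
```

In this toy setting both candidates match the query on column types, but only the births table shares the query's (person, born_in, city) relationship. A synthesized KB, as described in the abstract, would play the role of `TOY_KB_RELATIONS` when no existing KB covers the values in the data lake; how such a KB is built is not shown here.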
