Paper Title
WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses
Authors
Abstract
Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals, or even to navigate data across sources, each of which may easily contain thousands of tables. One common user need is to discover tables joinable with a given table. This need is particularly critical because join is a ubiquitous operation in data analysis, and join paths are mostly obscure to users, especially across databases. Furthermore, users are typically interested in finding "semantically" joinable tables: those with columns that can be transformed to become joinable even if they are not joinable as currently represented in the data store. We present WarpGate, a system prototype for data discovery over cloud data warehouses. WarpGate implements an embedding-based solution to semantic join discovery, which encodes columns into a high-dimensional vector space such that joinable columns map to points that are near each other. Through experiments on several table corpora, we show that WarpGate (i) captures semantic relationships between tables, especially those across databases, and (ii) is sample efficient and thus scalable to very large tables with millions of rows. We also showcase an application of WarpGate within an enterprise product for cloud data analytics.
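The embedding-based idea in the abstract can be illustrated with a minimal sketch. This is not WarpGate's actual encoder or index, which the abstract does not specify. It assumes a stand-in embedding (hashed character trigrams over a random sample of column values) and a brute-force cosine ranking. The names `embed_column`, `top_k_joinable`, `DIM`, and `sample_size` are hypothetical and chosen only for illustration. Sampling a fixed number of values per column is meant to mirror the sample-efficiency claim: the cost of embedding a column does not grow with the number of rows.

```python
import hashlib
import math
import random

DIM = 256  # embedding dimensionality (arbitrary choice for this sketch)


def _bucket(token: str) -> int:
    """Map a token to a stable bucket index in [0, DIM)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % DIM


def embed_column(values, sample_size=100, seed=0):
    """Embed a column as the L2-normalized count vector of hashed
    character trigrams, computed over a random sample of its values.
    (Stand-in encoder; an assumption, not the system's real model.)"""
    values = list(values)
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    vec = [0.0] * DIM
    for value in values:
        # Light normalization so "BOSTON" and "boston " embed alike.
        text = f"  {str(value).strip().lower()}  "
        for i in range(len(text) - 2):
            vec[_bucket(text[i:i + 3])] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(u, v):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_joinable(query_values, candidates, k=3):
    """Rank candidate columns by embedding similarity to the query column.
    `candidates` maps a column name to its list of values."""
    q = embed_column(query_values)
    scored = [(name, cosine(q, embed_column(col)))
              for name, col in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    query = ["New York", "Boston", "Chicago", "Seattle"]
    candidates = {
        "customers.city": ["new york", "BOSTON", "Chicago ", "Denver"],
        "orders.amount": [19.99, 5.0, 120.5, 42.0],
    }
    for name, score in top_k_joinable(query, candidates):
        print(f"{name}: {score:.3f}")
```

In this toy setup the city columns are not equi-joinable as stored (different casing and whitespace), yet they land close together in the vector space, which is the sense of "semantically joinable" described above. A production system would presumably replace the brute-force scan with an approximate nearest-neighbor index; that detail is an assumption here.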