Paper Title
DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Paper Authors
Paper Abstract
Due to its usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables to create a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of the query column and the target table repository, or approximate solutions that lack precision. In this paper, we propose DeepJoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval that employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents into a text sequence. The PLM reads the sequence and is fine-tuned to embed columns into vectors such that columns expected to be joinable are close to each other in the vector space. Since the output of the PLM is fixed in length, the subsequent search procedure is independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is logarithmic in the repository size. To train the model, we devise techniques for preparing training data as well as for data augmentation. Experiments on real datasets demonstrate that by training on a small subset of a corpus, DeepJoin generalizes to large datasets and its precision consistently outperforms that of other approximate solutions. DeepJoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, DeepJoin is up to two orders of magnitude faster than existing solutions.
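The pipeline described in the abstract (contextualize a column into a text sequence, embed it with a fine-tuned PLM, retrieve nearest columns in vector space) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `contextualize` format is a simplified hypothetical variant of the paper's contextualization options, and `embed` is a stand-in for the fine-tuned PLM encoder (it hashes character trigrams into a fixed-length vector so the sketch runs without model weights). A production system would also replace the linear scan in `top_k_joinable` with an approximate nearest neighbor index such as HNSW, which is what makes search time roughly logarithmic in the repository size.

```python
import numpy as np

def contextualize(column_name, cells, max_cells=50):
    # One simple contextualization: the column name followed by a sorted
    # sample of distinct cell values, serialized as a single text sequence.
    sample = sorted(set(cells))[:max_cells]
    return f"{column_name}: " + ", ".join(sample)

def embed(text, dim=256):
    # Hypothetical stand-in for the fine-tuned PLM: hash character trigrams
    # into a fixed-length L2-normalized vector. The key property mirrored
    # here is that the output length is fixed regardless of column size.
    v = np.zeros(dim)
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def top_k_joinable(query_vec, index_vecs, k=2):
    # Cosine similarity reduces to a dot product on normalized vectors.
    sims = index_vecs @ query_vec
    return np.argsort(-sims)[:k]

# Toy repository: column name -> cell values.
columns = {
    "country": ["USA", "Japan", "Germany", "France"],
    "nation":  ["United States", "Japan", "Germany", "Brazil"],
    "price":   ["9.99", "12.50", "3.10", "7.25"],
}
names = list(columns)
index = np.stack([embed(contextualize(n, c)) for n, c in columns.items()])

# Query column: overlapping country names should rank the country-like
# columns above the numeric "price" column.
q = embed(contextualize("country", ["Japan", "Germany", "Italy", "Spain"]))
ranked = [names[i] for i in top_k_joinable(q, index)]
print(ranked)
```

The trigram hashing is only a cheap proxy for the learned embedding; the point of the sketch is the shape of the pipeline, where the repository columns are embedded offline once and each query costs a single encoding plus a fixed-dimensional vector search.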