Similarity Search on Computational Notebooks

Paper Title

Similarity Search on Computational Notebooks

Authors

Misato Horiuchi, Yuya Sasaki, Chuan Xiao, Makoto Onizuka

Abstract

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and can be reused; however, searching for them manually is tedious, and so far no tools exist to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for it. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each content type independently, while graph-based similarity captures the relationships between contents. Our framework effectively prunes candidate notebooks that cannot be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular with graph-based similarity, achieves high accuracy and high efficiency.
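
To make the problem statement concrete, the following is a minimal sketch of a top-k search with candidate pruning over a single content type (the set of libraries a notebook imports). It is not the paper's method: the abstract does not specify how its set-based similarity is computed, so Jaccard similarity is assumed here purely for illustration, and the size-based upper bound is a standard pruning trick rather than the framework's actual technique. All function names and notebook data are hypothetical.

```python
import heapq


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_upper_bound(a, b):
    """Upper bound on Jaccard from set sizes alone: min(|a|, |b|) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def top_k_notebooks(query, notebooks, k):
    """Return up to k (score, name) pairs most similar to `query`, best first.

    Assumption: similarity is Jaccard over library sets (illustrative only).
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the current top-k as (score, name)
    for name, libs in notebooks.items():
        # Prune: a candidate whose upper bound cannot strictly beat the
        # current k-th best score is skipped without computing its similarity.
        if len(heap) == k and size_upper_bound(query, libs) <= heap[0][0]:
            continue
        score = jaccard(query, libs)
        if len(heap) < k:
            heapq.heappush(heap, (score, name))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, name))
    return sorted(heap, reverse=True)


# Hypothetical notebooks, each described by the libraries it imports.
notebooks = {
    "nb_a": {"pandas", "numpy", "sklearn"},
    "nb_b": {"pandas", "matplotlib"},
    "nb_c": {"torch"},
    "nb_d": {"pandas", "numpy", "sklearn", "matplotlib",
             "seaborn", "xgboost", "lightgbm"},
}
result = top_k_notebooks({"pandas", "numpy"}, notebooks, k=2)
print([name for _, name in result])
```

In this toy run, `nb_d` is skipped by the size bound alone (2/7 is below the current second-best score of 1/3), which is the kind of saving the pruning step is meant to provide. A graph-based measure, which the abstract says captures relationships between contents, would need a different representation and is not sketched here.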
