Paper Title

Similarity Search on Computational Notebooks

Authors

Misato Horiuchi, Yuya Sasaki, Chuan Xiao, Makoto Onizuka

Abstract

Computational notebook software such as Jupyter Notebook is popular for data science tasks. Numerous computational notebooks are available on the Web and are reusable; however, searching for computational notebooks manually is a tedious task, and so far there are no tools to search for computational notebooks effectively and efficiently. In this paper, we propose a similarity search on computational notebooks and develop a new framework for the similarity search. Given contents (i.e., source code, tabular data, libraries, and output formats) in computational notebooks as a query, the similarity search problem aims to find the top-k computational notebooks with the most similar contents. We define two similarity measures: set-based and graph-based similarity. Set-based similarity handles each content independently, while graph-based similarity captures the relationships between contents. Our framework can effectively prune the candidates of computational notebooks that should not be in the top-k results. Furthermore, we develop optimization techniques such as caching and indexing to accelerate the search. Experiments using Kaggle notebooks show that our method, in particular graph-based similarity, can achieve high accuracy and high efficiency.
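
The set-based top-k search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of Jaccard similarity per content type, the weighted sum, the dictionary representation of a notebook, and all names (`jaccard`, `set_based_similarity`, `top_k`, the sample notebooks) are assumptions made for illustration only. Pruning, caching, indexing, and graph-based similarity are not shown.

```python
import heapq


def jaccard(a, b):
    """Jaccard similarity of two sets; treated as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def set_based_similarity(query, notebook, weights):
    """Score each content type independently, then combine with weights.

    Assumption: the abstract only says each content is handled
    independently; the weighted Jaccard sum is an illustrative choice.
    """
    return sum(
        w * jaccard(query.get(kind, set()), notebook.get(kind, set()))
        for kind, w in weights.items()
    )


def top_k(query, notebooks, weights, k):
    """Return the k (score, name) pairs with the highest similarity."""
    scored = (
        (set_based_similarity(query, nb, weights), name)
        for name, nb in notebooks.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical notebooks, each reduced to sets of libraries and code tokens.
notebooks = {
    "nb_a": {"libraries": {"pandas", "numpy"}, "code": {"read_csv", "fit"}},
    "nb_b": {"libraries": {"pandas"}, "code": {"read_csv", "plot"}},
    "nb_c": {"libraries": {"torch"}, "code": {"train"}},
}
query = {"libraries": {"pandas", "numpy"}, "code": {"read_csv"}}
weights = {"libraries": 0.5, "code": 0.5}

for score, name in top_k(query, notebooks, weights, k=2):
    print(name, score)
```

This brute-force version scores every notebook. The framework in the paper instead prunes candidates that cannot reach the top-k results, which this sketch does not attempt.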
