Paper title
Jupyter Notebooks on GitHub: Characteristics and Code Clones
Paper authors
Abstract
Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they are built from a collection of code snippets interleaved with text and visualisations. This allows interactive exploration, and snippets may be executed in different orders, which can produce different results because of side effects between snippets. Previous studies have shown considerable code duplication (code clones) in the sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing 37 million individual snippets and 227 million lines of code. We study clones at the level of individual snippets, and the extent to which snippets recur across multiple notebooks. We study both identical clones and approximate clones, and conduct a small-scale ocular inspection of the most common clones. We find that code cloning is common in Jupyter notebooks: more than 70% of all code snippets are exact copies of other snippets (with possible differences in whitespace), and around 50% of all notebooks do not have any unique snippet but consist solely of snippets that are also found elsewhere. In notebooks written in Python, at least 80% of all snippets are approximate clones, and the prevalence of code cloning is higher in Python than in other languages. We further find that clones between different repositories are far more common than clones within the same repository. However, the most common individual repository from which a Jupyter notebook contains clones is the repository in which the notebook itself resides.
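To make the snippet-level measurements concrete, the following is a minimal sketch of how exact clones "with possible differences in whitespace" could be counted. It is not the authors' pipeline. The function names, the input shape (a dict mapping a notebook identifier to its list of code snippets), and the normalisation rule (collapsing every run of whitespace to a single space, which also ignores indentation) are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def normalize(snippet: str) -> str:
    # Assumed normalisation: collapse every run of whitespace to one space.
    # This also discards indentation, a simplification for illustration.
    return " ".join(snippet.split())


def clone_groups(notebooks):
    # Group snippet locations (notebook id, snippet index) by the hash of
    # their normalised text. `notebooks` is a hypothetical input shape:
    # {notebook_id: [snippet_source, ...]}.
    groups = defaultdict(list)
    for nb_id, snippets in notebooks.items():
        for idx, code in enumerate(snippets):
            key = hashlib.sha1(normalize(code).encode("utf-8")).hexdigest()
            groups[key].append((nb_id, idx))
    return groups


def clone_fraction(notebooks):
    # Fraction of snippets whose normalised text occurs more than once.
    groups = clone_groups(notebooks)
    total = sum(len(locs) for locs in groups.values())
    cloned = sum(len(locs) for locs in groups.values() if len(locs) > 1)
    return cloned / total if total else 0.0


def notebooks_without_unique_snippets(notebooks):
    # Notebooks in which every snippet also occurs somewhere else.
    has_unique = set()
    for locs in clone_groups(notebooks).values():
        if len(locs) == 1:
            has_unique.add(locs[0][0])
    return sorted(nb for nb in notebooks if nb not in has_unique)


if __name__ == "__main__":
    corpus = {
        "a.ipynb": ["import numpy as np", "x = 1"],
        "b.ipynb": ["import  numpy   as np\n"],
    }
    print(clone_fraction(corpus))
    print(notebooks_without_unique_snippets(corpus))
```

Hashing the normalised text keeps memory proportional to the number of distinct snippets rather than to their total size, which matters at the scale of tens of millions of snippets. Approximate clones would need a different technique, such as token-based similarity, and are not covered by this sketch.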