Paper Title

DiffSearch: A Scalable and Precise Search Engine for Code Changes

Authors

Luca Di Grazia, Paul Bredl, Michael Pradel

Abstract

The source code of successful projects is evolving all the time, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. This paper presents DiffSearch, a search engine that, given a query that describes a code change, returns a set of changes that match the query. The approach is enabled by three key contributions. First, we present a query language that extends the underlying programming language with wildcards and placeholders, providing an intuitive way of formulating queries that is easy to adapt to different programming languages. Second, to ensure scalability, the approach indexes code changes in a one-time preprocessing step, mapping them into a feature space, and then performs an efficient search in the feature space for each query. Third, to guarantee precision, i.e., that any returned code change indeed matches the given query, we present a tree-based matching algorithm that checks whether a query can be expanded to a concrete code change. We present implementations for Java, JavaScript, and Python, and show that the approach responds within seconds to queries across one million code changes, has a recall of 80.7% for Java, 89.6% for Python, and 90.4% for JavaScript, enables users to find relevant code changes more effectively than a regular expression-based search, and is helpful for gathering a large-scale dataset of real-world bug fixes.
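The two-stage idea in the abstract (a cheap search in a feature space, followed by an exact check that guarantees precision) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the DiffSearch implementation: the `ID_<n>` placeholder syntax, the bag-of-tokens features, and the function names `features`, `matches`, `build_index`, and `search` are all hypothetical. In particular, the exact check here compares flat token sequences, which is a deliberate simplification of the tree-based matching the paper describes.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers, integers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    return TOKEN.findall(code)


def features(old, new):
    """Bag of tokens, tagged by the side of the change they appear on."""
    return Counter(
        [("old", t) for t in tokenize(old)] + [("new", t) for t in tokenize(new)]
    )


def matches(pattern, code, bindings):
    """Exact check on token sequences (a simplification of tree matching).

    A pattern token of the form ID_<n> (assumed syntax) binds to one
    identifier and must bind to the same identifier everywhere.
    """
    p, c = tokenize(pattern), tokenize(code)
    if len(p) != len(c):
        return False
    for pt, ct in zip(p, c):
        if pt.startswith("ID_"):
            if not ct.isidentifier() or bindings.setdefault(pt, ct) != ct:
                return False
        elif pt != ct:
            return False
    return True


def build_index(changes):
    """One-time preprocessing: compute the features of every code change."""
    return [{"old": o, "new": n, "features": features(o, n)} for o, n in changes]


def search(query_old, query_new, index, k=10):
    """Rank by feature overlap, then keep only candidates that match exactly."""
    query_features = Counter(
        {f: n for f, n in features(query_old, query_new).items()
         if not f[1].startswith("ID_")}  # placeholders carry no concrete feature
    )
    ranked = sorted(
        index, key=lambda e: -sum((query_features & e["features"]).values())
    )
    results = []
    for entry in ranked[:k]:
        bindings = {}  # shared across old and new side of one candidate
        if matches(query_old, entry["old"], bindings) and matches(
            query_new, entry["new"], bindings
        ):
            results.append((entry["old"], entry["new"]))
    return results


index = build_index([
    ("if (x == null)", "if (x != null)"),
    ("return a + b;", "return a - b;"),
    ("if (item == null)", "if (item != null)"),
    ("if (x == null)", "if (y != null)"),  # rejected: ID_0 would bind to x and y
])
results = search("if (ID_0 == null)", "if (ID_0 != null)", index)
print(results)
```

Because only the top `k` candidates from the feature ranking are verified, a matching change that ranks poorly can be missed, while every returned change is guaranteed to match. This mirrors, in a very simplified form, why the abstract reports recall as a measured quantity while treating precision as guaranteed.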
