Clone Detection on Large Scala Codebases

Paper title

Clone Detection on Large Scala Codebases

Authors

Wahidur Rahman, Yisen Xu, Fan Pu, Jifeng Xuan, Xiangyang Jia, Michail Basios, Leslie Kanthan, Lingbo Li, Fan Wu, Baowen Xu

Abstract

Code clones are identical or similar code segments. The widespread existence of code clones can increase maintenance costs and jeopardise software quality. The research community has developed many techniques to detect code clones; however, there is little evidence of how these techniques perform in industrial use cases. In this paper, we aim to uncover the differences that arise when such techniques are applied in industrial use cases. We conducted a large-scale experimental study of the performance of two state-of-the-art code clone detection techniques, SourcererCC and AutoenCODE, on both open source projects and an industrial project written in the Scala language. Our results reveal that both algorithms perform differently on the industrial project, with the largest drop in precision being 30.7% and the largest increase in recall being 32.4%. By having the developers of the industrial project manually label samples from it, we discovered that it contains substantially fewer Type-3 clones than the open source projects.
