Paper Title
Source Code Comments: Overlooked in the Realm of Code Clone Detection
Authors
Abstract
Reusing code can produce duplicate or near-duplicate code clones in code repositories. Current code clone detection techniques, such as those based on Program Dependence Graphs, rely on code structure and its dependencies to detect clones. These techniques are expensive, consuming large amounts of processing power, time, and memory. In practice, programmers often use code comments to comprehend and reuse code, as comments carry important domain knowledge. Yet current clone detection techniques ignore code comments, mainly because of the ambiguity of the English language. Recent advances in information retrieval may make it possible to use code comments for clone detection. We investigated this by empirically comparing the accuracy of detecting clones using only comments against using only source code (without comments) on the JHotDraw package, which contains 315 classes and 27K lines of code. To detect clones at the file level, we used a topic modeling technique, Latent Dirichlet Allocation, to analyze code comments, and GRAPLE, which relies on Program Dependence Graphs, to analyze code. Our results show 94.86% recall and 84.21% precision with Latent Dirichlet Allocation, versus 28.7% recall and 55.39% precision with GRAPLE. We found that Latent Dirichlet Allocation generated false positives when programs lacked quality comments. This limitation can be addressed with a hybrid approach: use code comments at the file level to reduce the clone set, then apply Program Dependence Graph-based techniques at the method level to detect precise clones. Our further analysis across Java and Python packages, Java Swing and PyGUI, found a recall of 74.86% and a precision of 84.21%. Our findings call for reexamining the assumptions regarding the use of code comments in current clone detection techniques.
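The file-level, comment-only stage of the hybrid approach can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes plain word-count cosine similarity for Latent Dirichlet Allocation, and the function names (`comment_tokens`, `candidate_pairs`, `precision_recall`), the regular expression, and the 0.5 threshold are all assumptions made for illustration. The surviving pairs would then be passed to a Program Dependence Graph-based detector at the method level.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Matches Java-style /* ... */ block comments and // line comments.
# Simplification: comment markers inside string literals are not handled.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)


def comment_tokens(source):
    """Return a bag of lowercase words drawn only from the comments in source."""
    text = " ".join(COMMENT_RE.findall(source))
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(files, threshold=0.5):
    """File-level stage: keep file pairs whose comments look similar.

    files maps a file name to its source text. The threshold is an
    arbitrary illustrative value, not one taken from the paper.
    """
    bags = {name: comment_tokens(src) for name, src in files.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(bags), 2)
        if cosine(bags[x], bags[y]) >= threshold
    ]


def precision_recall(predicted, actual):
    """Precision and recall of predicted clone pairs against known clone pairs."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall
```

Calling `candidate_pairs` on a dictionary of file names and source strings returns the reduced clone set, and `precision_recall` scores that set against a hand-labeled list of clone pairs, which is how figures such as the recall and precision reported above are defined.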