Paper Title

Turn Tree into Graph: Automatic Code Review via Simplified AST Driven Graph Convolutional Network

Authors

Wu, B., Liang, B., Zhang, X.

Abstract

Automatic code review (ACR), which can relieve the cost of manual inspection, is an indispensable task in software engineering. To deal with ACR, existing work serializes the abstract syntax tree (AST). However, making sense of the whole AST with a sequence encoding approach is a daunting task, mostly because redundant nodes in the AST hinder the transmission of node information. Moreover, a serialized representation is inadequate to capture the tree-structure information in the AST. In this paper, we first present a new large-scale Apache Automatic Code Review (AACR) dataset for the ACR task, since no publicly available dataset exists for this task. The release of this dataset should push forward research in this field. Based on it, we propose a novel Simplified AST based Graph Convolutional Network (SimAST-GCN) to deal with the ACR task. Concretely, to improve the efficiency of node information dissemination, we first simplify the AST of the code by deleting redundant nodes that do not contain connection attributes, thus deriving a Simplified AST. Then, we construct a relation graph for each code snippet based on the Simplified AST to properly embed the relations among code fragments of the tree structure into the graph. Subsequently, in light of the merits of the graph structure, we explore a graph convolutional network architecture that follows an attention mechanism to leverage the crucial implications of code fragments and derive code representations. Finally, we exploit a simple but effective subtraction operation on the representations of the original and revised code, enabling the revised difference to be better learned for deciding the results of ACR. Experimental results on the AACR dataset illustrate that our proposed model outperforms the state-of-the-art methods.
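The pipeline outlined in the abstract (AST simplification, relation-graph construction, graph convolution with attention pooling, and the subtraction of the two code representations) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the single-child chain-collapsing rule, the embedding sizes, and all names here are assumptions.

```python
import numpy as np

class Node:
    """A bare AST node for illustration (label + children)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def simplify(node):
    """Collapse chains of single-child nodes. Assumption: a node with exactly
    one child carries no branching ("connection") information, so skipping it
    shortens the paths that node information must travel."""
    while len(node.children) == 1:
        node = node.children[0]
    node.children = [simplify(c) for c in node.children]
    return node

def to_adjacency(root):
    """Turn the simplified tree into a relation graph: a symmetric adjacency
    matrix with self-loops, indexed in pre-order."""
    labels, edges = [], []
    def walk(n, parent):
        idx = len(labels)
        labels.append(n.label)
        if parent is not None:
            edges.append((parent, idx))
        for c in n.children:
            walk(c, idx)
    walk(root, None)
    A = np.eye(len(labels))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return labels, A

def gcn_layer(A, H, W):
    """One graph-convolution step with symmetric normalization and ReLU."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A @ d_inv_sqrt @ H @ W, 0.0)

def attention_pool(H, q):
    """Softmax attention over nodes, pooling into one code representation."""
    scores = H @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ H

# Minimal demo on a hypothetical AST fragment.
root = Node("Block", [Node("Expr", [Node("Call", [Node("Name"), Node("Arg")])])])
labels, A = to_adjacency(simplify(root))  # "Block" -> "Expr" chain collapses

rng = np.random.default_rng(0)
H = rng.normal(size=(len(labels), 4))  # initial node embeddings
W = rng.normal(size=(4, 4))            # GCN weights
q = rng.normal(size=4)                 # attention query

r_original = attention_pool(gcn_layer(A, H, W), q)
r_revised = attention_pool(gcn_layer(A, H + 0.1, W), q)  # stand-in for revised code
diff = r_original - r_revised  # subtraction feature fed to the ACR classifier
```

In the demo, the `Block -> Expr` chain collapses so that `Call` becomes the root, and the final decision feature is simply the elementwise difference of the two pooled representations, matching the subtraction operation the abstract describes (the downstream classifier is omitted).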
