Paper Title


CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Authors

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma

Abstract

Evaluation metrics play a vital role in the growth of an area, as they define the standard for distinguishing good models from bad ones. In the area of code synthesis, the commonly used evaluation metrics are BLEU and perfect accuracy, but neither is well suited to evaluating code: BLEU was originally designed to evaluate natural language and neglects important syntactic and semantic features of code, while perfect accuracy is too strict and thus underestimates different outputs that share the same semantic logic. To remedy this, we introduce a new automatic evaluation metric, dubbed CodeBLEU. It absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow. We conduct experiments by evaluating the correlation coefficient between CodeBLEU and quality scores assigned by programmers on three code synthesis tasks, i.e., text-to-code, code translation, and code refinement. Experimental results show that our proposed CodeBLEU achieves a better correlation with programmer-assigned scores than BLEU and accuracy do.
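The abstract describes CodeBLEU as a combination of an n-gram match (BLEU), a weighted n-gram match, an AST-based syntactic match, and a data-flow-based semantic match. The sketch below shows only how such component scores could be combined into one metric; the component values and the uniform weights are illustrative placeholders, not the paper's actual implementation of each sub-score.

```python
# Illustrative sketch of combining CodeBLEU-style component scores.
# Each component is assumed to already be a score in [0, 1]:
#   ngram          - standard n-gram (BLEU-style) match
#   weighted_ngram - n-gram match with higher weight on code keywords
#   ast_match      - syntactic match via abstract syntax subtrees
#   dataflow_match - semantic match via data-flow graphs
def code_bleu(ngram, weighted_ngram, ast_match, dataflow_match,
              weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted linear combination of the four component scores."""
    a, b, g, d = weights
    return (a * ngram + b * weighted_ngram
            + g * ast_match + d * dataflow_match)

# Hypothetical component scores for one candidate program:
score = code_bleu(0.6, 0.7, 0.8, 0.9)
```

With equal weights, the combined score is simply the mean of the four components (0.75 here); tuning the weights lets the metric emphasize syntax or semantics over surface n-gram overlap.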
