Paper Title
What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code
Paper Authors
Paper Abstract
Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, there has been little progress so far on the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing of word embeddings, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) attention aligns strongly with the syntax structure of code; (2) pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer; and (3) pre-trained models of code have the ability to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the pre-training process to obtain better code representations.
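As a concrete illustration of the first perspective (attention analysis), the sketch below shows one way to pull per-head attention maps out of CodeBERT with the HuggingFace Transformers library and to check which tokens a given code token attends to. It is a minimal sketch under stated assumptions: the microsoft/codebert-base checkpoint, the toy code snippet, and the "where does '(' attend" heuristic are illustrative choices, not the paper's exact analysis protocol.

```python
# Minimal sketch (assumptions noted above): inspect CodeBERT's attention maps.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base", output_attentions=True)
model.eval()

code = "def add(a, b): return a + b"   # toy snippet for illustration
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape [batch, num_heads, seq_len, seq_len].
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
open_idx = next(i for i, t in enumerate(tokens) if "(" in t)  # first '(' sub-token

for layer, att in enumerate(outputs.attentions):
    # For every head, find the token that '(' attends to most strongly.
    best = att[0, :, open_idx, :].argmax(dim=-1)   # shape: [num_heads]
    targets = [tokens[i] for i in best.tolist()]
    print(f"layer {layer:2d}: '(' attends most to {targets}")
```

Comparing such per-head attention targets against syntactically related tokens in the code's AST (e.g., the matching ')') is one simple way to probe how strongly attention aligns with the syntax structure of code.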