Paper title
Addressing Leakage in Self-Supervised Contextualized Code Retrieval
Paper authors
Paper abstract
We address contextualized code retrieval, the search for code snippets that help fill gaps in a partial input program. Our approach facilitates large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that our approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.