Paper Title
CodeS: Towards Code Model Generalization Under Distribution Shift
Paper Authors
Paper Abstract
Distribution shift has been a longstanding challenge for the reliable deployment of deep learning (DL) models due to unexpected accuracy degradation. Although DL has become a driving force for large-scale source code analysis in the big code era, limited progress has been made on distribution shift analysis and benchmarking for source code tasks. To fill this gap, this paper proposes CodeS, a distribution shift benchmark dataset for source code learning. Specifically, CodeS supports two programming languages (Java and Python) and five shift types (task, programmer, time-stamp, token, and concrete syntax tree). Extensive experiments based on CodeS reveal that 1) out-of-distribution detectors from other domains (e.g., computer vision) do not generalize to source code, 2) all code classification models suffer from distribution shifts, 3) representation-based shifts have a greater impact on the models than the other shift types, and 4) pre-trained bimodal models are relatively more resistant to distribution shifts.
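As a concrete illustration of finding 1, the sketch below implements maximum softmax probability (MSP), a standard confidence-based OOD baseline from computer vision of the kind the paper evaluates on source code. The function name, toy logits, and NumPy implementation are illustrative assumptions, not code from CodeS.

```python
import numpy as np

def msp_ood_score(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax probability (MSP) OOD score (illustrative sketch).

    Takes an (n_samples, n_classes) array of classifier logits and returns
    one score per sample; higher scores suggest the sample is more
    out-of-distribution (the model is less confident about it).
    """
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # MSP flags low-confidence predictions as OOD: score = 1 - max probability.
    return 1.0 - probs.max(axis=1)

# Toy usage with hypothetical logits from a code classification model:
logits = np.array([
    [8.0, 0.5, 0.1],   # confident prediction -> low OOD score
    [7.5, 0.2, 0.3],   # confident prediction -> low OOD score
    [1.1, 1.0, 0.9],   # near-uniform logits  -> high OOD score
])
print(msp_ood_score(logits))
```

Detectors in this family score a sample purely by prediction confidence; the paper's finding is that such scores, effective on image benchmarks, do not transfer reliably to the code distribution shifts in CodeS.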