Paper Title

To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set?

Authors

Matteo Ciniselli, Luca Pascarella, Gabriele Bavota

Abstract

Deep Learning (DL) models have been widely used to support code completion. Once properly trained, these models can take as input an incomplete code component (e.g., an incomplete function) and predict the missing tokens needed to finalize it. GitHub Copilot is an example of a code recommender built by training a DL model on millions of open source repositories: the source code of these repositories acts as training data, allowing the model to learn "how to program". The usage of such code is usually regulated by Free and Open Source Software (FOSS) licenses, which establish the conditions under which the licensed code can be redistributed or modified. As of today, it is unclear whether the code generated by DL models trained on open source code should be considered "new" or "derivative" work, with possible implications for license infringement. In this work, we run a large-scale study investigating the extent to which DL models tend to clone code from their training set when recommending code completions. Such an exploratory study can help in assessing the magnitude of the potential licensing issues mentioned above: if these models tend to generate new code that is unseen in the training set, then licensing issues are unlikely to occur; otherwise, a revision of these licenses would be needed to regulate how the code generated by these models should be treated when used, for example, in a commercial setting. Highlights from our results show that ~10% to ~0.1% of the predictions generated by a state-of-the-art DL-based code completion tool are Type-1 clones of instances in the training set, depending on the size of the predicted code. Long predictions are unlikely to be clones.
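The study's key measurement is whether a model's prediction is a Type-1 clone of a training-set snippet, i.e., identical code except for differences in whitespace, layout, and comments. The paper does not specify its detection tooling here, so the following is only a minimal Python sketch of what such a check might look like; the helper names and the Java-style comment handling are illustrative assumptions.

```python
import re


def normalize(code: str) -> str:
    """Strip comments and collapse whitespace (illustrative, Java-style syntax)."""
    code = re.sub(r"//.*", "", code)                    # drop line comments
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)   # drop block comments
    return " ".join(code.split())                       # collapse all whitespace


def is_type1_clone(prediction: str, training_snippet: str) -> bool:
    """Type-1 clones are identical modulo whitespace, layout, and comments."""
    return normalize(prediction) == normalize(training_snippet)


# The prediction matches the training snippet once layout and comments are ignored.
pred = "int sum(int a, int b) { return a + b; }"
train = """int sum(int a, int b) {
    // add two numbers
    return a + b;
}"""
print(is_type1_clone(pred, train))  # True
```

Detecting Type-2 or Type-3 clones (renamed identifiers, added or removed statements) would require token normalization or similarity thresholds rather than the exact string comparison shown here.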
