Large Scale Generation of Labeled Type Data for Python

Paper Title

Large Scale Generation of Labeled Type Data for Python

Authors

Ibrahim Abdelaziz, Julian Dolby, Kavitha Srinivas

Abstract

Recently, dynamically typed languages, such as Python, have gained unprecedented popularity. Although these languages alleviate the need for mandatory type annotations, types still play a critical role in program understanding and preventing runtime errors. An attractive option is to infer types automatically to get static guarantees without writing types. Existing inference techniques rely mostly on static typing tools such as PyType for direct type inference; more recently, neural type inference has been proposed. However, neural type inference is data hungry, and depends on collecting labeled data based on static typing. Such tools, however, are poor at inferring user defined types. Furthermore, type annotation by developers in these languages is quite sparse. In this work, we propose novel techniques for generating high quality types using 1) information retrieval techniques that work on well documented libraries to extract types and 2) usage patterns by analyzing a large repository of programs. Our results show that these techniques are more precise and address the weaknesses of static tools, and can be useful for generating a large labeled dataset for type inference by machine learning methods. F1 scores are 0.52-0.58 for our techniques, compared to static typing tools which are at 0.06, and we use them to generate over 37,000 types for over 700 modules.

Paper Title