Paper Title
Search4Code: Code Search Intent Classification Using Weak Supervision
Authors
Abstract
Developers use search for various tasks, such as finding code, documentation, and debugging information. In particular, developers rely heavily on web search to find code examples and snippets while coding. Natural-language-based code search has recently been an active area of research, but the lack of large-scale real-world datasets is a significant bottleneck. In this work, we propose a weak-supervision-based approach for detecting code search intent in search queries for the C# and Java programming languages. We evaluate the approach against several baselines on a real-world dataset comprising over 1 million queries mined from the Bing web search engine, and show that the CNN-based model achieves an accuracy of 77% for C# and 76% for Java. Furthermore, we are releasing Search4Code, the first large-scale real-world dataset of code search queries mined from the Bing web search engine. We hope that the dataset will aid future research on code search.