Generating Question Titles for Stack Overflow from Mined Code Snippets

Paper Title

Generating Question Titles for Stack Overflow from Mined Code Snippets

Authors

Gao, Zhipeng; Xia, Xin; Grundy, John; Lo, David; Li, Yuan-Fang

Abstract

Software developers rely heavily on Stack Overflow as a popular way to seek programming-related information from peers over the internet. The Stack Overflow community recommends that users include the relevant code snippet when creating a question, to help others understand it and offer help. Previous studies have shown that a significant number of these questions are of low quality and unattractive to potential experts on Stack Overflow. Such poorly asked questions are less likely to receive useful answers, and they hinder the overall knowledge generation and sharing process. One reason for low-quality questions on Stack Overflow is that many developers cannot clarify and summarize the key problem behind the code snippet they present, because they lack knowledge and terminology related to the problem, have poor writing skills, or both. In this study, we propose an approach that assists developers in writing high-quality questions by automatically generating question titles for a code snippet using deep sequence-to-sequence learning. Our approach is fully data-driven and uses an attention mechanism to perform better content selection, a copy mechanism to handle the rare-words problem, and a coverage mechanism to eliminate the word-repetition problem. We evaluate our approach on Stack Overflow datasets covering a variety of programming languages (e.g., Python, Java, JavaScript, C#, and SQL), and our experimental results show that it significantly outperforms several state-of-the-art baselines in both automatic and human evaluation. We have released our code and datasets so that other researchers can verify their ideas and to inspire follow-up work.
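The three mechanisms named in the abstract can be illustrated with a minimal sketch of a single decoding step. This is not the authors' released code. It assumes the widely used pointer-generator formulation, in which the output distribution mixes a vocabulary distribution with attention over source tokens and a coverage vector accumulates past attention. The function names, the `p_gen` parameter, and all input values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_step(source_tokens, attn_scores, vocab_probs, p_gen, coverage):
    """One illustrative decoder step (assumed formulation, not the paper's code).

    source_tokens: tokens of the code snippet
    attn_scores:   hypothetical raw alignment scores, one per source token
    vocab_probs:   hypothetical generator distribution over a fixed vocabulary
    p_gen:         probability of generating a word rather than copying one
    coverage:      attention mass each source token has received so far
    """
    # Attention: normalise the scores so the step focuses on source content.
    attention = softmax(attn_scores)

    # Coverage penalty: overlap between the new attention and past attention.
    # Attending again to already-covered tokens raises this value.
    coverage_loss = sum(min(a, c) for a, c in zip(attention, coverage))

    # Copy: mix the vocabulary distribution with attention over source tokens,
    # so a rare token missing from the vocabulary can still be emitted.
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    for token, a in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * a

    new_coverage = [c + a for c, a in zip(coverage, attention)]
    return final, new_coverage, coverage_loss
```

In this sketch, a source token such as an identifier that is absent from `vocab_probs` still receives probability mass through the copy term, and calling `decode_step` a second time with the returned coverage vector and the same scores yields a positive `coverage_loss`, which a training objective could penalise to discourage repeated words.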
