Filtering Noisy Dialogue Corpora by Connectivity and Content Relatedness

Paper Title

Filtering Noisy Dialogue Corpora by Connectivity and Content Relatedness

Authors

Reina Akama, Sho Yokoi, Jun Suzuki, Kentaro Inui

Abstract

Large-scale dialogue datasets have recently become available for training neural dialogue agents. However, these datasets have been reported to contain a non-negligible number of unacceptable utterance pairs. In this paper, we propose a method for scoring the quality of utterance pairs in terms of their connectivity and relatedness. The proposed scoring method is designed based on findings widely shared in the dialogue and linguistics research communities. We demonstrate that it has a relatively good correlation with the human judgment of dialogue quality. Furthermore, the method is applied to filter out potentially unacceptable utterance pairs from a large-scale noisy dialogue corpus to ensure its quality. We experimentally confirm that training data filtered by the proposed method improves the quality of neural dialogue agents in response generation.
