Paper Title
A Revised Generative Evaluation of Visual Dialogue
Paper Authors
Paper Abstract
Evaluating Visual Dialogue, the task of answering a sequence of questions relating to a visual input, remains an open research challenge. The current evaluation scheme of the VisDial dataset computes the ranks of ground-truth answers in predefined candidate sets, which Massiceti et al. (2018) show can be susceptible to the exploitation of dataset biases. This scheme also does little to account for the different ways of expressing the same answer--an aspect of language that has been well studied in NLP. We propose a revised evaluation scheme for the VisDial dataset leveraging metrics from the NLP literature to measure consensus between answers generated by the model and a set of relevant answers. We construct these relevant answer sets using a simple and effective semi-supervised method based on correlation, which allows us to automatically extend and scale sparse relevance annotations from humans to the entire dataset. We release these sets and code for the revised evaluation scheme as DenseVisDial, and intend them to be an improvement to the dataset in the face of its existing constraints and design choices.
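The released DenseVisDial code is not reproduced here. As a rough illustration of the consensus-style scoring the abstract describes, the sketch below scores a generated answer against a set of relevant reference answers using sentence-level BLEU from NLTK, standing in for the NLP metrics the paper draws on. The answer strings and the helper name `consensus_score` are illustrative assumptions, not part of the released evaluation scheme.

```python
# Minimal sketch of consensus-based generative evaluation, assuming NLTK is
# available. Sentence-level BLEU stands in for the NLP metrics referenced in
# the abstract; the relevant-answer set below is a made-up example, not data
# from DenseVisDial.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def consensus_score(generated: str, relevant_answers: list[str]) -> float:
    """Score a generated answer against a set of relevant reference answers."""
    references = [ans.lower().split() for ans in relevant_answers]
    hypothesis = generated.lower().split()
    smoother = SmoothingFunction().method1  # avoid zero scores on short answers
    return sentence_bleu(references, hypothesis, smoothing_function=smoother)


if __name__ == "__main__":
    # Hypothetical relevant-answer set for a VisDial-style question.
    relevant = ["yes it is", "yes", "yes, it is sunny"]
    print(consensus_score("yes it is sunny", relevant))
```

In this framing, a model is rewarded for producing any phrasing that agrees with the set of relevant answers, rather than for ranking a single ground-truth candidate highly.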