Paper Title
Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
Paper Authors
Abstract
Audio question answering (AQA) is a multimodal translation task in which a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1,991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers through crowdsourcing on Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio file, two are designed to have 'yes' as the answer and two to have 'no', while the remaining two questions have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to illustrate the use of our dataset for the AQA task: an LSTM-based multimodal binary classifier for 'yes'/'no' answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieves an accuracy of 62.7%, and the multi-class classifier achieves a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.
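To make the baseline setup in the abstract concrete, below is a minimal PyTorch sketch of an LSTM-based multimodal answer classifier: one LSTM encodes audio feature frames, another encodes the embedded question words, and their final hidden states are concatenated and fed to a linear classification layer. The class name `AQAClassifier`, the feature choice (log-mel frames), and all hyperparameter values are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AQAClassifier(nn.Module):
    """Minimal multimodal LSTM classifier sketch for audio question answering.

    An audio LSTM encodes a sequence of spectral frames (assumed here to be
    log-mel spectrogram frames) and a text LSTM encodes the embedded question
    words; the two final hidden states are concatenated and passed to a linear
    classifier over answer classes. Hyperparameters are illustrative only.
    """

    def __init__(self, n_mels=64, vocab_size=10000, embed_dim=128,
                 hidden_dim=256, n_classes=2):
        super().__init__()
        self.audio_lstm = nn.LSTM(n_mels, hidden_dim, batch_first=True)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, audio_feats, question_ids):
        # audio_feats: (batch, time, n_mels); question_ids: (batch, words)
        _, (audio_h, _) = self.audio_lstm(audio_feats)
        _, (text_h, _) = self.text_lstm(self.embedding(question_ids))
        # Concatenate the last-layer hidden states of both encoders.
        fused = torch.cat([audio_h[-1], text_h[-1]], dim=-1)
        return self.classifier(fused)  # logits over answer classes


# Binary 'yes'/'no' setting (n_classes=2); for the single-word answer task
# one would instead set n_classes=828, matching the abstract's answer set.
model = AQAClassifier(n_classes=2)
audio = torch.randn(4, 300, 64)              # 4 clips, 300 frames, 64 mel bins
question = torch.randint(0, 10000, (4, 12))  # 4 questions, 12 word ids each
logits = model(audio, question)
print(logits.shape)                          # torch.Size([4, 2])
```

Late fusion by concatenating the two encoders' final states is one simple way to combine the modalities; the paper's baselines may differ in feature extraction, fusion, and training details.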