Paper Title

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Authors

Hyounghun Kim, Zineng Tang, Mohit Bansal

Abstract

Videos convey rich information. Dynamic spatio-temporal relationships between people/objects, and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks which can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, and hence give the model useful extra information (in explicit textual format to allow easier matching) for answering questions. Moreover, our model is also comprised of dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word, object, and frame level visualization studies. Our code is publicly available at: https://github.com/hyounghk/VideoQADenseCapFrameGate-ACL2020
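As a rough illustration of the frame-selection supervision mentioned in the abstract, the sketch below shows one plausible way to compute the two losses, IOFSM and BBCE, from per-frame sigmoid scores and a binary in/out mask derived from human timestamp annotations. The function names, tensor shapes, margin of 1, and reduction choices are assumptions for illustration only; they may differ from the authors' released code in the linked repository.

```python
# Hedged sketch of the two frame-selection losses (IOFSM and BBCE).
# Assumptions: frame_scores are per-frame sigmoid scores in (0, 1),
# in_mask is a float tensor with 1.0 for frames inside the annotated span.
import torch
import torch.nn.functional as F

EPS = 1e-8


def iofsm_loss(frame_scores: torch.Tensor, in_mask: torch.Tensor) -> torch.Tensor:
    """In-and-Out Frame Score Margin: push the average score of annotated
    ("in") frames above the average score of the remaining ("out") frames.

    frame_scores: (batch, num_frames) sigmoid scores
    in_mask:      (batch, num_frames) 1.0 for in-frames, 0.0 for out-frames
    """
    out_mask = 1.0 - in_mask
    in_avg = (frame_scores * in_mask).sum(dim=1) / (in_mask.sum(dim=1) + EPS)
    out_avg = (frame_scores * out_mask).sum(dim=1) / (out_mask.sum(dim=1) + EPS)
    # Encourage a margin (here 1) between in-frame and out-frame average scores.
    return (1.0 - in_avg + out_avg).mean()


def bbce_loss(frame_scores: torch.Tensor, in_mask: torch.Tensor) -> torch.Tensor:
    """Balanced Binary Cross-Entropy: average the per-frame BCE terms
    separately over in-frames and out-frames so the (usually far more
    numerous) out-frames do not dominate the gradient."""
    scores = frame_scores.clamp(EPS, 1.0 - EPS)
    bce = F.binary_cross_entropy(scores, in_mask, reduction="none")
    out_mask = 1.0 - in_mask
    in_term = (bce * in_mask).sum(dim=1) / (in_mask.sum(dim=1) + EPS)
    out_term = (bce * out_mask).sum(dim=1) / (out_mask.sum(dim=1) + EPS)
    return (in_term + out_term).mean()
```

In the setup the abstract describes, these frame-selection terms supplement the answer-classification objective so that the model is also supervised by the human importance annotations.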
