Paper Title

Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Authors

Thao Minh Le, Vuong Le, Svetha Venkatesh, Truyen Tran

Abstract

Video QA challenges modelers on multiple fronts. Modeling video necessitates building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two more layers of complexity: selecting relevant content for each channel in the context of the linguistic query, and composing spatio-temporal concepts and relations in response to the query. To address these requirements, we start with two insights: (a) content selection and relation construction can be jointly encapsulated into a conditional computational structure, and (b) video-length structures can be composed hierarchically. For (a), this paper introduces a general, reusable neural unit dubbed the Conditional Relation Network (CRN), which takes as input a set of tensorial objects and translates them into a new set of objects that encode relations among the inputs. The generic design of the CRN eases the typically complex model-building process of Video QA through simple block stacking, with the flexibility to accommodate input modalities and conditioning features across different domains. Building on this, we realize insight (b) by introducing the Hierarchical Conditional Relation Network (HCRN) for Video QA. The HCRN primarily aims at exploiting intrinsic properties of the visual content of a video and its accompanying channels in terms of compositionality, hierarchy, and near- and far-term relations. The HCRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information, such as subtitles, is also presented. Our rigorous evaluations show consistent improvements over SOTAs on well-studied benchmarks, including large-scale real-world datasets such as TGIF-QA and TVQA, demonstrating the strong capabilities of our CRN unit and of the HCRN for complex domains such as Video QA.
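To make the CRN idea concrete, below is a minimal NumPy sketch of such a unit: it takes a set of feature vectors and a conditioning feature (e.g., a query encoding), pools each k-subset of the inputs to model their relations, and fuses the result with the condition. The mean-pooling aggregation and element-wise fusion here are simplifying assumptions standing in for the learned sub-networks of the actual model; `crn_unit` and its parameters are hypothetical names, not the authors' implementation.

```python
import itertools
import numpy as np

def crn_unit(inputs, condition, max_subset_size=None):
    """Sketch of a Conditional Relation Network (CRN) unit.

    inputs: list of n feature vectors, each of shape (d,)
    condition: conditioning feature vector of shape (d,)
    Returns one relation-encoding vector per subset size k = 2..k_max.
    """
    n = len(inputs)
    k_max = max_subset_size if max_subset_size is not None else n - 1
    outputs = []
    for k in range(2, k_max + 1):
        # Aggregate each k-subset of inputs; mean-pooling is a stand-in
        # for the learned relational sub-network in the paper.
        subset_feats = [np.mean([inputs[i] for i in s], axis=0)
                        for s in itertools.combinations(range(n), k)]
        pooled = np.mean(subset_feats, axis=0)
        # Condition the pooled relation on the query/context feature
        # (element-wise fusion is an illustrative simplification).
        outputs.append(pooled * condition)
    return outputs
```

Because the unit maps a set of objects to a new set of objects of the same dimensionality, its outputs can be fed into another CRN, which is what allows the HCRN to be built by simple hierarchical stacking: clip-level CRNs over frame features, then a video-level CRN over clip outputs.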
