Paper Title
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Paper Authors
Paper Abstract
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on its multi-head outputs, which demonstrates better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both answer generation and selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
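The abstract's "shuffling scheme on its multi-head outputs" can be pictured as randomly permuting the per-head attention outputs along the head axis before the output projection, so the projection cannot overfit to a fixed head ordering. Below is a minimal PyTorch sketch of that idea; the module name, layer sizes, and the choice to shuffle only during training are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ShuffledMultiHeadAttention(nn.Module):
    """Multi-head attention whose per-head outputs are randomly permuted
    across the head axis before the output projection (a sketch of the
    head-shuffling regularization described in the abstract)."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_k = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        B, Tq, _ = query.shape
        Tk = key.shape[1]
        # Split into heads: (B, h, T, d_k).
        q = self.q_proj(query).view(B, Tq, self.h, self.d_k).transpose(1, 2)
        k = self.k_proj(key).view(B, Tk, self.h, self.d_k).transpose(1, 2)
        v = self.v_proj(value).view(B, Tk, self.h, self.d_k).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = attn @ v  # (B, h, Tq, d_k)
        if self.training:
            # Shuffle the head outputs so the output projection cannot
            # rely on a fixed head ordering (assumed train-time only).
            perm = torch.randperm(self.h, device=heads.device)
            heads = heads[:, perm]
        out = heads.transpose(1, 2).reshape(B, Tq, self.h * self.d_k)
        return self.out_proj(out)
```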
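Similarly, the two-stage visual encoder (intra-frame graph reasoning followed by inter-frame aggregation) can be sketched as one message-passing step over the object proposals of each frame, pooled and then aggregated over time. In this hypothetical sketch the fully connected object graph, mean-pooling, and GRU aggregator are simplifying stand-ins for the paper's components.

```python
import torch
import torch.nn as nn

class DynamicSceneGraphEncoder(nn.Module):
    """Per-frame graph reasoning over detected object features, followed
    by temporal aggregation across frames (a simplified stand-in for the
    intra-frame reasoning layer and inter-frame aggregation module)."""

    def __init__(self, obj_dim=2048, hid_dim=512):
        super().__init__()
        self.node_proj = nn.Linear(obj_dim, hid_dim)
        self.msg = nn.Linear(2 * hid_dim, hid_dim)       # pairwise edge messages
        self.temporal = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, objects):
        # objects: (B, T, N, obj_dim) -- N object proposals per frame.
        B, T, N, _ = objects.shape
        x = torch.relu(self.node_proj(objects))           # (B, T, N, H)
        # Intra-frame reasoning: every node receives a message from
        # every other node in the same frame (fully connected graph).
        sender = x.unsqueeze(3).expand(-1, -1, -1, N, -1)  # (B, T, N, N, H)
        receiver = x.unsqueeze(2).expand(-1, -1, N, -1, -1)
        messages = torch.relu(self.msg(torch.cat([sender, receiver], dim=-1)))
        x = x + messages.mean(dim=2)                      # aggregate over senders
        # Pool objects into a frame vector, then aggregate over time.
        frame = x.mean(dim=2)                             # (B, T, H)
        out, _ = self.temporal(frame)                     # (B, T, H)
        return out  # temporally contextualized per-frame representations

# Example: 4 videos, 10 frames, 36 object proposals of dimension 2048 each.
# enc = DynamicSceneGraphEncoder()
# feats = enc(torch.randn(4, 10, 36, 2048))  # -> (4, 10, 512)
```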