Paper Title
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Paper Authors
Paper Abstract
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on its multi-head outputs, which demonstrates better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, on both answer generation and selection tasks. Our results demonstrate state-of-the-art performance on all evaluation metrics.
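The abstract's "shuffling scheme on its multi-head outputs" can be pictured as randomly permuting the per-head attention outputs along the head axis before the output projection, so the projection cannot overfit to a fixed head ordering. Below is a minimal PyTorch sketch of that idea; the module name, layer sizes, and the choice to shuffle only during training are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ShuffledMultiHeadAttention(nn.Module):
    """Multi-head attention whose per-head outputs are randomly permuted
    across the head axis before the output projection (a sketch of the
    head-shuffling regularization described in the abstract)."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_k = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        B, Tq, _ = query.shape
        Tk = key.shape[1]
        # Split into heads: (B, h, T, d_k).
        q = self.q_proj(query).view(B, Tq, self.h, self.d_k).transpose(1, 2)
        k = self.k_proj(key).view(B, Tk, self.h, self.d_k).transpose(1, 2)
        v = self.v_proj(value).view(B, Tk, self.h, self.d_k).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = attn @ v  # (B, h, Tq, d_k)
        if self.training:
            # Shuffle the head outputs so the output projection cannot
            # rely on a fixed head ordering (assumed train-time only).
            perm = torch.randperm(self.h, device=heads.device)
            heads = heads[:, perm]
        out = heads.transpose(1, 2).reshape(B, Tq, self.h * self.d_k)
        return self.out_proj(out)
```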
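Similarly, the two-stage visual encoder (intra-frame graph reasoning followed by inter-frame aggregation) can be sketched as one message-passing step over the object proposals of each frame, pooled and then aggregated over time. In this hypothetical sketch the fully connected object graph, mean-pooling, and GRU aggregator are simplifying stand-ins for the paper's components.

```python
import torch
import torch.nn as nn

class DynamicSceneGraphEncoder(nn.Module):
    """Per-frame graph reasoning over detected object features, followed
    by temporal aggregation across frames (a simplified stand-in for the
    intra-frame reasoning layer and inter-frame aggregation module)."""

    def __init__(self, obj_dim=2048, hid_dim=512):
        super().__init__()
        self.node_proj = nn.Linear(obj_dim, hid_dim)
        self.msg = nn.Linear(2 * hid_dim, hid_dim)       # pairwise edge messages
        self.temporal = nn.GRU(hid_dim, hid_dim, batch_first=True)

    def forward(self, objects):
        # objects: (B, T, N, obj_dim) -- N object proposals per frame.
        B, T, N, _ = objects.shape
        x = torch.relu(self.node_proj(objects))           # (B, T, N, H)
        # Intra-frame reasoning: every node receives a message from
        # every other node in the same frame (fully connected graph).
        sender = x.unsqueeze(3).expand(-1, -1, -1, N, -1)  # (B, T, N, N, H)
        receiver = x.unsqueeze(2).expand(-1, -1, N, -1, -1)
        messages = torch.relu(self.msg(torch.cat([sender, receiver], dim=-1)))
        x = x + messages.mean(dim=2)                      # aggregate over senders
        # Pool objects into a frame vector, then aggregate over time.
        frame = x.mean(dim=2)                             # (B, T, H)
        out, _ = self.temporal(frame)                     # (B, T, H)
        return out  # temporally contextualized per-frame representations

# Example: 4 videos, 10 frames, 36 object proposals of dimension 2048 each.
# enc = DynamicSceneGraphEncoder()
# feats = enc(torch.randn(4, 10, 36, 2048))  # -> (4, 10, 512)
```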