BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Paper Title

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Authors

Hung Le, Doyen Sahoo, Nancy F. Chen, Steven C. H. Hoi

Abstract

Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.
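The two reasoning directions named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the function names (`attend`, `bidirectional_pool`), the scaled dot-product scoring, the tensor layout (frames, regions, feature dimension), and the concatenation of the two outputs are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(features, query, axis):
    # Hypothetical helper: query-guided attention pooling over one axis.
    # features: (..., d), query: (d,). The pooled axis is removed.
    d = query.shape[-1]
    scores = features @ query / np.sqrt(d)      # one score per feature vector
    weights = softmax(scores, axis=axis)        # normalise over the pooled axis
    return (weights[..., None] * features).sum(axis=axis)

def bidirectional_pool(video, query):
    # video: (T, S, d) = frames x spatial regions x feature dim (assumed layout).
    # Spatial-to-temporal: pool regions within each frame, then pool frames.
    s2t = attend(attend(video, query, axis=1), query, axis=0)
    # Temporal-to-spatial: pool frames for each region, then pool regions.
    t2s = attend(attend(video, query, axis=0), query, axis=0)
    # Assumed fusion: concatenate the two text-conditioned summaries.
    return np.concatenate([s2t, t2s])

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 6, 8))   # 4 frames, 6 regions, 8-dim features
query = rng.normal(size=8)           # stand-in for an encoded user utterance
context = bidirectional_pool(video, query)
print(context.shape)  # (16,)
```

The two paths visit the same features in a different order, so the text query can first select a region and then a moment, or first a moment and then a region. The resulting vector would serve as the visual context for response generation in a full model.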
