Paper Title
Revisiting the "Video" in Video-Language Understanding
Paper Authors
Paper Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
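To make the atemporal-baseline idea in the abstract concrete, below is a minimal sketch of an ATP-style probe: frozen image-language embeddings of a few sampled frames are scored without any temporal ordering information, a single frame embedding is selected, and only that one embedding is compared against the text query. This is an illustrative sketch under stated assumptions, not the authors' released implementation; the class name AtemporalProbeSketch, the layer sizes, and the Gumbel-softmax hard selection are assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AtemporalProbeSketch(nn.Module):
    """Sketch of an atemporal frame-selection baseline (not the paper's code).

    Input: frozen image-language embeddings of N sparsely sampled frames and a
    frozen text embedding. A shallow transformer without positional encoding
    (so frame order carries no signal) scores each frame, one frame embedding
    is selected, and the prediction is its similarity to the text query. The
    result is therefore bounded by single-image-level understanding.
    """

    def __init__(self, embed_dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.scorer = nn.Linear(embed_dim, 1)  # per-frame selection logit

    def forward(self, frame_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frame_embeds: (B, N, D) frozen image embeddings of N candidate frames
        # text_embeds:  (B, D)    frozen text embedding of the query/caption
        h = self.encoder(frame_embeds)          # no positional encoding: order-agnostic
        logits = self.scorer(h).squeeze(-1)     # (B, N) selection logits
        if self.training:
            # hard but differentiable selection of a single frame (assumption)
            weights = F.gumbel_softmax(logits, tau=1.0, hard=True)
        else:
            weights = F.one_hot(logits.argmax(-1), logits.size(1)).float()
        selected = (weights.unsqueeze(-1) * frame_embeds).sum(dim=1)  # (B, D) one frame
        return F.cosine_similarity(selected, text_embeds, dim=-1)     # (B,) score


if __name__ == "__main__":
    # toy usage with random stand-ins for frozen embeddings (B=2, N=8 frames, D=512)
    probe = AtemporalProbeSketch()
    frames = torch.randn(2, 8, 512)
    text = torch.randn(2, 512)
    print(probe(frames, text).shape)  # torch.Size([2])
```

In this reading, if such a probe matches a full temporal model on a benchmark, strong performance there does not require understanding event temporality, which is the diagnostic use the abstract describes.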