Paper Title
Panoramic Vision Transformer for Saliency Detection in 360° Videos
Paper Authors
Paper Abstract
360$^\circ$ video saliency detection is one of the challenging benchmarks for 360$^\circ$ video understanding since non-negligible distortion and discontinuity occur in the projection of any format of 360$^\circ$ videos, and capture-worthy viewpoint in the omnidirectional sphere is ambiguous by nature. We present a new framework named Panoramic Vision Transformer (PAVER). We design the encoder using Vision Transformer with deformable convolution, which enables us not only to plug pretrained models from normal videos into our architecture without additional modules or finetuning but also to perform geometric approximation only once, unlike previous deep CNN-based approaches. Thanks to its powerful encoder, PAVER can learn the saliency from three simple relative relations among local patch features, outperforming state-of-the-art models for the Wild360 benchmark by large margins without supervision or auxiliary information like class activation. We demonstrate the utility of our saliency prediction model with the omnidirectional video quality assessment task in VQA-ODV, where we consistently improve performance without any form of supervision, including head movement.
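Below is a minimal, hypothetical sketch (not the authors' released code) of the encoder idea the abstract describes: an equirectangular frame is embedded with a deformable convolution whose sampling offsets are fixed once from spherical geometry, so that pretrained ViT patch-embedding weights can be reused without extra modules. The class name, the zero-initialized offsets, and the grid size are illustrative assumptions.

```python
# Sketch only: deformable patch embedding for an equirectangular frame.
# Assumes torchvision >= 0.8 for torchvision.ops.deform_conv2d.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformablePatchEmbed(nn.Module):
    def __init__(self, patch=16, in_ch=3, dim=768, grid_hw=(14, 28)):
        super().__init__()
        # Weights can be copied from a pretrained ViT patch-embedding conv.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        # Offsets are precomputed once; zeros here as a placeholder, whereas
        # a real model would derive them from the sphere-to-plane distortion
        # at each patch location ("geometric approximation only once").
        h, w = grid_hw
        self.register_buffer("offset", torch.zeros(1, 2 * patch * patch, h, w))

    def forward(self, x):
        # x: (B, 3, H, W) equirectangular frame, e.g. H=224, W=448.
        out = deform_conv2d(
            x,
            self.offset.expand(x.size(0), -1, -1, -1),
            self.proj.weight,
            self.proj.bias,
            stride=self.proj.stride,
            padding=self.proj.padding,
        )
        # Flatten to patch tokens for a standard ViT encoder.
        return out.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
```

With distortion-aware offsets in place of the placeholder zeros, the resulting tokens can feed an off-the-shelf transformer encoder, which matches the abstract's claim that pretrained models for normal videos plug in without additional modules or finetuning.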