Paper Title

RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval

Paper Authors

Burak Satar, Hongyuan Zhu, Hanwang Zhang, Joo Hwee Lim

Paper Abstract

Vast numbers of videos are uploaded daily as social channels grow in popularity; thus, retrieving the video content most relevant to a user's textual query plays an increasingly crucial role. Most methods consider only one joint embedding space between global visual and textual features, without accounting for the local structure of each modality. Some other approaches use multiple embedding spaces, one each for global and local features, but ignore rich inter-modality correlations. We propose RoME, a novel mixture-of-experts transformer that disentangles text and video into three levels: the roles of spatial contexts, temporal contexts, and object contexts. We utilize a transformer-based attention mechanism with a mixture of experts to fully exploit visual and text embeddings at both global and local levels, accounting for inter-modality and structural correlations. The results indicate that our method outperforms state-of-the-art methods on the YouCook2 and MSR-VTT datasets, given the same visual backbone and no pre-training. Finally, we conduct extensive ablation studies to elucidate our design choices.
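The abstract describes the architecture only at a high level. As a rough illustration of the mixture-of-experts retrieval idea it outlines (one embedding space per role, scores combined across experts), here is a minimal PyTorch sketch; the class names (RoleExpert, MoERetrieval), the text-conditioned gating, and all dimensions are our own illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a role-aware mixture-of-experts retrieval scorer.
# All names, layer sizes, and the gating scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EXPERTS = ("spatial", "temporal", "object")  # the three role levels


class RoleExpert(nn.Module):
    """One expert: a small transformer encoder that projects one modality's
    features for a single role (spatial / temporal / object context)."""

    def __init__(self, in_dim: int, dim: int = 256, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, x):                           # x: (batch, tokens, in_dim)
        h = self.encoder(self.proj(x))              # contextualize tokens
        return F.normalize(h.mean(dim=1), dim=-1)   # pooled unit embedding


class MoERetrieval(nn.Module):
    """Scores a video/text pair as a gated sum of per-expert cosine
    similarities, one joint embedding space per role."""

    def __init__(self, video_dims: dict, text_dim: int, dim: int = 256):
        super().__init__()
        self.video_experts = nn.ModuleDict(
            {r: RoleExpert(video_dims[r], dim) for r in EXPERTS})
        self.text_experts = nn.ModuleDict(
            {r: RoleExpert(text_dim, dim) for r in EXPERTS})
        # Text-conditioned mixture weights over the three experts.
        self.gate = nn.Linear(text_dim, len(EXPERTS))

    def forward(self, video_feats: dict, text_tokens):
        w = F.softmax(self.gate(text_tokens.mean(dim=1)), dim=-1)  # (B, 3)
        sims = []
        for i, r in enumerate(EXPERTS):
            v = self.video_experts[r](video_feats[r])    # (B, dim)
            t = self.text_experts[r](text_tokens)        # (B, dim)
            sims.append(w[:, i] * (v * t).sum(dim=-1))   # weighted cosine
        return torch.stack(sims, dim=-1).sum(dim=-1)     # (B,) pair scores


# Example usage with random features (dimensions are arbitrary assumptions):
model = MoERetrieval({"spatial": 2048, "temporal": 1024, "object": 512},
                     text_dim=768)
video = {"spatial": torch.randn(2, 8, 2048),
         "temporal": torch.randn(2, 8, 1024),
         "object": torch.randn(2, 16, 512)}
scores = model(video, torch.randn(2, 20, 768))  # -> tensor of shape (2,)
```

A real system would train these pair scores with a contrastive ranking loss and feed each role's video features from a different backbone; both parts are omitted here for brevity.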
