Paper Title
TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval
Paper Authors
Paper Abstract
Most existing methods in vision-language retrieval match the two modalities by either comparing their global feature vectors, which loses fine-grained information and lacks interpretability; detecting objects in images or videos and aligning the text with fine-grained features, which relies on complicated model designs; or modeling fine-grained interaction via cross-attention over visual and textual tokens, which suffers from inferior efficiency. To address these limitations, some recent works simply aggregate token-wise similarities to achieve fine-grained alignment, but they lack intuitive explanations and neglect the relationships between token-level features and global representations carrying high-level semantics. In this work, we rethink fine-grained cross-modal alignment and devise a new model-agnostic formulation for it. We further demystify recent popular works and subsume them into our scheme. Then, inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme. By modifying only the similarity function, our method performs comparably to SoTA algorithms with heavy model designs on major video-text retrieval benchmarks. Visualizations further show that TokenFlow successfully leverages fine-grained information and achieves better interpretability.
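To make the "modifying only the similarity function" idea concrete, below is a minimal, illustrative PyTorch sketch of an optimal-transport-style token similarity: token-wise cosine similarities are converted into a soft matching via standard Sinkhorn iterations (entropic OT), and the instance-level score is the matching-weighted aggregation of those similarities. This is an assumption-laden reconstruction, not the paper's actual TokenFlow implementation; the function names, the uniform token marginals, and the hyperparameters (`eps`, `n_iters`) are all hypothetical.

```python
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """Entropic-OT transport plan for a token-to-token cost matrix.

    cost: (n, m) pairwise cost between visual and textual tokens.
    Returns a soft matching (transport plan) of the same shape.
    Marginals are assumed uniform over tokens (an illustrative choice).
    """
    K = torch.exp(-cost / eps)                       # Gibbs kernel from the cost
    a = torch.full((cost.size(0),), 1.0 / cost.size(0))  # uniform visual marginal
    b = torch.full((cost.size(1),), 1.0 / cost.size(1))  # uniform textual marginal
    u = torch.ones_like(a)
    v = torch.ones_like(b)
    for _ in range(n_iters):                         # Sinkhorn fixed-point updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)       # diag(u) @ K @ diag(v)

def token_flow_similarity(visual_tokens, text_tokens):
    """Instance-level similarity from token-level features via OT.

    visual_tokens: (n, d), text_tokens: (m, d), both L2-normalized,
    so their inner products are cosine similarities.
    """
    sim = visual_tokens @ text_tokens.t()            # (n, m) token-wise similarities
    plan = sinkhorn(1.0 - sim)                       # cost = 1 - cosine similarity
    return (plan * sim).sum()                        # plan-weighted aggregation

# Usage with random stand-ins for patch and word tokens:
v = torch.nn.functional.normalize(torch.randn(12, 256), dim=-1)
t = torch.nn.functional.normalize(torch.randn(7, 256), dim=-1)
score = token_flow_similarity(v, t)
```

Because such a score is differentiable and depends only on the two token sets, it can, in principle, replace a global dot-product similarity in a contrastive retrieval loss without changing the encoders; the transport plan also provides an interpretable token-to-token matching for visualization, in the spirit of the abstract's interpretability claim.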