Paper Title

DALG: Deep Attentive Local and Global Modeling for Image Retrieval

Authors

Yuxin Song, Ruolin Zhu, Min Yang, Dongliang He

Abstract

Deeply learned representations have achieved superior image retrieval performance in a retrieve-then-rerank manner. The recent state-of-the-art single-stage model, which heuristically fuses local and global features, achieves a promising trade-off between efficiency and effectiveness. However, we notice that the efficiency of existing solutions is still restricted by their multi-scale inference paradigm. In this paper, we follow the single-stage art and obtain a further complexity-effectiveness balance by successfully getting rid of multi-scale testing. To achieve this goal, we abandon the widely used convolutional network, given its limitations in exploring diverse visual patterns, and resort to a fully attention-based framework for robust representation learning, motivated by the success of the Transformer. Besides applying a Transformer for global feature extraction, we devise a local branch composed of window-based multi-head attention and spatial attention to fully exploit local image patterns. Furthermore, we propose to combine the hierarchical local and global features via a cross-attention module, instead of using heuristic fusion as previous art does. With our Deep Attentive Local and Global modeling framework (DALG), extensive experimental results show that efficiency can be significantly improved while maintaining competitive results with the state of the art.
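The abstract names a cross-attention module that fuses local and global features but does not spell out its form. A minimal sketch of one plausible design, in which the global descriptor queries the local tokens (all module names, dimensions, and the residual-plus-norm merge are illustrative assumptions, not details from the paper):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention fusion: the global descriptor attends
    over the local tokens, and the attended context is merged back into it.
    This is a sketch of the general technique, not the paper's exact module."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, dim)    one global descriptor per image
        # local_feats: (B, N, dim) N local tokens per image
        q = global_feat.unsqueeze(1)                      # (B, 1, dim) query
        ctx, _ = self.attn(q, local_feats, local_feats)   # attend over local tokens
        fused = self.norm(q + ctx)                        # residual + layer norm
        return fused.squeeze(1)                           # (B, dim) fused descriptor

# Usage sketch with dummy tensors
fusion = CrossAttentionFusion(dim=256, num_heads=4)
g = torch.randn(2, 256)        # batch of 2 global descriptors
l = torch.randn(2, 49, 256)    # 49 local tokens each (e.g., a 7x7 feature map)
out = fusion(g, l)
print(out.shape)  # torch.Size([2, 256])
```

Querying with the single global token keeps the fusion cost linear in the number of local tokens, which is consistent with the paper's stated goal of avoiding the expense of multi-scale testing.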
