Paper Title

Image Captioning in the Transformer Age

Paper Authors

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai

Paper Abstract

Image Captioning (IC) has achieved astonishing progress by incorporating various techniques into the CNN-RNN encoder-decoder architecture. However, since CNNs and RNNs do not share basic network components, such a heterogeneous pipeline is hard to train end-to-end, and the visual encoder learns nothing from the caption supervision. This drawback has inspired researchers to develop a homogeneous architecture that facilitates end-to-end training, for which the Transformer is a perfect fit: it has proven its huge potential in both the vision and language domains and can thus serve as the basic component of both the visual encoder and the language decoder in an IC pipeline. Meanwhile, self-supervised learning unleashes the power of the Transformer architecture, in that a large-scale pre-trained model can be generalized to various tasks, including IC. The success of these large-scale models seems to diminish the importance of the single IC task. However, we demonstrate that IC still has its specific significance in this age by analyzing the connections between IC and some popular self-supervised learning paradigms. Due to the page limit, we refer only to highly important papers in this short survey; more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.
