Paper Title


CaMEL: Mean Teacher Learning for Image Captioning

Authors

Manuele Barraco, Matteo Stefanini, Marcella Cornia, Silvia Cascianelli, Lorenzo Baraldi, Rita Cucchiara

Abstract


Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay between the two language models follows a mean teacher learning paradigm with knowledge distillation. Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors. When comparing with existing proposals, we demonstrate that our model provides state-of-the-art caption quality with a significantly reduced number of parameters. According to the CIDEr metric, we obtain a new state of the art on COCO when training without using external data. The source code and trained models are publicly available at: https://github.com/aimagelab/camel.
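The mean teacher paradigm mentioned in the abstract keeps the teacher's weights as an exponential moving average (EMA) of the student's, while the student additionally distills from the teacher's predictions. The following is a minimal sketch of that EMA update only; the function name and the decay value are illustrative assumptions, not details taken from the paper.

```python
def ema_update(teacher_params, student_params, decay=0.999):
    """Move each teacher parameter towards the corresponding student
    parameter (mean-teacher EMA update). Parameters are plain floats
    here for illustration; in practice they would be model tensors."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# Example: with decay=0.9 the teacher moves 10% of the way
# towards the student after a single update.
teacher = [0.0, 1.0]
student = [1.0, 0.0]
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)  # approximately [0.1, 0.9]
```

Because the teacher is a slowly moving average rather than a separately trained network, it provides more stable distillation targets for the student during training.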
