Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model

Paper title

Less Is More: Linear Layers on CLIP Features as Powerful VizWiz Model

Authors

Fabian Deuser, Konrad Habel, Philipp J. Rösch, Norbert Oswald

Abstract

Current architectures for multi-modality tasks such as visual question answering suffer from high complexity. As a result, these architectures are difficult to train and require high computational resources. To address these problems, we present a CLIP-based architecture that does not require any fine-tuning of the feature extractors. A simple linear classifier is used on the concatenated features of the image and text encoders. During training, an auxiliary loss is added which operates on the answer types. The resulting classification is then used as an attention gate on the answer class selection. On the VizWiz 2022 Visual Question Answering Challenge we achieve 60.15% accuracy on Task 1: Predict Answer to a Visual Question and an AP score of 83.78% on Task 2: Predict Answerability of a Visual Question.
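The pipeline described in the abstract can be illustrated with a small forward-pass sketch. This is not the authors' implementation: the feature sizes, the answer vocabulary size, the four answer types, the sigmoid form of the gate, and the equal weighting of the two losses are all assumptions made for illustration, and random vectors stand in for the frozen CLIP embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not state any of them.
BATCH, D_IMG, D_TXT = 4, 512, 512
N_ANSWERS, N_TYPES = 1000, 4


def linear(d_in, d_out):
    """Return small random weights and a zero bias for one linear layer."""
    return rng.normal(0.0, 0.02, size=(d_in, d_out)), np.zeros(d_out)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax_cross_entropy(logits, targets):
    """Mean cross-entropy of integer class targets given unnormalised logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


# Stand-ins for frozen CLIP image/text embeddings (random here, precomputed in practice).
image_feat = rng.normal(size=(BATCH, D_IMG))
text_feat = rng.normal(size=(BATCH, D_TXT))

# The only trainable parameters in this sketch are three linear layers.
W_ans, b_ans = linear(D_IMG + D_TXT, N_ANSWERS)   # answer classifier
W_type, b_type = linear(D_IMG + D_TXT, N_TYPES)   # auxiliary answer-type classifier
W_gate, b_gate = linear(N_TYPES, N_ANSWERS)       # assumed projection of types onto answers


def forward(image_feat, text_feat):
    # Concatenate the two modality features into one vector per example.
    x = np.concatenate([image_feat, text_feat], axis=1)
    answer_logits = x @ W_ans + b_ans
    type_logits = x @ W_type + b_type
    # Assumed gate: answer-type prediction mapped to a (0, 1) weight per answer class.
    gate = sigmoid(type_logits @ W_gate + b_gate)
    return answer_logits * gate, type_logits, gate


# Random labels, used only to show how the two losses combine.
answers = rng.integers(0, N_ANSWERS, size=BATCH)
answer_types = rng.integers(0, N_TYPES, size=BATCH)

gated_logits, type_logits, gate = forward(image_feat, text_feat)
loss = (softmax_cross_entropy(gated_logits, answers)
        + softmax_cross_entropy(type_logits, answer_types))  # main + auxiliary loss
print(gated_logits.shape, type_logits.shape, float(loss))
```

Because the feature extractors stay frozen, the embeddings could be computed once and cached, leaving only the linear layers to optimise, which is where the reduced training cost claimed in the abstract would come from.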
