Paper Title
Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Paper Authors
Paper Abstract
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in the literature. For instance, we systematically study how varying the similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data from a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work "out of the box" yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-the-art results on downstream tasks without any architectural changes.
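The study design sketched in the abstract amounts to holding the architecture and downstream task fixed while swapping the pretraining corpus. Below is a minimal, hypothetical sketch of that comparison loop; the corpus names and the `pretrain`/`finetune`/`evaluate` helpers are placeholder stubs for illustration, not the paper's actual code or datasets.

```python
# Sketch of the study design: pretrain the same model on different candidate
# corpora, finetune on a fixed downstream task (e.g., VQA v2), and compare
# downstream accuracy. All helpers below are illustrative placeholders.

from dataclasses import dataclass


@dataclass
class Result:
    pretrain_corpus: str
    downstream_task: str
    accuracy: float


def pretrain(corpus: str) -> dict:
    """Placeholder: would pretrain a visio-linguistic encoder on `corpus`."""
    return {"init_from": corpus}


def finetune(checkpoint: dict, task: str) -> dict:
    """Placeholder: would finetune the pretrained checkpoint on `task`."""
    return {**checkpoint, "finetuned_on": task}


def evaluate(model: dict, task: str) -> float:
    """Placeholder: would return downstream accuracy; dummy value here."""
    return 0.0


# Candidate pretraining corpora, ordered roughly by domain distance
# from the downstream task (names are illustrative).
PRETRAIN_CORPORA = [
    "vqa_v2_generated",     # automatically generated, in-domain data
    "conceptual_captions",  # "natural" captions, slightly different domain
    "coco_captions",
]
DOWNSTREAM_TASK = "vqa_v2"

results = []
for corpus in PRETRAIN_CORPORA:
    ckpt = pretrain(corpus)
    model = finetune(ckpt, DOWNSTREAM_TASK)
    acc = evaluate(model, DOWNSTREAM_TASK)
    results.append(Result(corpus, DOWNSTREAM_TASK, acc))

# Rank pretraining corpora by downstream accuracy.
for r in sorted(results, key=lambda r: r.accuracy, reverse=True):
    print(f"{r.pretrain_corpus:>22} -> {r.downstream_task}: {r.accuracy:.2%}")
```

The point of the loop is that only the pretraining corpus varies; any difference in downstream accuracy can then be attributed to the pretraining data rather than to architectural changes.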