Paper Title
Structured Multimodal Attentions for TextVQA
Paper Authors
Paper Abstract
In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network to address mainly the first two issues above. SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Finally, the outputs of these modules are processed by a global-local attentional answering module that, following M4C, iteratively produces an answer by splicing together tokens from both the OCR results and a general vocabulary. Our proposed model outperforms all SoTA models except the pre-training based TAP on the TextVQA dataset and on both tasks of the ST-VQA dataset. Demonstrating strong reasoning ability, it also won first place in the TextVQA Challenge 2020. We extensively test different OCR methods on several reasoning models and investigate the impact of gradually improved OCR performance on the TextVQA benchmark. With better OCR results, all models show dramatic improvements in VQA accuracy, but ours benefits most, thanks to its strong textual-visual reasoning ability. To provide an upper bound for our method and a fair testing base for future work, we also release human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not given in the original release. The code and ground-truth OCR annotations for the TextVQA dataset are available at https://github.com/ChenyuGAO-CS/SMA
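As a rough illustration of the question-guided graph attention idea described in the abstract, the following PyTorch sketch attends over a heterogeneous graph whose nodes are detected objects and OCR tokens, restricting attention to the encoded object-object, object-text and text-text edges. This is a minimal sketch under stated assumptions, not the authors' implementation; the class name `GraphAttentionSketch` and all tensor names and dimensions are illustrative.

```python
# Minimal sketch (not the SMA authors' code) of one question-guided graph
# attention step over object + OCR-token nodes with a binary adjacency mask.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionSketch(nn.Module):
    def __init__(self, node_dim: int, question_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, hidden_dim)
        self.ques_proj = nn.Linear(question_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats, question_feat, adj_mask):
        # node_feats:    (B, N, node_dim)  object and OCR node features, concatenated
        # question_feat: (B, question_dim) pooled question embedding
        # adj_mask:      (B, N, N)         1 where an object-object, object-text
        #                                  or text-text edge exists, else 0
        h = self.node_proj(node_feats)                     # (B, N, H)
        q = self.ques_proj(question_feat).unsqueeze(1)     # (B, 1, H)

        # Question-conditioned pairwise scores for every node pair.
        pair = h.unsqueeze(2) * h.unsqueeze(1) * q.unsqueeze(1)   # (B, N, N, H)
        scores = self.att_score(torch.tanh(pair)).squeeze(-1)     # (B, N, N)

        # Mask out non-edges so attention only flows along encoded relationships.
        scores = scores.masked_fill(adj_mask == 0, float("-inf"))
        att = torch.softmax(scores, dim=-1)
        att = torch.nan_to_num(att)  # rows with no edges become all zeros

        # Aggregate neighbour information and fuse with the projected features.
        return F.relu(h + att @ h)                         # (B, N, H)


if __name__ == "__main__":
    B, N_obj, N_ocr, D, Q = 2, 5, 3, 128, 64
    nodes = torch.randn(B, N_obj + N_ocr, D)
    question = torch.randn(B, Q)
    adj = torch.randint(0, 2, (B, N_obj + N_ocr, N_obj + N_ocr)).float()
    out = GraphAttentionSketch(D, Q)(nodes, question, adj)
    print(out.shape)  # torch.Size([2, 8, 256])
```

In the full model, the updated node features would then be passed to the global-local attentional answering module, which iteratively copies tokens from the OCR results or a fixed vocabulary, as done in M4C.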