Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Paper Title

Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Authors

Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu

Abstract

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline usually contains words that are spatially and semantically correlated, which can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that our Wukong-Reader has superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.
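
The abstract names textline-region contrastive learning but does not give its formulation. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style loss between one visual-region feature and one text feature per textline. This is an assumption about the general family of objective, not the paper's actual loss; the function names, the toy feature vectors, and the temperature value 0.07 are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over N (region, text) pairs.

    Assumption: pair i in region_feats matches pair i in text_feats, and
    every other combination in the batch is treated as a negative.
    """
    regions = [l2_normalize(r) for r in region_feats]
    texts = [l2_normalize(t) for t in text_feats]
    # Cosine similarity of every region with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
           for r in regions]
    sim_transposed = [list(col) for col in zip(*sim)]
    region_to_text = cross_entropy_diagonal(sim)
    text_to_region = cross_entropy_diagonal(sim_transposed)
    return 0.5 * (region_to_text + text_to_region)

# Toy features for two textlines (hypothetical values).
region_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(region_feats, aligned_texts)
swapped_loss = contrastive_loss(region_feats, swapped_texts)
```

In this toy setup the loss is small when each region feature points the same way as its own textline's text feature, and large when the pairing is swapped, which is the behavior a fine-grained alignment objective of this kind would reward.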
