Paper Title
Unified Pretraining Framework for Document Understanding
Paper Authors
Paper Abstract
Document intelligence automates the extraction of information from documents and supports many business applications. Recent self-supervised learning methods on large-scale unlabeled document datasets have opened up promising directions towards reducing annotation efforts by training models with self-supervised objectives. However, most of the existing document pretraining methods are still language-dominated. We present UDoc, a new unified pretraining framework for document understanding. UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input. Each input element is composed of words and visual features from a semantic region of the input document image. An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses, encouraging the representation to model sentences, learn similarities, and align modalities. Extensive empirical analysis demonstrates that the pretraining procedure learns better joint representations and leads to improvements in downstream tasks.
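To make the multimodal input concrete, below is a minimal PyTorch sketch of how words and visual features from one semantic region might be fused into a single Transformer input element, and how the three self-supervised losses could be combined into one pretraining objective. All module names, dimensions, and the pooling/weighting choices are illustrative assumptions; the abstract does not specify UDoc's actual architecture or loss weights.

```python
import torch
import torch.nn as nn

class RegionMultimodalEmbedding(nn.Module):
    """Hypothetical sketch: fuse word embeddings and a visual feature
    for one semantic region into a single Transformer input element."""

    def __init__(self, vocab_size=30522, text_dim=768,
                 visual_dim=2048, hidden_dim=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, text_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids, visual_feat):
        # token_ids: (batch, num_tokens) word ids inside the region
        # visual_feat: (batch, visual_dim) feature of the region crop
        text = self.text_proj(self.word_emb(token_ids).mean(dim=1))  # pool words
        vis = self.visual_proj(visual_feat)
        return self.layer_norm(text + vis)  # one embedding per region

def pretraining_loss(mlm_loss, sim_loss, align_loss):
    # Equal-weight sum of the three self-supervised objectives
    # (sentence modeling, similarity learning, modality alignment);
    # the weighting here is an assumption, not taken from the paper.
    return mlm_loss + sim_loss + align_loss

# Example: a batch of 2 regions, each with 4 tokens and a 2048-d visual feature
emb = RegionMultimodalEmbedding()
tokens = torch.randint(0, 30522, (2, 4))
visual = torch.randn(2, 2048)
out = emb(tokens, visual)  # (2, 768), ready to feed a Transformer encoder
```

Under these assumptions, each region contributes one fused embedding, so a sequence of regions (rather than raw tokens alone) forms the Transformer input, which is what lets the three losses operate jointly over text and vision.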