Paper Title

DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding

Paper Authors

Zilong Wang, Mingjie Zhan, Xuebo Liu, Ding Liang

Paper Abstract

Form understanding depends on both textual contents and organizational structure. Although modern OCR performs well, general form understanding remains challenging because forms are commonly used and come in various formats. Table detection and handcrafted features in previous works cannot be applied to all forms because of their format requirements. Therefore, we concentrate on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features. We consider the form structure as a tree-like or graph-like hierarchy of text fragments, where the parent-child relation corresponds to the key-value pairs in forms. We utilize state-of-the-art models and design targeted extraction modules to extract multimodal features from semantic contents, layout information, and visual images. A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation. We also adopt an asymmetric algorithm and negative sampling in our model. We validate our method on two benchmarks, MedForm and FUNSD, and extensive experiments demonstrate the effectiveness of our method.
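The abstract is terse about how the heterogeneous features are combined and how parent-child (key-value) relations are scored. The PyTorch sketch below is one plausible reading of "concatenation plus feature shifting" for fusing semantic, layout, and visual vectors, followed by an asymmetric pair scorer trained with negative sampling; it is a minimal illustration only, not the paper's actual architecture, and all module names, dimensions, and the shift formulation are hypothetical.

```python
import torch
import torch.nn as nn


class HybridFusion(nn.Module):
    """Hypothetical fusion of semantic, layout, and visual features.

    Loosely illustrates "concatenation + feature shifting": layout and
    visual vectors are projected and added as a shift to the semantic
    vector, which is then concatenated with the original and projected.
    All dimensions are made-up placeholders.
    """

    def __init__(self, d_sem=768, d_layout=64, d_vis=256, d_out=512):
        super().__init__()
        self.shift = nn.Linear(d_layout + d_vis, d_sem)  # feature shifting
        self.proj = nn.Linear(2 * d_sem, d_out)          # after concatenation

    def forward(self, sem, layout, vis):
        shifted = sem + self.shift(torch.cat([layout, vis], dim=-1))
        return self.proj(torch.cat([sem, shifted], dim=-1))


class AsymmetricPairScorer(nn.Module):
    """Scores directed parent -> child relations between fragments.

    Asymmetric by construction: score(p, c) != score(c, p) in general,
    since the key-value relation in a form is directional.
    """

    def __init__(self, d=512):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)  # asymmetric bilinear-style map

    def forward(self, parent, child):
        return (self.W(parent) * child).sum(dim=-1)


# Toy usage: score one candidate parent-child pair against randomly
# chosen negatives for the same parent, mimicking negative sampling.
fusion, scorer = HybridFusion(), AsymmetricPairScorer()
sem, layout, vis = torch.randn(4, 768), torch.randn(4, 64), torch.randn(4, 256)
frags = fusion(sem, layout, vis)      # fused fragment embeddings, shape (4, 512)
pos = scorer(frags[0:1], frags[1:2])  # score for the candidate key-value pair
neg = scorer(frags[0:1], frags[2:])   # scores for sampled negative children
loss = -torch.log_softmax(torch.cat([pos, neg]), dim=0)[0]
```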
