Paper Title
VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach
Paper Authors
Paper Abstract
We introduce a novel approach for scanned document representation to perform field extraction. It allows the simultaneous encoding of the textual, visual and layout information in a 3-axis tensor used as an input to a segmentation model. We improve the recent Chargrid and WordGrid \cite{chargrid} models in several ways: first by taking into account the visual modality, then by boosting their robustness on small datasets while keeping the inference time low. Our approach is tested on public and private document-image datasets, showing higher performance compared to recent state-of-the-art methods.
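To make the abstract's core idea concrete, here is a minimal sketch of how a grid-style multimodal tensor could be assembled: each OCR word's embedding is "painted" into its bounding-box region (so layout is implicit in placement), and the image's RGB channels are concatenated along the channel axis. The function name, embedding size, and fusion scheme are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

EMB_DIM = 4  # toy embedding size (assumption, not from the paper)

def build_visual_word_grid(image, words):
    """Sketch of a 3-axis multimodal encoding.

    image: (H, W, 3) float array, the scanned page (or a downscaled copy).
    words: list of (bbox, embedding) pairs from OCR,
           bbox = (x0, y0, x1, y1) in pixel coordinates.
    Returns an (H, W, EMB_DIM + 3) tensor: text+layout channels
    followed by visual (RGB) channels.
    """
    h, w, _ = image.shape
    text_grid = np.zeros((h, w, EMB_DIM), dtype=np.float32)
    for (x0, y0, x1, y1), emb in words:
        # layout information is encoded implicitly by where we paint
        text_grid[y0:y1, x0:x1, :] = emb
    # fuse modalities along the channel axis
    return np.concatenate([text_grid, image], axis=-1)

# toy usage: one word on an 8x8 white page
img = np.ones((8, 8, 3), dtype=np.float32)
words = [((1, 1, 4, 3), np.arange(EMB_DIM, dtype=np.float32))]
grid = build_visual_word_grid(img, words)
print(grid.shape)  # (8, 8, 7)
```

The resulting tensor can then be fed to a standard semantic-segmentation network, which predicts a field label per pixel; per-word predictions are recovered by aggregating pixels inside each word's box.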