Handling big tabular data of ICT supply chains: a multi-task, machine-interpretable approach

Paper title

Handling big tabular data of ICT supply chains: a multi-task, machine-interpretable approach

Authors

Xiao, Bin; Simsek, Murat; Kantarci, Burak; Alkheir, Ala Abu

Abstract

Due to the characteristics of Information and Communications Technology (ICT) products, the critical information of ICT devices is often summarized in big tabular data shared across supply chains. Given the surging volume of electronic assets, it is therefore critical to interpret tabular structures automatically. To transform the tabular data in electronic documents into a machine-interpretable format and provide layout and semantic information for information extraction and interpretation, we define a Table Structure Recognition (TSR) task and a Table Cell Type Classification (CTC) task. We use a graph to represent complex table structures for the TSR task. For the CTC task, table cells are categorized into three groups based on their functional roles: Header, Attribute, and Data. We then propose a multi-task model that solves the two tasks simultaneously using text-modality and image-modality features. Our experimental results show that the proposed method can outperform state-of-the-art methods on the ICDAR2013 and UNLV datasets.
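
To make the two task definitions concrete, the following is a minimal sketch of how a table might be held as a graph with typed cells. Only the three cell types (Header, Attribute, Data) and the idea of a graph representation come from the abstract. The `Cell` fields, the `build_table_graph` helper, the "same_row" and "same_col" edge labels, and the sample cell contents are all assumptions made for illustration, and the paper's actual graph definition may differ.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import combinations


class CellType(Enum):
    # The three functional roles named in the abstract (CTC task).
    HEADER = "Header"
    ATTRIBUTE = "Attribute"
    DATA = "Data"


@dataclass(frozen=True)
class Cell:
    # Hypothetical node layout: a cell may span several rows or columns.
    cell_id: int
    text: str
    rows: frozenset
    cols: frozenset
    cell_type: CellType


def build_table_graph(cells):
    """Return edges (id_a, id_b, relation) for cells sharing a row or column.

    Assumption: adjacency is defined by shared row/column indices.
    """
    edges = []
    for a, b in combinations(cells, 2):
        if a.rows & b.rows:
            edges.append((a.cell_id, b.cell_id, "same_row"))
        if a.cols & b.cols:
            edges.append((a.cell_id, b.cell_id, "same_col"))
    return edges


# Made-up 2x2 table: one header row, one attribute cell, one data cell.
cells = [
    Cell(0, "Part", frozenset({0}), frozenset({0}), CellType.HEADER),
    Cell(1, "Voltage (V)", frozenset({0}), frozenset({1}), CellType.HEADER),
    Cell(2, "Regulator A", frozenset({1}), frozenset({0}), CellType.ATTRIBUTE),
    Cell(3, "3.3", frozenset({1}), frozenset({1}), CellType.DATA),
]
edges = build_table_graph(cells)
```

In this sketch the edges carry the layout information (the TSR output) while the `cell_type` labels carry the semantic roles (the CTC output), which is one way to read the abstract's "layout and semantic information".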
