A Large Dataset of Historical Japanese Documents with Complex Layouts

Paper Title

A Large Dataset of Historical Japanese Documents with Complex Layouts

Authors

Zejiang Shen, Kaixuan Zhang, Melissa Dell

Abstract

Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exists for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts. A semi-rule-based method is developed to extract the layout elements, and the results are checked by human inspectors. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models. We also demonstrate the usefulness of the dataset on real-world document digitization tasks. The dataset is available at https://dell-research-harvard.github.io/HJDataset/.
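
To make the annotation structure described in the abstract more concrete, the following is a minimal sketch of how a layout element with a bounding box, a parent link (hierarchy), and a reading order might be represented and traversed. The class `LayoutElement`, its field names, the category labels, and the coordinates are all hypothetical and chosen for illustration only; they are not taken from the dataset's actual schema or file format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LayoutElement:
    """Hypothetical record for one layout annotation (not the dataset's real schema)."""
    element_id: int
    category: str                             # illustrative label, e.g. "page" or "row"
    bbox: Tuple[float, float, float, float]   # assumed (x, y, width, height) in pixels
    parent_id: Optional[int] = None           # None marks a top-level element
    reading_order: int = 0                    # position among siblings


def children_in_reading_order(
    elements: List[LayoutElement], parent_id: Optional[int]
) -> List[LayoutElement]:
    """Return the direct children of parent_id, sorted by reading order."""
    siblings = [e for e in elements if e.parent_id == parent_id]
    return sorted(siblings, key=lambda e: e.reading_order)


def flatten(
    elements: List[LayoutElement], parent_id: Optional[int] = None
) -> List[LayoutElement]:
    """Depth-first walk of the hierarchy that respects sibling reading order."""
    ordered: List[LayoutElement] = []
    for element in children_in_reading_order(elements, parent_id):
        ordered.append(element)
        ordered.extend(flatten(elements, element.element_id))
    return ordered


# Made-up example: one page with two vertical rows read right to left,
# and two regions inside the first row.
elements = [
    LayoutElement(1, "page", (0, 0, 1000, 1400)),
    LayoutElement(2, "row", (700, 50, 250, 1300), parent_id=1, reading_order=0),
    LayoutElement(3, "row", (400, 50, 250, 1300), parent_id=1, reading_order=1),
    LayoutElement(4, "text", (700, 300, 250, 1050), parent_id=2, reading_order=1),
    LayoutElement(5, "title", (700, 50, 250, 250), parent_id=2, reading_order=0),
]

order = [e.element_id for e in flatten(elements)]
print(order)  # [1, 2, 5, 4, 3]
```

Storing the reading order per sibling group, rather than as one global index, keeps the order meaningful when only part of the hierarchy is loaded; this is a design choice of the sketch, not a claim about how the dataset encodes it.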
