OCR Post Correction for Endangered Language Texts

Paper title

OCR Post Correction for Endangered Language Texts

Authors

Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig

Abstract

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
