OCR Synthetic Benchmark Dataset for Indic Languages

Paper title

OCR Synthetic Benchmark Dataset for Indic Languages

Authors

Saini, Naresh; Pinto, Promodh; Bheemaraj, Aravinth; Kumar, Deepak; Daga, Dhiraj; Yadav, Saurabh; Nagaraj, Srihari

Abstract

We present the largest publicly available synthetic OCR benchmark dataset for Indic languages. The collection contains a total of 90k images and their ground truth for 23 Indic languages. OCR model validation in Indic languages requires a large amount of diverse data in order to create a robust and reliable model. Generating such a large amount of data would otherwise be difficult, but with synthetic data it becomes far easier. This can be of great importance to fields like Computer Vision or Image Processing, where model creation becomes easier once an initial synthetic dataset has been developed. Synthetic data generation also offers the flexibility to adjust the nature and environment of the data as and when required in order to improve the performance of the model. Accurate labels for real-time data are sometimes quite expensive to obtain, while accurate ground truth for synthetic data can be achieved easily.
