通过从分解现实世界中学习实现的场景文本综合引擎

论文标题

通过从分解现实世界中学习实现的场景文本综合引擎

A Scene-Text Synthesis Engine Achieved Through Learning from Decomposed Real-World Data

论文作者

Tang, Zhengmi, Miyazaki, Tomo, Omachi, Shinichiro

论文摘要

场景文本图像综合技术旨在自然地在背景场景上撰写文本实例，这对于训练深层神经网络非常吸引人，因为它们能够提供准确，全面的注释信息。先前的研究已经探索了使用从现实世界观测得出的规则上在二维和三维表面上生成合成的文本图像。其中一些研究提出了通过学习生成场景文本图像的。但是，由于没有合适的培训数据集，已经探索了无监督的框架，以从现有的现实世界数据中学习，这可能不会产生可靠的性能。为了缓解这一难题并促进基于学习的场景文本综合的研究，我们介绍了Depompst，这是一个由一些公共基准制备的现实世界数据集，其中包含三种类型的注释：四边形的Boxes，Streoke级别级别的文本掩码和文本图像。利用Depompst数据集，我们提出了一个基于学习的文本合成引擎（LBT），其中包括文本位置建议网络（TLPNET）和文本外观适应网络（TAANET）。 TLPNET首先预测了文本嵌入的合适区域，然后Taanet自适应地调整了文本实例的几何形状和颜色以匹配背景上下文。训练后，可以集成并利用这些网络来生成用于场景文本分析任务的合成数据集。进行了全面的实验，以验证所提出的LBT以及现有方法的有效性，并且实验结果表明，提出的LBT可以为场景文本检测器提供更好的预处理数据。

Scene-text image synthesis techniques that aim to naturally compose text instances on background scene images are very appealing for training deep neural networks due to their ability to provide accurate and comprehensive annotation information. Prior studies have explored generating synthetic text images on two-dimensional and three-dimensional surfaces using rules derived from real-world observations. Some of these studies have proposed generating scene-text images through learning; however, owing to the absence of a suitable training dataset, unsupervised frameworks have been explored to learn from existing real-world data, which might not yield reliable performance. To ease this dilemma and facilitate research on learning-based scene text synthesis, we introduce DecompST, a real-world dataset prepared from some public benchmarks, containing three types of annotations: quadrilateral-level BBoxes, stroke-level text masks, and text-erased images. Leveraging the DecompST dataset, we propose a Learning-Based Text Synthesis engine (LBTS) that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts the suitable regions for text embedding, after which TAANet adaptively adjusts the geometry and color of the text instance to match the background context. After training, those networks can be integrated and utilized to generate the synthetic dataset for scene text analysis tasks. Comprehensive experiments were conducted to validate the effectiveness of the proposed LBTS along with existing methods, and the experimental results indicate the proposed LBTS can generate better pretraining data for scene text detectors.

下载PDF全文

下载文献需遵守相关版权规定

论文标题