On Vocabulary Reliance in Scene Text Recognition

Paper Title

On Vocabulary Reliance in Scene Text Recognition

Authors

Zhaoyi Wan, Jielei Zhang, Liang Zhang, Jiebo Luo, Cong Yao

Abstract

The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact: state-of-the-art methods perform well on images with words within the vocabulary but generalize poorly to images with words outside the vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework to conduct an in-depth study of the problem of vocabulary reliance in scene text recognition. Key findings include: (1) vocabulary reliance is ubiquitous, i.e., all existing algorithms exhibit this characteristic to a greater or lesser degree; (2) attention-based decoders prove weak in generalizing to words outside the vocabulary, while segmentation-based decoders perform well in utilizing visual features; (3) context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy that allows models of the two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and improves overall scene text recognition performance.
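To make "vocabulary reliance" concrete, the gap described in the abstract can be illustrated by scoring a recognizer separately on in-vocabulary and out-of-vocabulary test words. The sketch below is only an illustration under stated assumptions: the function name `accuracy_by_vocabulary`, the case-insensitive exact-match metric, and the toy data are hypothetical and are not taken from the paper, whose actual evaluation protocol is not given in the abstract.

```python
from typing import Dict, Iterable, Set, Tuple


def accuracy_by_vocabulary(
    samples: Iterable[Tuple[str, str]],
    vocabulary: Set[str],
) -> Dict[str, float]:
    """Word accuracy split by whether the ground truth is in the vocabulary.

    `samples` holds (ground_truth, prediction) pairs. `vocabulary` is assumed
    to contain lowercase words seen during training (a hypothetical setup).
    """
    hits = {"in_vocab": 0, "out_vocab": 0}
    totals = {"in_vocab": 0, "out_vocab": 0}
    for truth, prediction in samples:
        key = "in_vocab" if truth.lower() in vocabulary else "out_vocab"
        totals[key] += 1
        # Case-insensitive exact match counts as a correct word.
        hits[key] += int(prediction.lower() == truth.lower())

    accuracy = {
        key: (hits[key] / totals[key] if totals[key] else float("nan"))
        for key in totals
    }
    # A large positive gap suggests the model leans on its vocabulary.
    accuracy["gap"] = accuracy["in_vocab"] - accuracy["out_vocab"]
    return accuracy


if __name__ == "__main__":
    toy_vocabulary = {"hello", "world"}
    toy_samples = [
        ("hello", "hello"),
        ("world", "world"),
        ("xqzt", "xqat"),  # out-of-vocabulary word, misread
        ("brmp", "brmp"),  # out-of-vocabulary word, read correctly
    ]
    print(accuracy_by_vocabulary(toy_samples, toy_vocabulary))
```

Reporting the two accuracies side by side, rather than a single pooled number, is what exposes the effect: a pooled benchmark score can look strong even when out-of-vocabulary accuracy is much lower.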
