A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts

Paper title

A Few Shot Multi-Representation Approach for N-gram Spotting in Historical Manuscripts

Paper authors

Giuseppe De Gregorio, Sanket Biswas, Mohamed Ali Souibgui, Asma Bensalah, Josep Lladós, Alicia Fornés, Angelo Marcelli

Paper abstract

Despite recent advances in automatic text recognition, performance on historical manuscripts remains moderate, mainly because labelled data is too scarce to train data-hungry Handwritten Text Recognition (HTR) models. Keyword Spotting (KWS) systems offer a valid alternative to HTR because of their lower error rates, but they are usually limited to a closed reference vocabulary. In this paper, we propose a few-shot learning paradigm for spotting sequences of a few characters (N-grams) that requires only a small amount of labelled training data. We show that recognizing important N-grams can reduce the system's dependency on vocabulary: an out-of-vocabulary (OOV) word in an input handwritten line image can be treated as a sequence of N-grams that belong to the lexicon. An extensive experimental evaluation of the proposed multi-representation approach on a subset of Bentham's historical manuscript collections yielded promising results in this direction.
