Paper Title
Adaptive Name Entity Recognition under Highly Unbalanced Data
Paper Authors
Paper Abstract
For several purposes in Natural Language Processing (NLP), such as Information Extraction, Sentiment Analysis or Chatbots, Named Entity Recognition (NER) plays an important role, as it helps to detect and categorize entities in text into predefined groups such as the names of persons, locations, quantities, organizations or percentages. In this report, we present our experiments on a neural architecture composed of a Conditional Random Field (CRF) layer stacked on top of a Bi-directional LSTM (Bi-LSTM) layer for solving NER tasks. In addition, we employ a fused input of embedding vectors (GloVe, BERT), pre-trained on huge corpora, to boost the generalization capacity of the model. Unfortunately, due to the heavily unbalanced distribution across the training data, both approaches attain poor performance on classes with fewer training samples. To overcome this challenge, we introduce an add-on classification model that splits sentences into two different sets, Weak and Strong classes, and then design a pair of Bi-LSTM-CRF models to optimize performance on each set separately. We evaluated our models on the test set and found that our method can significantly improve performance for the Weak classes by using a very small data set (approximately 0.45\% compared to the remaining classes).
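A minimal sketch of the kind of Bi-LSTM-CRF tagger with a fused GloVe+BERT input that the abstract describes. This is not the authors' implementation: it assumes PyTorch plus the third-party pytorch-crf package, and the class name, dimensions, and tag count are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # assumed third-party CRF layer (pytorch-crf)


class BiLSTMCRFTagger(nn.Module):
    """Illustrative Bi-LSTM-CRF tagger over fused GloVe+BERT token vectors."""

    def __init__(self, glove_dim=300, bert_dim=768, hidden_dim=256, num_tags=17):
        super().__init__()
        # Fused input: per-token GloVe and BERT vectors are concatenated.
        self.lstm = nn.LSTM(glove_dim + bert_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, glove_vecs, bert_vecs):
        # glove_vecs: (batch, seq, glove_dim); bert_vecs: (batch, seq, bert_dim)
        x = torch.cat([glove_vecs, bert_vecs], dim=-1)
        h, _ = self.lstm(x)
        return self.emissions(h)

    def loss(self, glove_vecs, bert_vecs, tags, mask):
        # tags: (batch, seq) LongTensor; mask: (batch, seq) BoolTensor
        # CRF returns the log-likelihood; negate it for a training loss.
        return -self.crf(self._emissions(glove_vecs, bert_vecs), tags, mask=mask)

    def decode(self, glove_vecs, bert_vecs, mask):
        # Viterbi decoding: returns a list of tag-index sequences per sentence.
        return self.crf.decode(self._emissions(glove_vecs, bert_vecs), mask=mask)
```

Under the paper's add-on classification scheme, one could route each sentence to either a "Weak-class" or a "Strong-class" instance of such a model and train the two instances on their respective subsets.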