Paper Title
Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger
Paper Authors
Paper Abstract
Research on Indonesian named entity (NE) taggers has been conducted for years. However, most of it did not use deep learning and instead employed traditional machine learning algorithms such as association rules, support vector machines, random forests, and naïve Bayes. In those studies, word lists such as gazetteers or clue words were provided to improve accuracy. Here, we attempt to employ deep learning in our Indonesian NE tagger. We use long short-term memory (LSTM) as the topology since it is the state of the art for NE tagging. By using LSTM, we do not need a word list to improve accuracy. We investigate two main things. The first is the output layer of the network: softmax vs. conditional random field (CRF). The second is the use of part-of-speech (POS) tag embedding in the input layer. Using 8400 sentences as training data and 97 sentences as evaluation data, we find that using POS tag embedding as additional input improves the performance of our Indonesian NE tagger. As for the comparison between softmax and CRF, we find that both architectures have a weakness in classifying an NE tag.
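To make the two design choices in the abstract concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes the POS tag embedding is concatenated with the word embedding (the abstract only says it is an additional input), and it contrasts softmax-style per-token decoding with CRF-style Viterbi decoding over tag transition scores. The tag set, the score values, and the names `token_input`, `greedy_decode`, and `viterbi_decode` are all hypothetical and chosen for illustration.

```python
# Hypothetical tag set; the paper's actual NE tag inventory is not given here.
TAGS = ["O", "B-PER", "I-PER"]


def token_input(word_vec, pos_vec):
    """Assumed input scheme: concatenate word and POS tag embeddings."""
    return list(word_vec) + list(pos_vec)


def greedy_decode(emissions):
    """Softmax-style decoding: choose the best tag for each token independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in emissions]


def viterbi_decode(emissions, transitions):
    """CRF-style decoding: best tag sequence under emission + transition scores.

    transitions[i][j] is the score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this position.
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + row[j])
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(range(n_tags), key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up per-token tag scores, standing in for Bi-LSTM outputs on two tokens.
emissions = [
    [2.0, 1.5, 0.0],  # token 1: slightly prefers O
    [1.0, 0.0, 1.8],  # token 2: prefers I-PER
]
# Made-up transition scores: O -> I-PER is heavily penalized,
# B-PER -> I-PER is mildly rewarded.
transitions = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, 0.0, 1.0],    # from B-PER
    [0.0, 0.0, 0.0],    # from I-PER
]

greedy_tags = [TAGS[i] for i in greedy_decode(emissions)]
viterbi_tags = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
combined = token_input([0.1, 0.2, 0.3], [0.9, 0.8])
```

With these made-up scores, per-token decoding yields `O, I-PER`, which is not a well-formed tag sequence, while Viterbi decoding yields `B-PER, I-PER` because the transition scores penalize entering `I-PER` directly from `O`. This only illustrates the general difference between the two output layers; it says nothing about the specific weaknesses the paper reports for each.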