Paper Title

Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis

Paper Authors

Hang Jiang, Yining Hua, Doug Beeferman, Deb Roy

Paper Abstract

Social media data such as Twitter messages ("tweets") pose a particular challenge to NLP systems because of their short, noisy, and colloquial nature. Tasks such as Named Entity Recognition (NER) and syntactic parsing require highly domain-matched training data for good performance. To date, there is no complete training corpus for both NER and syntactic analysis (e.g., part of speech tagging, dependency parsing) of tweets. While there are some publicly available annotated NLP datasets of tweets, they are only designed for individual tasks. In this study, we aim to create Tweebank-NER, an English NER corpus based on Tweebank V2 (TB2), train state-of-the-art (SOTA) Tweet NLP models on TB2, and release an NLP pipeline called Twitter-Stanza. We annotate named entities in TB2 using Amazon Mechanical Turk and measure the quality of our annotations. We train the Stanza pipeline on TB2 and compare with alternative NLP frameworks (e.g., FLAIR, spaCy) and transformer-based models. The Stanza tokenizer and lemmatizer achieve SOTA performance on TB2, while the Stanza NER tagger, part-of-speech (POS) tagger, and dependency parser achieve competitive performance against non-transformer models. The transformer-based models establish a strong baseline in Tweebank-NER and achieve the new SOTA performance in POS tagging and dependency parsing on TB2. We release the dataset and make both the Stanza pipeline and BERTweet-based models available "off-the-shelf" for use in future Tweet NLP research. Our source code, data, and pre-trained models are available at: \url{https://github.com/social-machines/TweebankNLP}.
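For readers who want to try the released Twitter-Stanza models, the sketch below shows one plausible way to load custom-trained checkpoints through the standard stanza API. The checkpoint file names are hypothetical placeholders; the actual paths and any required pretrain files are documented in the TweebankNLP repository.

# Minimal sketch: loading Tweebank-trained models via the stanza API.
# The *_model_path values are assumed placeholders, not the repo's real paths;
# custom POS/depparse models may also need matching *_pretrain_path arguments.
import stanza

nlp = stanza.Pipeline(
    lang="en",
    processors="tokenize,lemma,pos,depparse,ner",
    tokenize_model_path="saved_models/tokenize/en_tweet_tokenizer.pt",  # assumed
    lemma_model_path="saved_models/lemma/en_tweet_lemmatizer.pt",       # assumed
    pos_model_path="saved_models/pos/en_tweet_tagger.pt",               # assumed
    depparse_model_path="saved_models/depparse/en_tweet_parser.pt",     # assumed
    ner_model_path="saved_models/ner/en_tweet_nertagger.pt",            # assumed
)

doc = nlp("RT @user: Tweebank V2 makes parsing tweets way easier!")
for sentence in doc.sentences:
    for token in sentence.tokens:
        print(token.text, token.ner)  # NER tags (BIO scheme)
    for word in sentence.words:
        print(word.text, word.upos, word.head, word.deprel)  # POS + dependency arcs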
