How to tackle an emerging topic? Combining strong and weak labels for Covid news NER

Paper title

How to tackle an emerging topic? Combining strong and weak labels for Covid news NER

Authors

Aleksander Ficek, Fangyu Liu, Nigel Collier

Abstract

Being able to train Named Entity Recognition (NER) models for emerging topics is crucial for many real-world applications, especially in the medical domain, where new topics continuously evolve beyond the scope of existing models and datasets. For a realistic evaluation setup, we introduce a novel COVID-19 news NER dataset (COVIDNEWS-NER) and release 3000 hand-annotated, strongly labelled sentences and 13000 auto-generated, weakly labelled sentences. Besides the dataset, we propose CONTROSTER, a recipe that strategically combines weak and strong labels to improve NER in an emerging topic through transfer learning. We show the effectiveness of CONTROSTER on COVIDNEWS-NER while providing analysis on combining weak and strong labels for training. Our key findings are: (1) Using weak data to formulate an initial backbone before tuning on strong data outperforms methods trained on only strong or weak data. (2) A combination of out-of-domain and in-domain weak label training is crucial and can overcome saturation when training on weak labels from a single source.
