Paper Title
How to tackle an emerging topic? Combining strong and weak labels for Covid news NER
Authors
Abstract
Being able to train Named Entity Recognition (NER) models for emerging topics is crucial for many real-world applications, especially in the medical domain, where new topics continuously evolve beyond the scope of existing models and datasets. For a realistic evaluation setup, we introduce a novel COVID-19 news NER dataset (COVIDNEWS-NER) and release 3000 hand-annotated, strongly labelled sentences and 13000 auto-generated, weakly labelled sentences. Besides the dataset, we propose CONTROSTER, a recipe that strategically combines weak and strong labels to improve NER on an emerging topic through transfer learning. We show the effectiveness of CONTROSTER on COVIDNEWS-NER and analyse how weak and strong labels can be combined for training. Our key findings are: (1) Using weak data to build an initial backbone before tuning on strong data outperforms methods trained on only strong or only weak data. (2) Combining out-of-domain and in-domain weak label training is crucial and can overcome the saturation seen when training on weak labels from a single source.
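The staged idea in the key findings (weak labels first, strong labels last) can be illustrated with a small toy sketch. This is not the authors' implementation: a per-token perceptron stands in for a pretrained NER model, the label set (`O`, `DISEASE`, `ORG`) and the example sentences are hypothetical, and the order of the two weak stages (out-of-domain before in-domain) is an assumption, since the abstract only states that weak training precedes tuning on strong data.

```python
from collections import defaultdict


class TokenPerceptron:
    """Toy tagger with one weight per (lowercased token, label) pair."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)

    def predict_token(self, token):
        key = token.lower()
        # Ties resolve to the earliest label in self.labels.
        return max(self.labels, key=lambda lab: self.weights[(key, lab)])

    def predict(self, tokens):
        return [self.predict_token(t) for t in tokens]

    def fit(self, sentences, epochs=3):
        # Standard perceptron update: reward the gold label, penalise the wrong guess.
        for _ in range(epochs):
            for tokens, tags in sentences:
                for token, gold in zip(tokens, tags):
                    guess = self.predict_token(token)
                    if guess != gold:
                        key = token.lower()
                        self.weights[(key, gold)] += 1.0
                        self.weights[(key, guess)] -= 1.0
        return self


def train_in_stages(model, stages):
    """Continue training the same model on each dataset in the given order."""
    for dataset in stages:
        model.fit(dataset)
    return model


LABELS = ["O", "DISEASE", "ORG"]  # hypothetical label set

# Hypothetical data. The in-domain weak labeller misses "Covid" on purpose.
out_of_domain_weak = [(["WHO", "issued", "a", "statement"], ["ORG", "O", "O", "O"])]
in_domain_weak = [(["WHO", "tracks", "Covid"], ["ORG", "O", "O"])]
strong = [(["Covid", "cases", "rise"], ["DISEASE", "O", "O"])]

query = ["WHO", "tracks", "Covid"]

# Weak stages first (assumed order), strong data last.
staged_model = train_in_stages(
    TokenPerceptron(LABELS), [out_of_domain_weak, in_domain_weak, strong]
)
staged_pred = staged_model.predict(query)

# Baseline trained on the strong data only.
strong_only_pred = TokenPerceptron(LABELS).fit(strong).predict(query)

print("staged:     ", staged_pred)
print("strong only:", strong_only_pred)
```

In this toy setting the weak stages supply coverage (the `ORG` tag for "WHO") that the small strong set lacks, while the final strong stage corrects the label the weak annotator missed. A real setup would replace the perceptron with a pretrained token classification model and continue fine-tuning the same weights across stages.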