Paper title
Snippext: Semi-supervised Opinion Mining with Augmented Data
Paper authors
Paper abstract
Online services are interested in solutions to opinion mining, which is the problem of extracting aspects, opinions, and sentiments from text. One method to mine opinions is to leverage the recent success of pre-trained language models, which can be fine-tuned to obtain high-quality extractions from reviews. However, fine-tuning language models still requires a non-trivial amount of training data. In this paper, we study the problem of how to significantly reduce the amount of labeled training data required in fine-tuning language models for opinion mining. We describe Snippext, an opinion mining system developed over a language model that is fine-tuned through semi-supervised learning with augmented data. A novelty of Snippext is its clever use of a two-prong approach to achieve state-of-the-art (SOTA) performance with little labeled training data through: (1) data augmentation to automatically generate more labeled training data from existing examples, and (2) a semi-supervised learning technique to leverage the massive amount of unlabeled data in addition to the (limited amount of) labeled data. We show with extensive experiments that Snippext performs comparably to, and can even exceed, previous SOTA results on several opinion mining tasks with only half the training data required. Furthermore, it achieves new SOTA results when all training data are leveraged. Compared with a baseline pipeline, we found that Snippext extracts significantly more fine-grained opinions, which open new opportunities for downstream applications.
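The two prongs named in the abstract can be illustrated with a minimal Python sketch. It is not Snippext's actual method, which the abstract does not specify: it assumes a generic synonym-replacement augmentation for BIO-tagged tokens and a generic confidence-threshold pseudo-labeling step for unlabeled text. The names `SYNONYMS`, `augment`, `pseudo_label`, and the tag set (`B-AS`, `B-OP`, `I-OP`) are hypothetical and used only for illustration.

```python
import random

# Hypothetical synonym table (an assumption for this sketch);
# a real system would draw on a much larger resource.
SYNONYMS = {
    "great": ["excellent", "wonderful"],
    "slow": ["sluggish"],
    "very": ["really"],
}


def augment(tokens, tags, rng, p=0.5):
    """Return a new (tokens, tags) pair with some tokens swapped for synonyms.

    Each replacement is one token for one token, so the sequence length is
    unchanged and the tag at every position still applies.
    """
    new_tokens = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            new_tokens.append(rng.choice(options))
        else:
            new_tokens.append(tok)
    return new_tokens, list(tags)


def pseudo_label(unlabeled, predict, threshold=0.9):
    """Keep unlabeled texts whose predicted label meets the confidence threshold.

    `predict` is any callable returning (label, confidence) for a text.
    """
    selected = []
    for text in unlabeled:
        label, confidence = predict(text)
        if confidence >= threshold:
            selected.append((text, label))
    return selected


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["The", "staff", "was", "very", "slow"]
    tags = ["O", "B-AS", "O", "B-OP", "I-OP"]
    print(augment(tokens, tags, rng, p=1.0))
```

In this sketch the augmented pairs would be added to the labeled set, and the pseudo-labeled pairs would be mixed in during fine-tuning; how the two sources are weighted is a design choice left open here.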