Paper Title
Tweet Sentiment Quantification: An Experimental Re-Evaluation
Paper Authors
Paper Abstract
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called ``prevalence'') of sentiment-related classes (such as \textsf{Positive}, \textsf{Neutral}, \textsf{Negative}) in a sample of unlabelled texts. This task is especially important when these texts are tweets, since the final goal of most sentiment classification efforts carried out on Twitter data is actually quantification (and not the classification of individual tweets). It is well-known that solving quantification by means of ``classify and count'' (i.e., by classifying all unlabelled items by means of a standard classifier and counting the items that have been assigned to a given class) is less than optimal in terms of accuracy, and that more accurate quantification methods exist. Gao and Sebastiani (2016) carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work was weak, and that the reliability of the conclusions that were drawn from the results is thus questionable. We now re-evaluate those quantification methods (plus a few more modern ones) on exactly the same datasets, this time following a now consolidated and much more robust experimental protocol (which also involves simulating the presence, in the test data, of class prevalence values very different from those of the training set). This experimental protocol (even without counting the newly added methods) involves a number of experiments 5,775 times larger than that of the original study. The results of our experiments are dramatically different from those obtained by Gao and Sebastiani, and they provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
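To make the abstract's distinction concrete, below is a minimal sketch (not the paper's code) contrasting plain ``classify and count'' (CC) with the standard adjusted-count correction (ACC) on test samples whose class prevalence has been artificially shifted away from the training prevalence, in the spirit of the protocol the abstract describes. The synthetic binary dataset, the logistic-regression classifier, and the sample sizes are illustrative assumptions, not choices made in the paper.

```python
# Minimal sketch: classify-and-count (CC) vs. adjusted classify-and-count (ACC)
# under artificially shifted test-set prevalences. All data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary data standing in for tweet feature vectors (assumption).
X, y = make_classification(n_samples=6000, n_features=50, random_state=0)
X_train, X_pool, y_train, y_pool = train_test_split(X, y, test_size=0.5, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

# True-positive and false-positive rates estimated on held-out data, used by ACC.
val_pred = clf.predict(X_val)
tpr = val_pred[y_val == 1].mean()
fpr = val_pred[y_val == 0].mean()

def sample_at_prevalence(X, y, prevalence, size):
    """Draw a sample whose positive-class prevalence is (approximately) `prevalence`."""
    n_pos = int(round(prevalence * size))
    pos = rng.choice(np.flatnonzero(y == 1), n_pos, replace=True)
    neg = rng.choice(np.flatnonzero(y == 0), size - n_pos, replace=True)
    return X[np.concatenate([pos, neg])]

for p_true in [0.1, 0.3, 0.5, 0.7, 0.9]:
    X_test = sample_at_prevalence(X_pool, y_pool, p_true, size=500)
    p_cc = clf.predict(X_test).mean()                   # classify and count
    p_acc = np.clip((p_cc - fpr) / (tpr - fpr), 0, 1)   # adjusted classify and count
    print(f"true prevalence={p_true:.2f}  CC={p_cc:.3f}  ACC={p_acc:.3f}")
```

Run across the prevalence grid, CC estimates tend to be pulled toward the training prevalence, while the ACC correction (dividing out the classifier's estimated error rates) tracks the true prevalence more closely; this is the kind of behaviour the protocol with simulated prevalence shift is designed to expose.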