Improving Probabilistic Models in Text Classification via Active Learning

Paper Title

Improving Probabilistic Models in Text Classification via Active Learning

Authors

Mitchell Bosley, Saki Kuzushima, Ted Enamorado, Yuki Shiraito

Abstract

Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling effort on documents that are difficult to classify. Our validation study shows that the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions with only a small proportion of the original labeled data used in those studies. We provide activeText, open-source software that implements our method.
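The abstract does not say how "difficult" documents are identified. As a rough illustration only, the sketch below uses predictive entropy, a common uncertainty-sampling criterion in active learning. This choice is an assumption, not a description of activeText. The function names `entropy` and `select_for_labeling` and the probability values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predicted, k):
    """Return indices of the k documents with the highest predictive entropy.

    `predicted` holds one list of class probabilities per unlabeled document.
    Entropy as the difficulty score is an assumption made for illustration.
    """
    ranked = sorted(
        range(len(predicted)),
        key=lambda i: entropy(predicted[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical class probabilities for four unlabeled documents (binary task).
predicted = [
    [0.98, 0.02],  # confidently classified
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],  # moderately confident
    [0.50, 0.50],  # maximally uncertain
]

queried = select_for_labeling(predicted, k=2)
print(queried)  # [3, 1]
```

In a full active learning loop, the queried documents would be sent to human coders, added to the labeled set, and the model refit before the next round of selection.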
