Paper Title

Cold-start Active Learning through Self-supervised Language Modeling

Paper Authors

Michelle Yuan, Hsuan-Tien Lin, Jordan Boyd-Graber

Paper Abstract

Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NLP provides an additional source of information: pre-trained language models. The pre-training loss can find examples that surprise the model and should be labeled for efficient fine-tuning. Therefore, we treat the language modeling loss as a proxy for classification uncertainty. With BERT, we develop a simple strategy based on the masked language modeling loss that minimizes labeling costs for text classification. Compared to other baselines, our approach reaches higher accuracy with fewer sampling iterations and less computation time.
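The abstract describes the core recipe: before any labels exist, use BERT's masked language modeling loss to find the unlabeled examples that most surprise the model, and send those to annotators first. Below is a minimal, hypothetical sketch of that idea, assuming the HuggingFace transformers library; the random-masking scheme, the helper names mlm_loss and select_for_labeling, and the simple top-k selection rule are illustrative assumptions, not the paper's exact sampling procedure.

```python
# Minimal sketch (not the authors' exact algorithm): score unlabeled texts by
# BERT's masked language modeling loss and pick the most "surprising" ones to label.
# Assumes the HuggingFace transformers library; helper names are illustrative.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mlm_loss(text, mask_prob=0.15):
    """Average masked-LM loss of `text`, used as a proxy for classification uncertainty."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    input_ids = enc["input_ids"].clone()
    labels = enc["input_ids"].clone()
    # Randomly mask a fraction of non-special tokens, mirroring BERT pre-training.
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True
        ),
        dtype=torch.bool,
    ).unsqueeze(0)
    mask = (torch.rand(input_ids.shape) < mask_prob) & ~special
    if not mask.any():  # ensure at least one token contributes to the loss
        mask[0, (~special[0]).nonzero()[0]] = True
    labels[~mask] = -100                      # loss is computed only on masked positions
    input_ids[mask] = tokenizer.mask_token_id
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=enc["attention_mask"], labels=labels)
    return out.loss.item()

def select_for_labeling(unlabeled_texts, k=10):
    """Return the k texts whose masked-LM loss is highest, i.e. those that most surprise BERT."""
    return sorted(unlabeled_texts, key=mlm_loss, reverse=True)[:k]
```

In a full active learning loop, the selected texts would then be labeled by annotators and the classifier fine-tuned on the growing labeled set before the next sampling round.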
