Paper Title

Comparing BERT against traditional machine learning text classification

Paper Authors

Santiago González-Carvajal and Eduardo C. Garrido-Merchán

Paper Abstract

In recent years, the BERT model has emerged as a popular state-of-the-art machine learning model able to cope with multiple NLP tasks, such as supervised text classification without human supervision. Its flexibility in handling any type of corpus while delivering strong results has made this approach popular not only in academia but also in industry, although many other approaches have been used successfully over the years. In this work, we first present BERT and include a brief review of classical NLP approaches. Then, through a suite of experiments covering different scenarios, we empirically test the behaviour of BERT against traditional TF-IDF features fed to machine learning algorithms. The purpose of this work is to add empirical evidence supporting or rejecting the use of BERT as a default choice for NLP tasks. The experiments show the superiority of BERT and its independence from features of the NLP problem, such as the language of the text, adding empirical evidence for using BERT as the default technique in NLP problems.
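
As a minimal illustration of the comparison the abstract describes, the sketch below pits a TF-IDF pipeline feeding a classical classifier against fine-tuning a pretrained BERT checkpoint on the same labelled data. This is a hypothetical sketch, not the authors' code: the checkpoint name, the choice of logistic regression as the classical algorithm, and the `train_texts`/`train_labels`/`test_texts`/`test_labels` placeholders are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# --- Classical baseline: TF-IDF vocabulary fed to a machine learning algorithm ---
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)  # hypothetical corpus placeholders
print("TF-IDF + LR accuracy:", baseline.score(test_texts, test_labels))

# --- BERT: fine-tune a pretrained checkpoint on the same labelled data ---
checkpoint = "bert-base-multilingual-cased"  # assumed checkpoint, not necessarily the paper's
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

class TextDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and integer labels for the Trainer API."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-out", num_train_epochs=3),
    train_dataset=TextDataset(train_texts, train_labels),
    eval_dataset=TextDataset(test_texts, test_labels),
)
trainer.train()
print(trainer.evaluate())
```

The multilingual checkpoint is chosen here only because the abstract stresses BERT's independence from the language of the text; any monolingual BERT checkpoint would slot into the same two lines without further changes.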
