Paper Title
An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data
Paper Authors
Paper Abstract
The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a wide range of well-known text representation techniques for the task of text clustering on noisy Twitter data. Our results indicate that the more advanced models do not necessarily work best on tweets and that more exploration in this area is needed.