Paper Title
Self-supervised Document Clustering Based on BERT with Data Augment
Paper Authors
Paper Abstract
Contrastive learning is a promising approach to unsupervised learning, as it inherits the advantages of well-studied deep models without a dedicated and complex model design. In this paper, based on bidirectional encoder representations from transformers, we propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art unsupervised clustering approaches for short texts and those for long texts in terms of several clustering evaluation measures. FCL achieves performance close to supervised learning, and FCL with UDA further improves the performance for short texts.
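The abstract names the core technique: contrastive learning over BERT representations, with augmented views of each document, followed by clustering. The sketch below is a minimal PyTorch / Hugging Face `transformers` illustration of that idea, not the authors' implementation: the mean-pooling, the SimCLR-style NT-Xent loss, the temperature, and the word-dropout augmentation used as a stand-in for the paper's UDA are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool BERT's last hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (B, H)

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss: the two views of each document are
    positives; every other document in the batch is a negative."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)     # (2B, H), unit norm
    sim = z @ z.t() / temperature                          # cosine similarities
    n = sim.size(0)
    sim = sim.masked_fill(torch.eye(n, dtype=torch.bool), float("-inf"))
    b = z1.size(0)                                         # view i pairs with view i + B
    targets = torch.cat([torch.arange(b, n), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

# Toy batch: the second "view" of each text comes from random word dropout,
# a deliberately simple stand-in for the paper's data augmentation.
texts = ["the market rallied after the earnings report",
         "new vaccine trial results were announced today"]
views = [" ".join(w for w in t.split() if torch.rand(1).item() > 0.1)
         for t in texts]
loss = nt_xent_loss(embed(texts), embed(views))
loss.backward()   # fine-tunes BERT so that views of the same text agree
```

A clustering step such as k-means over the fine-tuned embeddings would then yield the document clusters; batch size, temperature, and augmentation strength are the knobs such a setup would tune, and none of the values above are the paper's reported settings.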