Paper Title
Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification
Paper Authors
Paper Abstract
Large-scale multi-label text classification (LMTC) aims to associate a document with its relevant labels from a large candidate set. Most existing LMTC approaches rely on massive human-annotated training data, which are often costly to obtain and suffer from a long-tailed label distribution (i.e., many labels occur only a few times in the training set). In this paper, we study LMTC under the zero-shot setting, which does not require any annotated documents and relies only on label surface names and descriptions. To train a classifier that calculates the similarity score between a document and a label, we propose a novel metadata-induced contrastive learning (MICoL) method. Different from previous text-based contrastive learning techniques, MICoL exploits document metadata (e.g., authors, venues, and references of research papers), which are widely available on the Web, to derive similar document-document pairs. Experimental results on two large-scale datasets show that: (1) MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines; (2) MICoL is on par with the state-of-the-art supervised metadata-aware LMTC method trained on 10K-200K labeled documents; and (3) MICoL tends to predict more infrequent labels than supervised methods, thus alleviating the deteriorated performance on long-tailed labels.
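To make the abstract's pair-construction idea concrete, below is a minimal Python sketch (using PyTorch) of the two ingredients it describes: deriving positive document-document pairs from shared metadata, and training with a contrastive objective. Everything here is an illustrative assumption rather than the paper's actual method: the `docs` records, the `shares_metadata` rule, and the InfoNCE-style loss are stand-ins for MICoL's meta-path definitions and its real training objective.

```python
import itertools
import torch
import torch.nn.functional as F

# Hypothetical document records: each has text plus metadata fields
# (venue and references), mirroring the metadata types the abstract names.
docs = [
    {"id": 0, "text": "Paper A ...", "venue": "WWW", "references": {3, 7}},
    {"id": 1, "text": "Paper B ...", "venue": "WWW", "references": {2, 7}},
    {"id": 2, "text": "Paper C ...", "venue": "KDD", "references": {9}},
]

def shares_metadata(d1, d2):
    """Treat two documents as a positive pair if they appear in the same
    venue or cite a common reference (one possible metadata relation)."""
    return d1["venue"] == d2["venue"] or bool(d1["references"] & d2["references"])

# Derive positive document-document pairs from metadata alone --
# no human-annotated labels are involved.
positive_pairs = [
    (d1["id"], d2["id"])
    for d1, d2 in itertools.combinations(docs, 2)
    if shares_metadata(d1, d2)
]

def info_nce_loss(anchor_emb, pos_emb, neg_embs, temperature=0.07):
    """InfoNCE-style contrastive loss: pull the metadata-induced positive
    pair together and push in-batch negatives apart. `anchor_emb` and
    `pos_emb` are 1-D document embeddings; `neg_embs` is (n_neg, dim)."""
    pos_sim = F.cosine_similarity(anchor_emb, pos_emb, dim=-1) / temperature
    neg_sim = F.cosine_similarity(anchor_emb.unsqueeze(0), neg_embs, dim=-1) / temperature
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])          # (1 + n_neg,)
    target = torch.zeros(1, dtype=torch.long)                    # positive is index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```

At inference time, the same trained encoder would score a document against each label's surface name and description, ranking labels by similarity; the sketch above only covers the self-supervised training signal, since that is the part the metadata makes possible without labeled documents.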