Paper Title
Hierarchical Metadata-Aware Document Categorization under Weak Supervision
Authors
Abstract
Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfying performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are quite expensive where very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata information and textual semantics, and we introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set. Our experiments demonstrate a consistent improvement of HiMeCat over competitive baselines and validate the contribution of our representation learning and data augmentation modules.