Minimally Supervised Categorization of Text with Metadata

Paper title

Minimally Supervised Categorization of Text with Metadata

Authors

Zhang, Yu; Meng, Yu; Huang, Jiaxin; Xu, Frank F.; Wang, Xuan; Han, Jiawei

Abstract

Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags. Such metadata serve as compelling topic indicators and should be leveraged into the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework to categorize text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines.
