Paper title
Automatic Generation of Topic Labels
Paper authors
Abstract
Topic modelling is a popular unsupervised method for identifying the underlying themes in document collections that has many applications in information retrieval. A topic is usually represented by a list of terms ranked by their probability but, since these can be difficult to interpret, various approaches have been developed to assign descriptive labels to topics. Previous work on the automatic assignment of labels to topics has relied on a two-stage approach: (1) candidate labels are retrieved from a large pool (e.g. Wikipedia article titles); and then (2) re-ranked based on their semantic similarity to the topic terms. However, these extractive approaches can only assign candidate labels from a restricted set that may not include any suitable ones. This paper proposes using a sequence-to-sequence neural-based approach to generate labels that does not suffer from this limitation. The model is trained over a new large synthetic dataset created using distant supervision. The method is evaluated by comparing the labels it generates to ones rated by humans.
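The two-stage extractive approach described in the abstract, and the limitation that motivates generating labels instead, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy candidate pool, and the use of word-overlap cosine similarity as a crude stand-in for semantic similarity. It is not the method from the paper or from the earlier work it refers to.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(topic_terms, pool):
    """Stage 1 (hypothetical): keep labels sharing at least one word with the topic."""
    terms = set(topic_terms)
    return [label for label in pool if terms & set(label.lower().split())]


def rerank(topic_terms, candidates):
    """Stage 2 (hypothetical): order candidates by similarity to the topic terms."""
    topic_vec = Counter(topic_terms)
    return sorted(
        candidates,
        key=lambda label: cosine(topic_vec, Counter(label.lower().split())),
        reverse=True,
    )


def label_topic(topic_terms, pool):
    """Return the best label from the pool, or None if no candidate is retrieved."""
    ranked = rerank(topic_terms, retrieve(topic_terms, pool))
    return ranked[0] if ranked else None


# Toy candidate pool standing in for a large set such as Wikipedia article titles.
pool = ["Stock market", "Financial market", "Association football", "Market"]

finance_topic = ["stock", "market", "shares", "trading", "investor"]
football_topic = ["goal", "league", "striker"]

print(label_topic(finance_topic, pool))   # a label is found in the pool
print(label_topic(football_topic, pool))  # no candidate is retrieved
```

In this toy setting the second topic receives no label even though a suitable one exists in the pool, because the overlap-based retrieval never surfaces it. Real extractive systems use much stronger similarity measures, but they remain bound to the candidate set, which is the restriction a generative sequence-to-sequence model avoids.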