Paper Title
Handling Collocations in Hierarchical Latent Tree Analysis for Topic Modeling
Authors
Abstract
Topic modeling has been one of the most active research areas in machine learning in recent years. Hierarchical latent tree analysis (HLTA) was recently proposed for hierarchical topic modeling and has shown superior performance over state-of-the-art methods. However, the models used in HLTA have a tree structure and cannot appropriately represent the different meanings of multiword expressions that share the same word. Therefore, we propose a method for extracting and selecting collocations as a preprocessing step for HLTA. The selected collocations are replaced with single tokens in the bag-of-words model before running HLTA. Our empirical evaluation shows that the proposed method led to better performance of HLTA on three of the four data sets tested.
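The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `merge_collocations`, the underscore joiner, the greedy longest-match strategy, and the example collocations are all assumptions made for illustration, and the extraction and selection of collocations is assumed to have happened already.

```python
from collections import Counter


def merge_collocations(tokens, collocations, joiner="_"):
    """Replace each selected collocation in a token list with one token.

    `collocations` is a set of word tuples (hypothetical input; the
    abstract does not say how the selected collocations are stored).
    Matching is greedy, left to right, longest collocation first.
    """
    max_len = max((len(c) for c in collocations), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible n-gram starting at position i first.
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in collocations:
                merged.append(joiner.join(candidate))
                i += n
                break
        else:
            # No collocation starts here, so keep the single word.
            merged.append(tokens[i])
            i += 1
    return merged


# Two collocations that share the word "network" (illustrative only).
selected = {("neural", "network"), ("social", "network")}
doc = "a neural network is not a social network".split()

tokens = merge_collocations(doc, selected)
bag = Counter(tokens)  # bag-of-words counts after the replacement
```

After the replacement, `neural_network` and `social_network` are distinct vocabulary entries, so a tree-structured model no longer has to attach the shared word `network` to a single place in the hierarchy.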