BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

Paper title

BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

Authors

Qiong Wu, Adam Hare, Sirui Wang, Yuwei Tu, Zhenming Liu, Christopher G. Brinton, Yanhua Li

Abstract

Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of "topic identification" and "text segmentation" for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. First is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. Second is significant noise: a considerable portion of words in any single document will produce only noise and not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called BATS: Biclustering Approach to Topic modeling and Segmentation. BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and award important words to further improve performance. Experiments on four datasets show that our approach outperforms several state-of-the-art baselines when considering topic coherence, topic diversity, segmentation, and runtime comparison metrics.
