Paper Title
Top2Vec: Distributed Representations of Topics
Paper Authors
Paper Abstract
Topic modeling is used for discovering latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. In order to achieve optimal results they often require the number of topics to be known, custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on bag-of-words representations of documents, which ignore the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present $\texttt{top2vec}$, which leverages joint document and word semantic embedding to find $\textit{topic vectors}$. This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that $\texttt{top2vec}$ finds topics that are significantly more informative and representative of the corpus they are trained on than those found by probabilistic generative models.
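As a concrete illustration of the workflow the abstract describes, here is a minimal usage sketch. It assumes the open-source top2vec Python package released alongside the paper (the Top2Vec class and the get_num_topics/get_topics calls come from that package's public API, not from this abstract), and it uses the 20 Newsgroups corpus purely as a stand-in example:

    from sklearn.datasets import fetch_20newsgroups
    from top2vec import Top2Vec

    # Raw documents: no stop-word removal, stemming, or lemmatization needed.
    docs = fetch_20newsgroups(subset="all",
                              remove=("headers", "footers", "quotes")).data

    # Trains joint document/word embeddings, then locates topic vectors
    # in that shared semantic space.
    model = Top2Vec(documents=docs, speed="learn", workers=4)

    # The number of topics is discovered automatically, not specified upfront.
    print(model.get_num_topics())

    # Each topic is described by the words whose vectors lie closest
    # to its topic vector (distance encodes semantic similarity).
    topic_words, word_scores, topic_nums = model.get_topics(5)
    for words, num in zip(topic_words, topic_nums):
        print(num, words[:10])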