Paper title

Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles

Authors

Altuncu, M. Tarik; Yaliraki, Sophia N.; Barahona, Mauricio

Abstract

Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into 'topics' that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent Transformer-based model BERT. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
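The pipeline outlined in the abstract (embed documents as vectors, build a similarity graph, partition it at several resolutions) can be illustrated with a toy sketch. The abstract does not specify the graph construction or the partitioning algorithm, so everything below is an assumption made for illustration: the four sample documents are invented, Bag-of-Words counts stand in for the embeddings, and connected components of a thresholded cosine-similarity graph stand in for multiscale graph partitioning, with the threshold playing the role of resolution. The helper names `bow_vector`, `cosine`, `similarity_graph`, and `partition` are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Bag-of-Words term counts for one document (stand-in embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(docs):
    """Weighted edges {(i, j): similarity} over all document pairs, i < j."""
    vecs = [bow_vector(d) for d in docs]
    n = len(vecs)
    return {(i, j): cosine(vecs[i], vecs[j])
            for i in range(n) for j in range(i + 1, n)}


def partition(n, edges, threshold):
    """Connected components after keeping edges with weight >= threshold.

    Assumption: a simple stand-in for graph partitioning, not the
    method used in the paper.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), weight in edges.items():
        if weight >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented sample documents, for illustration only.
docs = [
    "election candidates debate tax policy",
    "candidates debate election policy on immigration",
    "storm floods coastal towns",
    "coastal towns recover after storm floods",
]
edges = similarity_graph(docs)

# Sweep the threshold from coarse to fine; the number of clusters is
# never fixed in advance.
for threshold in (0.0, 0.5, 0.9):
    print(threshold, partition(len(docs), edges, threshold))
```

On these four documents the sweep yields one cluster at threshold 0.0, two clusters (the election pair and the storm pair) at 0.5, and four singletons at 0.9. A real application would replace the counts with Doc2Vec or BERT embeddings and the thresholding with a proper community-detection method.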
