Paper Title

Cartolabe: A Web-Based Scalable Visualization of Large Document Collections

Authors

Philippe Caillou, Jonas Renault, Jean-Daniel Fekete, Anne-Catherine Letournel, Michèle Sebag

Abstract

We describe CARTOLABE, a web-based multi-scale system for visualizing and exploring large textual corpora based on topics, introducing a novel mechanism for the progressive visualization of filtering queries. Initially designed to represent and navigate through scientific publications in different disciplines, CARTOLABE has evolved to become a generic framework and accommodate various corpora, ranging from Wikipedia (4.5M entries) to the French National Debate (4.3M entries). CARTOLABE is made of two modules: the first relies on Natural Language Processing methods, converting a corpus and its entities (documents, authors, concepts) into high-dimensional vectors, computing their projection on the 2D plane, and extracting meaningful labels for regions of the plane. The second module is a web-based visualization, displaying tiles computed from the multidimensional projection of the corpus using the UMAP projection method. This visualization module aims at enabling users with no expertise in visualization and data analysis to get an overview of their corpus, and to interact with it: exploring, querying, filtering, panning and zooming on regions of semantic interest. Three use cases are discussed to illustrate CARTOLABE's versatility and ability to bring large-scale textual corpus visualization and exploration to a wide audience.
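The abstract mentions that the web module displays tiles computed from the 2D projection. As a rough illustration only, the sketch below groups already-projected points into square tiles per zoom level, in the style of a web-map quadtree. The abstract does not specify CARTOLABE's actual tiling scheme, so the function names (`tile_of`, `build_tiles`), the `(zoom, col, row)` key, and the bounding-box convention are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Bounds = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
TileKey = Tuple[int, int, int]              # (zoom, col, row), a hypothetical key format


def tile_of(point: Point, zoom: int, bounds: Bounds) -> TileKey:
    """Return the tile containing a 2D point, assuming the point lies within bounds."""
    min_x, min_y, max_x, max_y = bounds
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x, y = point
    # Scale into [0, n]; points exactly on the upper edge are clamped into the last tile.
    col = min(int((x - min_x) / (max_x - min_x) * n), n - 1)
    row = min(int((y - min_y) / (max_y - min_y) * n), n - 1)
    return (zoom, col, row)


def build_tiles(points: Sequence[Point], zoom: int, bounds: Bounds) -> Dict[TileKey, List[int]]:
    """Group point indices by the tile they fall into at one zoom level."""
    tiles: Dict[TileKey, List[int]] = defaultdict(list)
    for index, point in enumerate(points):
        tiles[tile_of(point, zoom, bounds)].append(index)
    return dict(tiles)


if __name__ == "__main__":
    # Toy 2D coordinates standing in for projected documents.
    projected = [(0.1, 0.1), (0.9, 0.9), (0.2, 0.1), (1.0, 1.0)]
    print(build_tiles(projected, zoom=1, bounds=(0.0, 0.0, 1.0, 1.0)))
```

In this scheme each zoom level doubles the number of tiles per axis, so a viewer only needs to fetch the tiles that intersect the current viewport when panning or zooming.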
