Return to basics: Clustering of scientific literature using structural information

Paper title

Return to basics: Clustering of scientific literature using structural information

Authors

Jinhyuk Yun, Sejung Ahn, June Young Lee

Abstract

Scholars frequently employ relatedness measures to estimate the similarity between two items (e.g., documents, authors, or institutes). Such relatedness measures are commonly based on overlapping references (i.e., bibliographic coupling) or overlapping citations (i.e., co-citation), and can then be combined with cluster analysis to find boundaries between research fields. Unfortunately, calculating a relatedness measure is challenging, especially for a large number of items, because the computational complexity is greater than linear. We propose an alternative method for identifying research fronts that uses direct citations and is inspired by relatedness measures. Our novel approach simply replicates each node into two distinct nodes: a citing node and a cited node. We then apply typical clustering methods to the modified network. Clusters of citing nodes should emulate those from the bibliographic coupling relatedness network, while clusters of cited nodes should act like those from the co-citation relatedness network. In validation tests, our proposed method demonstrated high levels of similarity with conventional relatedness-based methods. We also found that the clustering results of the proposed method outperformed those of conventional relatedness-based measures in terms of similarity with a natural language processing-based classification.
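The node-duplication step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy citation list, and the use of connected components as the clustering step are all assumptions made for illustration. The abstract only says "typical clustering methods", so any community detection algorithm could take the place of the placeholder used here.

```python
from collections import defaultdict


def split_citation_network(citations):
    """Duplicate every paper into a citing copy and a cited copy.

    `citations` is an iterable of (citing_paper, cited_paper) pairs.
    Each direct citation a -> b becomes an undirected edge between
    ("citing", a) and ("cited", b), so the resulting graph is bipartite.
    """
    adjacency = defaultdict(set)
    for citing, cited in citations:
        u, v = ("citing", citing), ("cited", cited)
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def connected_components(adjacency):
    """Placeholder clustering (an assumption): one cluster per component."""
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)
    return clusters


def papers_by_role(cluster, role):
    """Return the papers in a cluster that appear in the given role."""
    return {paper for kind, paper in cluster if kind == role}


# Hypothetical toy data: A and B share references R1 and R2;
# C and D share reference R3; C also cites B.
citations = [("A", "R1"), ("A", "R2"), ("B", "R1"), ("B", "R2"),
             ("C", "R3"), ("D", "R3"), ("C", "B")]

graph = split_citation_network(citations)
clusters = connected_components(graph)

# Citing copies group like bibliographic coupling (shared references);
# cited copies group like co-citation (shared citing papers).
citing_clusters = sorted(sorted(papers_by_role(c, "citing")) for c in clusters)
cited_clusters = sorted(sorted(papers_by_role(c, "cited")) for c in clusters)
print(citing_clusters)  # [['A', 'B'], ['C', 'D']]
print(cited_clusters)   # [['B', 'R3'], ['R1', 'R2']]
```

In this toy example paper B lands in one cluster as a citing node and in another as a cited node, which shows why the two copies are kept distinct. The split network has exactly one edge per direct citation, so no pairwise relatedness matrix is built.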
