Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs

Paper Title

Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs

Authors

Dörpinghaus, Jens; Stefan, Andreas

Abstract

Knowledge graphs have been shown to play an important role in recent knowledge mining and discovery, for example in the life sciences and bioinformatics. Although much research has been done on query optimization, query transformation, and the storage and retrieval of large scale knowledge graphs, algorithmic optimization remains a major challenge and a vital factor in using graph databases. Few researchers have addressed the problem of optimizing algorithms on large scale labeled property graphs. Here, we present two optimization approaches and compare them with a naive approach of directly querying the graph database. The aim of our work is to determine the limiting factors of graph databases like Neo4j, and we describe a novel solution to tackle these challenges. To this end, we suggest a classification schema that differentiates problems on a graph database by their complexity. We evaluate our optimization approaches on a test system containing a knowledge graph derived from biomedical publication data enriched with text mining data. This dense graph has more than 71M nodes and 850M relationships. The results are very encouraging: depending on the problem, we were able to show speedups by factors between 44 and 3839.
