GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation

Paper title

GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation

Authors

Nikhil Goyal, Harsh Vardhan Jain, Sayan Ranu

Abstract

Graph generative models have been extensively studied in the data mining literature. While traditional techniques are based on generating structures that adhere to a pre-decided distribution, recent techniques have shifted towards learning this distribution directly from the data. While learning-based approaches have imparted significant improvement in quality, some limitations remain to be addressed. First, learning graph distributions introduces additional computational overhead, which limits their scalability to large graph databases. Second, many techniques only learn the structure and do not address the need to also learn node and edge labels, which encode important semantic information and influence the structure itself. Third, existing techniques often incorporate domain-specific rules and lack generalizability. Fourth, the experimentation of existing techniques is not comprehensive enough due to either using weak evaluation metrics or focusing primarily on synthetic or small datasets. In this work, we develop a domain-agnostic technique called GraphGen to overcome all of these limitations. GraphGen converts graphs to sequences using minimum DFS codes. Minimum DFS codes are canonical labels and capture the graph structure precisely along with the label information. The complex joint distributions between structure and semantic labels are learned through a novel LSTM architecture. Extensive experiments on million-sized, real graph datasets show GraphGen to be 4 times faster on average than state-of-the-art techniques while being significantly better in quality across a comprehensive set of 11 different metrics. Our code is released at https://github.com/idea-iitd/graphgen.
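
To make the graph-to-sequence step concrete, here is a minimal sketch of how one depth-first traversal of a small labeled graph yields a DFS code, written as a list of (t_u, t_v, node label, edge label, node label) tuples. The function names `dfs_code` and `graph_from_dfs_code` and the toy labels are illustrative assumptions and are not taken from the GraphGen repository. The sketch does not compute the minimum DFS code: it only shows that different traversals give different codes, which is why a canonical choice among them is needed, and that a code can be decoded back into the labeled graph.

```python
from typing import Dict, List, Tuple

# One DFS-code entry: (t_u, t_v, label_u, label_edge, label_v),
# where t_u and t_v are DFS discovery timestamps.
DfsEdge = Tuple[int, int, str, str, str]


def dfs_code(node_labels: Dict[int, str],
             edge_labels: Dict[Tuple[int, int], str],
             start: int) -> List[DfsEdge]:
    """DFS code of ONE traversal of a connected, simple, undirected graph.

    Illustrative only: this is not the minimum DFS code, which would
    require picking the smallest code over all possible traversals.
    """
    # Symmetric adjacency map: adj[u][v] = edge label.
    adj: Dict[int, Dict[int, str]] = {u: {} for u in node_labels}
    for (u, v), lab in edge_labels.items():
        adj[u][v] = lab
        adj[v][u] = lab

    time: Dict[int, int] = {start: 0}  # node -> discovery timestamp
    emitted = set()                    # undirected edges already in the code
    code: List[DfsEdge] = []

    def emit(u: int, v: int) -> None:
        emitted.add(frozenset((u, v)))
        code.append((time[u], time[v], node_labels[u], adj[u][v], node_labels[v]))

    def visit(u: int) -> None:
        # Backward edges first: links to already-discovered nodes,
        # ordered by the timestamp of the node they point back to.
        back = [v for v in adj[u]
                if v in time and frozenset((u, v)) not in emitted]
        for v in sorted(back, key=time.get):
            emit(u, v)
        # Then forward edges, each followed by the subtree it opens.
        for v in sorted(adj[u]):
            if v not in time:
                time[v] = len(time)
                emit(u, v)
                visit(v)

    visit(start)
    return code


def graph_from_dfs_code(code: List[DfsEdge]):
    """Rebuild the labeled graph, with nodes renamed to their timestamps."""
    node_labels: Dict[int, str] = {}
    edge_labels: Dict[Tuple[int, int], str] = {}
    for t_u, t_v, l_u, l_e, l_v in code:
        node_labels[t_u] = l_u
        node_labels[t_v] = l_v
        edge_labels[(min(t_u, t_v), max(t_u, t_v))] = l_e
    return node_labels, edge_labels


# Hypothetical toy graph: a labeled triangle.
node_labels = {0: "C", 1: "C", 2: "O"}
edge_labels = {(0, 1): "s", (1, 2): "d", (0, 2): "s"}

code_a = dfs_code(node_labels, edge_labels, start=0)
code_b = dfs_code(node_labels, edge_labels, start=2)
```

Here `code_a` and `code_b` describe the same graph but differ as sequences, so a sequence model trained on arbitrary traversals would see many encodings of one graph. Fixing a single canonical code per graph, as the abstract describes with minimum DFS codes, removes that redundancy.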
