Paper title
Assessing the Effects of Hyperparameters on Knowledge Graph Embedding Quality
Paper authors
Paper abstract
Embedding knowledge graphs into low-dimensional spaces is a popular method for applying approaches, such as link prediction or node classification, to these databases. This embedding process is very costly in terms of both computational time and space. Part of the reason for this is the optimisation of hyperparameters, which involves repeatedly sampling, by random, guided, or brute-force selection, from a large hyperparameter space and testing the resulting embeddings for their quality. However, not all hyperparameters in this search space will be equally important. In fact, with prior knowledge of the relative importance of the hyperparameters, some could be eliminated from the search altogether without significantly impacting the overall quality of the outputted embeddings. To this end, we ran a Sobol sensitivity analysis to evaluate the effects of tuning different hyperparameters on the variance of embedding quality. This was achieved by performing thousands of embedding trials, each time measuring the quality of embeddings produced by different hyperparameter configurations. We regressed the embedding quality on those hyperparameter configurations, using this model to generate Sobol sensitivity indices for each of the hyperparameters. By evaluating the correlation between Sobol indices, we find substantial variability in the hyperparameter sensitivities between knowledge graphs, with differing dataset characteristics being the probable cause of these inconsistencies. As an additional contribution of this work we identify several relations in the UMLS knowledge graph that may cause data leakage via inverse relations, and derive and present UMLS-43, a leakage-robust variant of that graph.
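The analysis described in the abstract, fitting a regression model to embedding trials and then deriving Sobol indices from that model, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: `surrogate_quality` is a hypothetical stand-in for the fitted regression model, the three hyperparameters (learning rate, embedding dimension, margin) and their coefficients are invented, all inputs are scaled to [0, 1], and the first-order indices are estimated with the standard Saltelli-style Monte Carlo estimator using only the Python standard library.

```python
import random


def surrogate_quality(x):
    """Hypothetical surrogate for embedding quality (NOT the paper's model).

    x holds three hyperparameters scaled to [0, 1]; the names and
    coefficients are assumptions chosen so that the true first-order
    indices are known: 16/17, 1/17 and 0.
    """
    learning_rate, embedding_dim, margin = x
    return 4.0 * learning_rate + 1.0 * embedding_dim + 0.0 * margin


def first_order_sobol(model, n_params, n_samples=20000, seed=0):
    """Estimate first-order Sobol indices by Monte Carlo sampling."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B over the unit hypercube.
    A = [[rng.random() for _ in range(n_params)] for _ in range(n_samples)]
    B = [[rng.random() for _ in range(n_params)] for _ in range(n_samples)]
    f_a = [model(row) for row in A]
    f_b = [model(row) for row in B]

    # Total output variance, estimated from both sets of evaluations.
    outputs = f_a + f_b
    mean = sum(outputs) / len(outputs)
    variance = sum((y - mean) ** 2 for y in outputs) / len(outputs)

    indices = []
    for i in range(n_params):
        # AB_i: each row of A with its i-th entry replaced by B's i-th entry.
        f_ab = [model(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Centring f(B) leaves the expectation unchanged and reduces noise.
        partial = sum(
            (yb - mean) * (yab - ya) for yb, yab, ya in zip(f_b, f_ab, f_a)
        ) / n_samples
        indices.append(partial / variance)
    return indices


indices = first_order_sobol(surrogate_quality, n_params=3)
for name, s in zip(["learning_rate", "embedding_dim", "margin"], indices):
    print(f"{name}: first-order Sobol index ~ {s:.3f}")
```

In this toy setting the third hyperparameter has an index of zero, which is the kind of result that would suggest dropping it from the hyperparameter search. With a real fitted model, the same estimator would be applied to the model's predictions rather than to a hand-written function.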