A Bin and Hash Method for Analyzing Reference Data and Descriptors in Machine Learning Potentials

Paper title

A Bin and Hash Method for Analyzing Reference Data and Descriptors in Machine Learning Potentials

Authors

Paleico, Martín Leandro, Behler, Jörg

Abstract

In recent years, the development of machine learning (ML) potentials (MLPs) has become a very active field of research. Numerous approaches have been proposed that make it possible to perform extended simulations of large systems at a small fraction of the computational cost of electronic structure calculations. The key to the success of modern ML potentials is the close-to-first-principles quality of their description of the atomic interactions. This accuracy is reached by using very flexible functional forms in combination with high-level reference data from electronic structure calculations. These data sets can include up to hundreds of thousands of structures covering millions of atomic environments to ensure that all relevant features of the potential energy surface are well represented. Handling such large data sets is becoming one of the main challenges in the construction of ML potentials. In this paper we present a method, the bin-and-hash (BAH) algorithm, to overcome this problem by enabling the efficient identification and comparison of large numbers of multidimensional vectors. Such vectors emerge in multiple contexts in the construction of ML potentials. Examples are: the comparison of local atomic environments to identify and avoid unnecessary redundant information in the reference data sets, which is costly in terms of both the electronic structure calculations and the training process; the assessment of the quality of the descriptors used as structural fingerprints in many types of ML potentials; and the detection of possibly unreliable data points. The BAH algorithm is illustrated for the example of high-dimensional neural network potentials using atom-centered symmetry functions for the geometrical description of the atomic environments, but the method is general and can be combined with any current type of ML potential.
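The abstract does not spell out how the BAH algorithm works, so the following is only a minimal sketch of the general idea its name suggests: discretize each component of a descriptor vector into a bin, then hash the resulting tuple of bin indices so that vectors falling into the same grid cell can be found without pairwise comparison. The uniform grid, the function names `bin_vector` and `group_by_bin_hash`, the parameter `n_bins`, and the sample vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import defaultdict


def bin_vector(vec, lower, upper, n_bins):
    """Map each component of vec to an integer bin index on a uniform grid.

    Assumption: every dimension uses the same number of equal-width bins
    between its lower and upper bound; out-of-range values are clamped.
    """
    indices = []
    for x, lo, hi in zip(vec, lower, upper):
        width = (hi - lo) / n_bins
        idx = math.floor((x - lo) / width)
        indices.append(min(max(idx, 0), n_bins - 1))
    return tuple(indices)


def group_by_bin_hash(vectors, lower, upper, n_bins):
    """Group vector indices by binned key; the dict hashes the key tuple."""
    groups = defaultdict(list)
    for i, vec in enumerate(vectors):
        groups[bin_vector(vec, lower, upper, n_bins)].append(i)
    return dict(groups)


# Hypothetical 3-component descriptor vectors for three atomic environments.
vectors = [
    (0.12, 0.52, 0.93),
    (0.14, 0.55, 0.96),  # close to the first vector in every component
    (0.83, 0.27, 0.41),
]
groups = group_by_bin_hash(vectors, lower=(0, 0, 0), upper=(1, 1, 1), n_bins=10)

# Cells holding more than one vector are candidates for redundant environments.
redundant = [ids for ids in groups.values() if len(ids) > 1]
```

Grouping through a hash table takes a single pass over the data, which is what makes this kind of comparison feasible for millions of vectors. One limitation of this simple sketch is that two nearly identical vectors on opposite sides of a bin boundary receive different keys, so the bin width has to be chosen with the intended similarity tolerance in mind.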
