Paper Title
Training Data Selection for Accuracy and Transferability of Interatomic Potentials
Paper Authors
Paper Abstract
Advances in machine learning (ML) techniques have enabled the development of interatomic potentials that promise both the accuracy of first-principles methods and the low cost, linear scaling, and parallel efficiency of empirical potentials. Despite rapid progress in the last few years, ML-based potentials often struggle to achieve transferability, that is, to provide consistent accuracy across configurations that differ significantly from those used to train the model. In order to truly realize the promise of ML-based interatomic potentials, it is therefore imperative to develop systematic and scalable approaches for generating diverse training sets that ensure broad coverage of the space of atomic environments. This work explores a diverse-by-construction approach that leverages the optimization of the entropy of atomic descriptors to create a very large ($>2\cdot10^{5}$ configurations, $>7\cdot10^{6}$ atomic environments) training set for tungsten in an automated manner, i.e., without any human intervention. This dataset is used to train polynomial potentials as well as multiple neural network potentials with different architectures. For comparison, a corresponding family of potentials was also trained on an expert-curated dataset for tungsten. The models trained on the entropy-optimized data exhibited vastly superior transferability compared to the expert-curated models. Furthermore, while the models trained with heavy user input (i.e., domain expertise) yield the lowest errors when tested on similar configurations, out-of-sample predictions are dramatically more robust when the models are trained on a deliberately diverse set of training data. Herein we demonstrate the development of accurate and transferable ML potentials using automated, data-driven approaches for generating large and diverse training sets.
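The abstract does not spell out the entropy-optimization procedure itself, so the sketch below is only an illustrative stand-in: it uses a greedy farthest-point heuristic over per-configuration descriptor summaries as a rough proxy for diversity-driven (entropy-like) training-set selection. The function name select_diverse_configs, the descriptor summarization by per-configuration means, and all parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def select_diverse_configs(descriptors, n_select, rng=None):
    """Greedy farthest-point selection of configurations in descriptor space.

    A crude stand-in for entropy maximization: at each step, pick the
    configuration whose summary descriptor is farthest from the set already
    selected, which tends to spread coverage of descriptor space.

    descriptors: list of (n_atoms_i, d) arrays, one per configuration.
    Returns indices of the selected configurations.
    """
    rng = np.random.default_rng(rng)
    # Summarize each configuration by the mean of its per-atom descriptors.
    summaries = np.stack([d.mean(axis=0) for d in descriptors])
    n = len(summaries)
    selected = [int(rng.integers(n))]            # random seed configuration
    min_dist = np.linalg.norm(summaries - summaries[selected[0]], axis=1)
    for _ in range(min(n_select, n) - 1):
        nxt = int(np.argmax(min_dist))           # farthest from current set
        selected.append(nxt)
        d_new = np.linalg.norm(summaries - summaries[nxt], axis=1)
        min_dist = np.minimum(min_dist, d_new)   # update distance to the set
    return selected

# Toy usage: 500 configurations with random 8-dimensional per-atom descriptors.
configs = [np.random.rand(np.random.randint(4, 16), 8) for _ in range(500)]
print(select_diverse_configs(configs, n_select=10, rng=0))
```

In practice, the paper's approach operates on actual atomic descriptors (e.g., those used by the fitted potentials) and optimizes an explicit entropy objective rather than this simple distance heuristic; the sketch is intended only to convey the idea of selecting configurations for coverage rather than for similarity to known structures.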