Paper title
Transfer learning for solvation free energies: from quantum chemistry to experiments
Paper authors
Paper abstract
Data scarcity, bias, and experimental noise are frequently encountered problems when applying deep learning to chemistry and materials science. Transfer learning has proven effective in compensating for a lack of data. Using quantum calculations in machine learning enables the generation of a diverse dataset and ensures that learning is less affected by the noise inherent to experimental databases. In this work, we propose a transfer learning approach for the prediction of solvation free energies that combines fundamentals from quantum calculations with the higher accuracy of experimental measurements. The model architecture is based on the directed message passing neural network (D-MPNN), which is used for the molecular embedding of solvent and solute molecules. Models pre-trained on quantum calculations show a significant advantage for small experimental datasets and for out-of-sample predictions. The improved out-of-sample performance is shown for new solvents, for new solute elements, and for the extension to solutes of higher molar mass. The overall performance of the pre-trained models is limited by the noise in the experimental test data, known as the aleatoric uncertainty. On a random test split, a mean absolute error of 0.21 kcal/mol is achieved, a significant improvement over the mean absolute error of the quantum calculations (0.40 kcal/mol). The error can be further reduced to 0.09 kcal/mol if the model performance is assessed on a more accurate subset of the experimental data.
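The pre-train-then-fine-tune idea in the abstract can be illustrated with a deliberately tiny sketch. It is not the authors' model: a one-feature linear regressor stands in for the message passing network, all data are synthetic, and the names (`fit`, `truth`, `QM_OFFSET`) and numbers are hypothetical choices made only for illustration. The sketch assumes that the plentiful "quantum" labels carry a systematic offset, that the few "experimental" labels carry small noise, and that only the intercept is updated during fine-tuning. The abstract does not say which parameters the authors fine-tune, so the frozen slope is an assumption of this example.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the targets (kcal/mol here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def predict(w, b, xs):
    """Linear stand-in model: y = w * x + b."""
    return [w * x + b for x in xs]


def fit(xs, ys, w=0.0, b=0.0, lr=0.05, epochs=2000, train_w=True):
    """Full-batch gradient descent on half the mean squared error.

    Passing w and b warm-starts from earlier training; train_w=False
    freezes the slope so that only the intercept is fine-tuned.
    """
    n = len(xs)
    for _ in range(epochs):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        if train_w:
            w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def truth(x):
    """Hypothetical noise-free target used to build the synthetic data."""
    return 2.0 * x + 1.0


QM_OFFSET = 0.5  # arbitrary systematic error of the synthetic "quantum" labels

# Large, noise-free but biased "quantum" dataset.
xs_qm = [0.1 * i for i in range(31)]
ys_qm = [truth(x) + QM_OFFSET for x in xs_qm]

# Small "experimental" dataset with fixed, hand-picked noise.
xs_exp = [0.5, 1.5, 2.5]
ys_exp = [truth(x) + e for x, e in zip(xs_exp, [0.10, -0.10, 0.06])]

# Noise-free test points for evaluation.
xs_test = [0.25, 1.0, 2.0, 2.75]
ys_test = [truth(x) for x in xs_test]

# Stage 1: pre-train on the abundant quantum-like labels.
w_pre, b_pre = fit(xs_qm, ys_qm)

# Stage 2: fine-tune only the intercept on the scarce experimental labels.
w_ft, b_ft = fit(xs_exp, ys_exp, w=w_pre, b=b_pre, epochs=200, train_w=False)

mae_pretrained = mae(ys_test, predict(w_pre, b_pre, xs_test))
mae_finetuned = mae(ys_test, predict(w_ft, b_ft, xs_test))
print("MAE after pre-training only:", mae_pretrained)
print("MAE after fine-tuning:", mae_finetuned)
```

In this toy setup the pre-trained model inherits the systematic offset of its training labels, and fine-tuning on three experimental points removes most of it. The error that remains comes from the noise in those three points, which loosely mirrors the abstract's remark that experimental noise limits the achievable accuracy.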