Paper Title
MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular Representation Learning
Paper Authors
Paper Abstract
Molecular representation learning is crucial for the problem of molecular property prediction, where graph neural networks (GNNs) serve as an effective solution due to their structure modeling capabilities. Since labeled data is often scarce and expensive to obtain, it is a great challenge for GNNs to generalize in the extensive molecular space. Recently, the training paradigm of "pre-train, fine-tune" has been leveraged to improve the generalization capabilities of GNNs. It uses self-supervised information to pre-train the GNN, and then performs fine-tuning to optimize the downstream task with just a few labels. However, pre-training does not always yield statistically significant improvement, especially for self-supervised learning with random structural masking. In fact, the molecular structure is characterized by motif subgraphs, which are frequently occurring and influence molecular properties. To leverage the task-related motifs, we propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT). MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt. The prompt effectively augments the molecular graph with meaningful motifs in the continuous representation space; this provides more structural patterns to aid the downstream classifier in identifying molecular properties. Extensive experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction, with or without a few fine-tuning steps.
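The core idea of motif prompting — embedding the molecule and its motifs with a frozen pre-trained encoder, then combining them in the continuous representation space before classification — can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual method: the `encode` function stands in for a pre-trained GNN, the random features are placeholders, and the attention-style motif mixing is one plausible way to realize the prompt.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Stand-in for a frozen pre-trained GNN encoder:
    mean-pool node features, then apply a nonlinear projection."""
    return np.tanh(x.mean(axis=0) @ W)

d_in, d_emb = 8, 16
W = rng.normal(size=(d_in, d_emb))           # frozen pre-trained weights (hypothetical)

mol_nodes   = rng.normal(size=(10, d_in))    # node features of the input molecule
motif_nodes = [rng.normal(size=(3, d_in)),   # node features of two candidate motifs
               rng.normal(size=(4, d_in))]

h_mol    = encode(mol_nodes, W)                           # molecule embedding
h_motifs = np.stack([encode(m, W) for m in motif_nodes])  # motif embeddings

# Motif prompting in the continuous space: weight each motif by its
# relevance to the molecule, then add the weighted mix to the embedding.
scores     = h_motifs @ h_mol
weights    = np.exp(scores) / np.exp(scores).sum()
h_prompted = h_mol + weights @ h_motifs      # prompted embedding for the classifier
```

Only the downstream classifier (and optionally the prompt parameters) would then be tuned on `h_prompted`, which is what makes the "pre-train, prompt, fine-tune" paradigm cheap relative to full fine-tuning.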