A Deep Generative Model for Fragment-Based Molecule Generation

Paper Title

A Deep Generative Model for Fragment-Based Molecule Generation

Authors

Marco Podda, Davide Bacciu, Alessio Micheli

Abstract

Molecule generation is a challenging open problem in cheminformatics. Currently, deep generative approaches addressing the challenge belong to two broad categories, differing in how molecules are represented. One approach encodes molecular graphs as strings of text, and learns their corresponding character-based language model. Another, more expressive, approach operates directly on the molecular graph. In this work, we address two limitations of the former: generation of invalid and duplicate molecules. To improve validity rates, we develop a language model for small molecular substructures called fragments, loosely inspired by the well-known paradigm of Fragment-Based Drug Design. In other words, we generate molecules fragment by fragment, instead of atom by atom. To improve uniqueness rates, we present a frequency-based masking strategy that helps generate molecules with infrequent fragments. We show experimentally that our model largely outperforms other language model-based competitors, reaching state-of-the-art performances typical of graph-based approaches. Moreover, generated molecules display molecular properties similar to those in the training sample, even in absence of explicit task-specific supervision.
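
The abstract names a frequency-based masking strategy but does not spell out its mechanics. The sketch below shows one plausible reading, offered only as an illustration: fragments that occur fewer than `min_count` times in the training corpus are replaced by a shared placeholder token, so the language model's vocabulary is not dominated by a handful of very common fragments. The function names, the `<MASK>` token, the threshold, and the toy fragment strings are all assumptions made for this example, not details taken from the paper.

```python
from collections import Counter

# Hypothetical placeholder token for rare fragments (an assumption of this sketch).
MASK_TOKEN = "<MASK>"


def build_vocab(corpus, min_count):
    """Return the set of fragments seen at least `min_count` times in the corpus.

    `corpus` is a list of molecules, each given as a list of fragment strings.
    """
    counts = Counter(frag for mol in corpus for frag in mol)
    return {frag for frag, count in counts.items() if count >= min_count}


def mask_infrequent(mol, vocab, mask_token=MASK_TOKEN):
    """Replace every fragment outside `vocab` with the placeholder token."""
    return [frag if frag in vocab else mask_token for frag in mol]


# Toy corpus: each molecule is a sequence of SMILES-like fragment strings.
corpus = [
    ["c1ccccc1", "CC", "O"],
    ["c1ccccc1", "CC", "N"],
    ["c1ccccc1", "CS"],
]

vocab = build_vocab(corpus, min_count=2)
masked = [mask_infrequent(mol, vocab) for mol in corpus]

for original, result in zip(corpus, masked):
    print(original, "->", result)
```

In a full pipeline, the masked sequences would be what the fragment-level language model trains on, and a separate step would be needed to map a generated placeholder back to a concrete rare fragment. How that mapping is done is not stated in the abstract and is left open here.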
