Paper Title

Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models

Authors

Pascal Sturmfels, Jesse Vig, Ali Madani, Nazneen Fatema Rajani

Abstract

For protein sequence datasets, unlabeled data has greatly outpaced labeled data due to the high cost of wet-lab characterization. Recent deep-learning approaches to protein prediction have shown that pre-training on unlabeled data can yield useful representations for downstream tasks. However, the optimal pre-training strategy remains an open question. Instead of strictly borrowing from natural language processing (NLP) in the form of masked or autoregressive language modeling, we introduce a new pre-training task: directly predicting protein profiles derived from multiple sequence alignments. Using a set of five, standardized downstream tasks for protein models, we demonstrate that our pre-training task along with a multi-task objective outperforms masked language modeling alone on all five tasks. Our results suggest that protein sequence models may benefit from leveraging biologically-inspired inductive biases that go beyond existing language modeling techniques in NLP.
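To make the pre-training target concrete: a protein "profile" is a per-position amino-acid frequency matrix computed from a multiple sequence alignment (MSA), and the paper's task asks the model to predict such profiles from a single input sequence. The following is a minimal, hypothetical sketch of how a profile can be derived from an MSA; the function name, gap handling, and toy alignment are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: derive a per-position amino-acid frequency
# profile from a multiple sequence alignment (MSA). This is an
# illustration of the profile concept, not the paper's implementation.
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def msa_profile(alignment):
    """Return, for each alignment column, a frequency distribution
    over the 20 standard amino acids.

    `alignment` is a list of equal-length aligned sequences; gap
    characters ('-') are skipped when counting.
    """
    length = len(alignment[0])
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment if seq[i] != "-")
        total = sum(counts[aa] for aa in AMINO_ACIDS) or 1  # avoid div-by-zero
        profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile

# Toy alignment: 3 aligned sequences of length 6 (made-up data).
msa = ["MKV-LA", "MRV-LA", "MKVALS"]
prof = msa_profile(msa)
print(prof[1]["K"])  # column 2: 'K' occurs in 2 of the 3 sequences
```

In the pre-training setup described by the abstract, a matrix like `prof` (one distribution per residue position) serves as the regression/classification target, in place of the masked-token targets used by masked language modeling.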
