Paper Title

Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis

Authors

Nabil Ibtehaz, S. M. Shakhawat Hossain Sourav, Md. Shamsuzzoha Bayzid, M. Sohel Rahman

Abstract

Background: The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often referred to as the 'language of life', have been analyzed for a multitude of applications and inferences. Motivation: Owing to the rapid development of deep learning, there have been a number of breakthroughs in Natural Language Processing in recent years. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and attempted to incorporate some biological insights into it. Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is capable of mapping similar $k$-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid the modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
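To make the setting concrete, the following is a minimal sketch of how a protein sequence might be split into $k$-mers and turned into the (target, context) pairs that a standard Skip-gram model trains on. It is not the Align-gram method itself, which the abstract does not detail. The overlapping tokenization, the choice of $k = 3$, the window size, the helper names `kmers` and `skipgram_pairs`, and the example sequence are all assumptions made for illustration.

```python
def kmers(sequence, k=3):
    """Return the overlapping k-mers of a sequence, left to right.

    Overlapping tokenization with k=3 is an illustrative assumption.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for tokens within `window` positions.

    These are the training pairs of a plain Skip-gram model; any
    biological weighting of the pairs is outside this sketch.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Hypothetical short sequence, used only to exercise the helpers.
tokens = kmers("MKTAYIAK", k=3)
pairs = skipgram_pairs(tokens, window=1)
```

A plain Skip-gram model learns embeddings purely from such co-occurrence pairs; the abstract states that Align-gram additionally aims to place similar $k$-mers close to each other in the vector space.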
