Title
An Investigation in Optimal Encoding of Protein Primary Sequence for Structure Prediction by Artificial Neural Networks
Authors
Abstract
The use of machine learning and neural networks has risen sharply over the past few years, driven primarily by growing access to data and increasing computational power. It has become increasingly easy to harness machine learning for predictive tasks, and protein structure prediction is one area where neural networks are becoming increasingly popular and successful. Although powerful, artificial neural networks (ANNs) require selection of the most appropriate input/output encoding, architecture, and class to produce optimal results. In this investigation, we explored and evaluated the effect of several conventional and newly proposed input encodings and selected an optimal architecture. We considered 11 variations of input encoding, 11 alternative window sizes, and 7 different architectures. In total, we evaluated 2,541 permutations, training and testing on more than 10,000 protein structures over the course of 3 months. Our investigation concluded that one-hot encoding, the use of LSTMs, and window sizes of 9, 11, and 15 produce the best outcome. Through this optimization, we improved the quality of protein structure prediction, predicting the ϕ dihedrals to within 14° to 16° and the ψ dihedrals to within 23° to 25°. This is a notable improvement over previous similar investigations.
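To make the terms "one-hot encoding" and "window size" concrete, the following is a minimal sketch of how a primary sequence might be turned into windowed one-hot features. It is not the authors' code: the function names, the 20-letter alphabet ordering, the `-` padding at the termini, and the flattening of each window into one vector are all assumptions made for illustration. A recurrent model such as an LSTM would more likely consume each window as a sequence of 20-element vectors than as a flat vector.

```python
# Hypothetical illustration only; names and the padding scheme are assumptions,
# not taken from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """Return a 20-element 0/1 vector; padding or unknown residues map to all zeros."""
    vec = [0] * len(AMINO_ACIDS)
    if residue in INDEX:
        vec[INDEX[residue]] = 1
    return vec


def windowed_one_hot(sequence, window):
    """Encode each residue together with its neighbours in a centred, odd-sized window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    padded = "-" * half + sequence + "-" * half  # '-' pads both termini
    encoded = []
    for i in range(len(sequence)):
        flat = []
        for residue in padded[i:i + window]:
            flat.extend(one_hot(residue))
        encoded.append(flat)
    return encoded


# An arbitrary 10-residue example sequence with one of the window sizes
# the abstract reports as performing well.
features = windowed_one_hot("MKTAYIAKQR", window=9)
print(len(features), len(features[0]))  # 10 180
```

Each residue yields one feature vector of length `window * 20`, so a window of 9 gives 180 inputs per position; windows of 11 and 15 would give 220 and 300 under the same assumptions.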