Paper title
EPGAT: Gene Essentiality Prediction With Graph Attention Networks
Paper authors
Paper abstract
The identification of essential genes/proteins is a critical step towards a better understanding of human biology and pathology. Computational approaches have helped to mitigate experimental constraints by exploring machine learning (ML) methods and the correlation of essentiality with biological information, especially protein-protein interaction (PPI) networks, to predict essential genes. Nonetheless, their performance is still limited, as network-based centralities are not exclusive proxies of essentiality, and traditional ML methods are unable to learn from non-Euclidean domains such as graphs. Given these limitations, we propose EPGAT, an approach for essentiality prediction based on Graph Attention Networks (GATs), which are attention-based Graph Neural Networks (GNNs) that operate on graph-structured data. Our model learns patterns of gene essentiality directly from PPI networks, integrating additional evidence from multiomics data encoded as node attributes. We benchmarked EPGAT on four organisms, including humans, and it accurately predicted gene essentiality, with AUC scores ranging from 0.78 to 0.97. Our model significantly outperformed network-based and shallow ML-based methods and achieved very competitive performance against the state-of-the-art node2vec embedding method. Notably, EPGAT was the most robust approach in scenarios with limited and imbalanced training data. Thus, the proposed approach offers a powerful and effective way to identify essential genes and proteins.
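To make the idea of attention over a PPI network with node attributes more concrete, the following is a minimal sketch of a single-head graph attention layer in plain Python, following the standard GAT formulation (LeakyReLU applied to a learned vector dotted with the concatenated projected features, then a softmax over each node's neighborhood). It is not the EPGAT implementation: the toy graph, the gene names, the two-dimensional node attributes, the weight matrix `W`, and the attention vector `a` are all hypothetical, and the final nonlinearity and the per-node essentiality classifier are omitted.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU, used to score node pairs in the standard GAT formulation."""
    return x if x > 0 else slope * x


def matvec(W, h):
    """Multiply a matrix W (list of rows, out_dim x in_dim) by a vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def gat_layer(features, neighbors, W, a):
    """Single-head graph attention layer (illustrative sketch only).

    features:  dict mapping node -> list of attribute values
    neighbors: dict mapping node -> list of adjacent nodes (e.g. PPI partners)
    W:         projection matrix, out_dim x in_dim
    a:         attention vector of length 2 * out_dim
    Returns (new_features, attention), where attention[v][u] is the weight
    that node v assigns to node u in its neighborhood (self-loop included).
    """
    # Project every node's attributes with the shared weight matrix.
    z = {v: matvec(W, h) for v, h in features.items()}
    out, attention = {}, {}
    for v in features:
        # Neighborhood of v, with a self-loop so v attends to itself too.
        hood = [v] + [u for u in neighbors.get(v, []) if u != v]
        # Unnormalized score: LeakyReLU(a . [z_v || z_u]).
        scores = [
            leaky_relu(sum(ai * x for ai, x in zip(a, z[v] + z[u])))
            for u in hood
        ]
        # Numerically stable softmax over the neighborhood.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[v] = dict(zip(hood, alphas))
        # New representation: attention-weighted sum of projected features.
        out[v] = [
            sum(al * z[u][k] for al, u in zip(alphas, hood))
            for k in range(len(W))
        ]
    return out, attention


# Hypothetical toy PPI graph: three genes, edges A-B and B-C.
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical (identity) projection
a = [0.5, -0.5, 0.5, -0.5]     # hypothetical attention parameters

embeddings, attention = gat_layer(features, neighbors, W, a)
```

In a trained model, `W` and `a` would be learned, and the resulting node embeddings would feed a classifier that scores each gene as essential or non-essential. When `a` is all zeros the attention weights become uniform, so the layer reduces to a plain neighborhood average, which is a convenient sanity check.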