Paper title

Natural language processing for clusterization of genes according to their functions

Authors

Vladislav Dordiuk, Ekaterina Demicheva, Fernando Polanco Espino, Konstantin Ushenin

Abstract

Hundreds of methods exist for analyzing data obtained from mRNA sequencing, and most of them focus on a small number of genes. In this study, we propose an approach that reduces the analysis of several thousand genes to the analysis of several clusters. The list of genes is enriched with information from open databases. The descriptions are then encoded as vectors using a pretrained language model (BERT) and some text processing approaches. The encoded gene functions pass through dimensionality reduction and clusterization. To find the most efficient pipeline, 180 pipeline variants with different methods in the major pipeline steps were analyzed. Performance was evaluated with clusterization indexes and expert review of the results.
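The reduce, cluster, and score stages described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not name the specific methods, so PCA, k-means, and the silhouette index are stand-ins chosen here, and the input is synthetic random vectors rather than real BERT encodings of gene descriptions. The function name `cluster_gene_embeddings` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_gene_embeddings(embeddings, n_components=2, n_clusters=3, seed=0):
    """Reduce dimensionality, cluster, and score the clustering.

    Assumption: PCA, k-means, and the silhouette index are illustrative
    choices; the source does not say which methods the authors used.
    """
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(embeddings)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)
    score = silhouette_score(reduced, labels)
    return labels, score


# Synthetic stand-in for encoded gene descriptions:
# 3 well-separated groups of 20 "genes" each, in 32 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 32))
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(20, 32)) for c in centers])

labels, score = cluster_gene_embeddings(embeddings)
print("clusters found:", len(set(labels.tolist())))
print("silhouette index:", float(score))
```

Comparing many pipeline variants, as the abstract describes, would amount to calling a function like this once per combination of encoding, reduction, and clustering method and ranking the resulting index values alongside expert review.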
