Paper Title

Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches

Authors

Paola Bonizzoni, Matteo Costantini, Clelia De Felice, Alessia Petescia, Yuri Pirola, Marco Previtali, Raffaella Rizzi, Jens Stoye, Rocco Zaccagnino, Rosalba Zizza

Abstract

Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors for use in bioinformatics investigations such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in preserving sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide a theoretical and experimental framework to estimate the behaviour of fingerprints and of the $k$-mers extracted from them, called $k$-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads and to assign each read to the most likely gene from which it originated as a fragment of one of that gene's transcripts. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
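
To make the notions of fingerprint and $k$-finger concrete, here is a minimal Python sketch. It assumes the standard Lyndon factorization computed with Duval's algorithm, whereas the abstract refers to variants of the factorization that are not specified here. The function names (`lyndon_factorization`, `fingerprint`, `k_fingers`) are hypothetical and chosen for illustration only; they are not the lyn2vec interface.

```python
def lyndon_factorization(s):
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words.

    Assumption: this is the standard factorization, not the variants
    mentioned in the abstract.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read):
    """Sequence of lengths of the consecutive Lyndon factors of a read."""
    return [len(f) for f in lyndon_factorization(read)]


def k_fingers(fp, k):
    """All windows of k consecutive entries of a fingerprint (hypothetical helper)."""
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]


read = "GATTACA"
print(lyndon_factorization(read))       # ['G', 'ATT', 'AC', 'A']
print(fingerprint(read))                # [1, 3, 2, 1]
print(k_fingers(fingerprint(read), 2))  # [(1, 3), (3, 2), (2, 1)]
```

Under these assumptions, a read of nucleotides becomes a short list of integers, and the $k$-fingers play the same role for fingerprints that $k$-mers play for the raw sequence.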
