论文标题
对高效成对序列比对的排名一型模型的自适应学习
Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment
论文作者
论文摘要
DNA测序数据的成对比对是生物信息学中普遍存在的任务,通常代表沉重的计算负担。最先进的方法来加快此任务用途,以识别由成对读数共享的简短片段(K-MER),然后可以用来估计对齐分数。但是,当读取数量很大时,所有对的对齐得分准确地估算了对齐得分的成本仍然很高。此外,实际上,一个人只对识别具有较高对准分数的读数对。在这项工作中,我们提出了一种基于两种主要新成分的新方法来进行成对对齐估算。第一种成分是在排名一号众包模型的一般框架下提出成对对齐估计的问题,在该模型的一般框架中,工人的响应对应于K-Mer Hash碰撞。这些模型可以通过响应矩阵的光谱分解来准确求解。第二个成分是利用多武器的匪徒算法来适应该光谱估计量,仅用于可能具有较大对准的读取对。所得算法迭代对读取对的自适应子集的响应矩阵进行频谱分解。
Pairwise alignment of DNA sequencing data is a ubiquitous task in bioinformatics and typically represents a heavy computational burden. State-of-the-art approaches to speed up this task use hashing to identify short segments (k-mers) that are shared by pairs of reads, which can then be used to estimate alignment scores. However, when the number of reads is large, accurately estimating alignment scores for all pairs is still very costly. Moreover, in practice, one is only interested in identifying pairs of reads with large alignment scores. In this work, we propose a new approach to pairwise alignment estimation based on two key new ingredients. The first ingredient is to cast the problem of pairwise alignment estimation under a general framework of rank-one crowdsourcing models, where the workers' responses correspond to k-mer hash collisions. These models can be accurately solved via a spectral decomposition of the response matrix. The second ingredient is to utilise a multi-armed bandit algorithm to adaptively refine this spectral estimator only for read pairs that are likely to have large alignments. The resulting algorithm iteratively performs a spectral decomposition of the response matrix for adaptively chosen subsets of the read pairs.