论文标题

水合结构:用于基因组组装的安全和完整算法的通用框架

The Hydrostructure: a Universal Framework for Safe and Complete Algorithms for Genome Assembly

论文作者

Cairo, Massimo, Khan, Shahbaz, Rizzi, Romeo, Schmidt, Sebastian, Tomescu, Alexandru I., Zirondelli, Elia C.

论文摘要

基因组组装是生物信息学中的一个基本问题,需要从一组读数(从基因组中测序的短字符串)中重建一个源基因组。基因组组装解决方案的概念是图形的弧形行走。由于集会图承认许多解决方案,因此目标是找到所有解决方案中绝对存在的问题,或者是安全的。大多数实用的汇编机都是基于启发式方法的核心单位,即内部节点的路径具有单位内度和高度,并且显然是安全的。最近解决了找到解决方案的所有安全部分的长期开放问题[Repomb 2016],重叠长度增加了60%。这种安全而完整的基因组组装算法之后是其他作品改善了时间边界,并扩展了不同组装解决方案概念的结果。但是,对于实际适用性的基因组组装模型,是否也可以完成,这是保持开放的。 在本文中,我们提出了一个通用框架,用于获取安全,完整的算法,该算法可以统一先前的结果,同时还可以轻松地概括用于组装问题,包括许多实际方面。这是基于一种新颖的图形结构,称为步行的水解结构,该结构从步行的角度突出了图形的可及性能。水液结构允许对现有安全步行及其新实用版本的简单特征。我们几乎所有的特征都直接适用于最佳验证算法和简单的枚举算法。这些算法中的大多数也可以通过增量计算过程和特定模型的先前最佳算法提高到最佳性。

Genome assembly is a fundamental problem in Bioinformatics, requiring to reconstruct a source genome from an assembly graph built from a set of reads (short strings sequenced from the genome). A notion of genome assembly solution is that of an arc-covering walk of the graph. Since assembly graphs admit many solutions, the goal is to find what is definitely present in all solutions, or what is safe. Most practical assemblers are based on heuristics having at their core unitigs, namely paths whose internal nodes have unit in-degree and out-degree, and which are clearly safe. The long-standing open problem of finding all the safe parts of the solutions was recently solved [RECOMB 2016] yielding a 60% increase in contig length. This safe and complete genome assembly algorithm was followed by other works improving the time bounds, as well as extending the results for different notions of assembly solution. But it remained open whether one can be complete also for models of genome assembly of practical applicability. In this paper we present a universal framework for obtaining safe and complete algorithms which unify the previous results, while also allowing for easy generalisations to assembly problems including many practical aspects. This is based on a novel graph structure, called the hydrostructure of a walk, which highlights the reachability properties of the graph from the perspective of the walk. The hydrostructure allows for simple characterisations of the existing safe walks, and of their new practical versions. Almost all of our characterisations are directly adaptable to optimal verification algorithms, and simple enumeration algorithms. Most of these algorithms are also improved to optimality using an incremental computation procedure and a previous optimal algorithm of a specific model.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源