论文标题
用于构建大型隧道式旋转器图的无前缀解析
Prefix-free parsing for building large tunnelled Wheeler graphs
论文作者
论文摘要
我们提出了一种新技术,用于为大型重复的文本收集创建一个空间效率指数,例如包含来自同一物种许多个体的序列的Pangenomic数据库。我们结合了来自该领域的两种最新技术:Wheeler图(Gagie等,2017)和无前解析(PFP,Boucher等,2019)。 Wheeler图(WGS)是一个一般框架,其中包括基于Burrows-Wheeler Transform(BWT)的几个索引,例如FM索引。惠勒图承认了一种简洁的表示,可以通过采用隧道的概念进一步压实,该想法以平行的,同样标记的路径的形式利用冗余,称为块,可以将其合并为单个路径。找到用于隧道的最佳块集的问题,即最小化所得WG大小的隧道块,已知是NP完整的,并且仍然是隧道过程中最具挑战性的部分。 为了在更少的时间内找到一组足够的块,我们提出了一种基于无前缀解析(PFP)的新方法。 PFP的想法是将输入文本分为大小相等大小的短语,这些短语与固定数量的字符重叠。原始文本由一系列短语等级表示(解析)和所有使用的短语(字典)的列表。在重复的文本中,文本的PFP通常比原始文本短得多。为了加快用于隧道的块选择,我们使用PFP来获取文本的解析和词典,使用现有的启发式学隧道隧道隧道的tunnel tunnel tunnel tunnel tunnel tunnel tunnel the tunnel parse,然后使用此隧道解析来构造原始文本的紧凑型wg。与没有PFP的原始文本构造WG相比,我们的方法更快,并且在pangenomic序列的集合中使用的记忆更少。因此,我们的方法可以将WGS用作现实世界数据集的Pangenomic参考。
We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on the prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP of the text is generally much shorter than the original. To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the text, tunnel the WG of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact WG of the original text. Compared with constructing a WG from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of WGs as a pangenomic reference for real-world datasets.