PFP数据结构

论文标题

PFP数据结构

PFP Data Structures

论文作者

Boucher, Christina, Cvacho, Ondřej, Gagie, Travis, Holub, Jan, Manzini, Giovanni, Navarro, Gonzalo, Rossi, Massimiliano

论文摘要

Boucher等人引入了无前缀解析（PFP）。（2019年）作为简化基因组数据库洞穴转换（BWTS）的计算的预处理步骤。给定一个字符串$ s $，它产生了词典$ d $和一个重叠短语的解析$ p $，因此可以从$ d $和$ d $和$ p $的时间内计算$ \ mathrm {bwt}（s）$。在实践中，$ d $和$ p $明显小于$ s $，并且从它们的计算$ \ mathrm {bwt}（s）$比直接从$ s $中计算它的效率更高，至少是当$ s $由同一物种的个体组成的基因组成时。在本文中，我们将$ \ mathrm {pfp}（s）$视为{\ em数据结构}，并显示如何将其增强以迅速支持以下查询，但仍位于$ o（| \ mathrm {pfp}（pfp}（s）|）$ space $ space：最多常见（lce），save（lce），safix array（sa），loc torge and bel（lc）。最后，我们提供了实验证据，表明PFP数据结构可以有效地为非常大的重复数据集构建：它需要一个小时和54 GB的峰值存储器，以$ 1000 $的人类染色体19级，最初占用约56 GB。

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT} (S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP} (S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT} (S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP} (S)$ as a {\em data structure} and show how it can be augmented to support the following queries quickly, still in $O (|\mathrm{PFP} (S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB.

下载PDF全文

下载文献需遵守相关版权规定

论文标题