R-Enum：BWT-Runs有界空间中特征子字的枚举

论文标题

R-Enum：BWT-Runs有界空间中特征子字的枚举

R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

论文作者

Nishimoto, Takaaki, Tabei, Yasuo

论文摘要

给定的字符串中列举特征的子字（例如，最大重复序列，最小独特的子字符串和最小缺乏单词）一直是一个重要的研究主题，因为在弦乐处理和计算生物学等各个领域中都有各种各样的应用。尽管已经提出了一些针对特征性子字符串的枚举算法，但它们的空间用途与输入字符串的长度成正比，它们的空间效率不高。最近，运行长度的编码洞穴 - 轮毂变换（RLBWT）引起了弦乐处理的越来越多的关注，并且已经开发了RLBWT的各种算法。但是，使用RLBWT开发针对特征基因的枚举算法仍然是一个挑战。在本文中，我们介绍了R-Enum（基于RLBWT的枚举），这是基于RLBWT的特征性基因的第一种枚举算法。 r-enum以$ o（n \ log \ log \ log（n/r））$时间运行，并带有$ o（r \ log n）$的工作空间，用于字符串长度$ n $和rlbwt中的数字$ r $ r $，其中$ r $的运行预计将明显小于高度重复的弦乐（即与许多复杂性的弦乐）相当小的$ n $。使用高度重复字符串的基准数据集的实验表明，R-Enum的结果比以前的结果更高。此外，我们通过对100个人类基因组的300吉比亚特字符串进行实验来证明R-Enum对巨大的弦的适用性。

Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in $O(n \log \log (n/r))$ time and with $O(r \log n)$ bits of working space for string length $n$ and number $r$ of runs in RLBWT, where $r$ is expected to be significantly smaller than $n$ for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.

下载PDF全文

下载文献需遵守相关版权规定

论文标题