论文标题
R-Enum:BWT-Runs有界空间中特征子字的枚举
R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
论文作者
论文摘要
给定的字符串中列举特征的子字(例如,最大重复序列,最小独特的子字符串和最小缺乏单词)一直是一个重要的研究主题,因为在弦乐处理和计算生物学等各个领域中都有各种各样的应用。尽管已经提出了一些针对特征性子字符串的枚举算法,但它们的空间用途与输入字符串的长度成正比,它们的空间效率不高。最近,运行长度的编码洞穴 - 轮毂变换(RLBWT)引起了弦乐处理的越来越多的关注,并且已经开发了RLBWT的各种算法。但是,使用RLBWT开发针对特征基因的枚举算法仍然是一个挑战。在本文中,我们介绍了R-Enum(基于RLBWT的枚举),这是基于RLBWT的特征性基因的第一种枚举算法。 r-enum以$ o(n \ log \ log \ log(n/r))$时间运行,并带有$ o(r \ log n)$的工作空间,用于字符串长度$ n $和rlbwt中的数字$ r $ r $,其中$ r $的运行预计将明显小于高度重复的弦乐(即与许多复杂性的弦乐)相当小的$ n $。使用高度重复字符串的基准数据集的实验表明,R-Enum的结果比以前的结果更高。此外,我们通过对100个人类基因组的300吉比亚特字符串进行实验来证明R-Enum对巨大的弦的适用性。
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in $O(n \log \log (n/r))$ time and with $O(r \log n)$ bits of working space for string length $n$ and number $r$ of runs in RLBWT, where $r$ is expected to be significantly smaller than $n$ for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.