论文标题

均匀空间中的底带复杂性

Substring Complexity in Sublinear Space

论文作者

Bernardini, Giulia, Fici, Gabriele, Gawrychowski, Paweł, Pissis, Solon P.

论文摘要

香农的熵是统计压缩的确定下限。不幸的是,没有这样的明确措施来重复字符串的可压缩性。因此,采用了临时措施来估计字符串的重复性,例如,Lempel-Ziv Parse的尺寸$ Z $或Burrows-Wheeler-wheeler Transform的同等流行运行的数量$ r $。最近的一个是最小的弦吸引子的尺寸$γ$。令$ t $为长度$ n $的字符串。 $ t $的字符串吸引子是$ t $的一组职位,捕获了$ t $的所有子字符串的发生。不幸的是,Kempa和Prezza [Stoc 2018]表明,计算$γ$是NP-HARD。 Kociumaka等。 [Latin 2020]考虑了一种基于函数$ s_t(k)$的新的可压缩度量度,计算$ t $的长度$ k $的不同子字符串的数量,也称为$ t $的子字符串复杂性。该新措施定义为$δ= \ sup \ {s_t(k)/k,k \ geq 1 \} $,并下降了先前考虑的所有相关临时度量。特别是,$δ\leqγ$始终保持,$δ$可以使用$ \ Mathcal {o}(n)$ time使用$θ(n)$工作空间计算。 Kociumaka等。表明一个人可以构建一个$ \ MATHCAL {O}(δ\ log \ frac {n}δ)$ - $ t $的大小表示,支持有效的直接访问和$ t $上的有效的直接访问和有效的模式匹配查询。鉴于对于高度可压缩的字符串,$δ$明显小于$ n $,因此提出以下问题是很自然的:我们可以使用sublinear工作空间高效地计算$δ$吗? 我们通过显示以下范围来计算$δ$:$ \ mathcal {o}(\ frac {n^3 \ log b} {b^2})$ time使用$ \ mathcal {o}(o或使用$ \ tilde {\ tilde {\ Mathcal {o}}(b)$ space,对于任何$ b \ in [\ sqrt {n},n] $,in Word ram模型中的任何$ b \ in [\ tilde {\ mathcal {o}}}(n^2/b)$ time $ time $ time。

Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $γ$ of a smallest string attractor. Let $T$ be a string of length $n$. A string attractor of $T$ is a set of positions of $T$ capturing the occurrences of all the substrings of $T$. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $γ$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function $S_T(k)$ counting the number of distinct substrings of length $k$ of $T$, also known as the substring complexity of $T$. This new measure is defined as $δ= \sup\{S_T(k)/k, k\geq 1\}$ and lower bounds all the relevant ad hoc measures previously considered. In particular, $δ\leq γ$ always holds and $δ$ can be computed in $\mathcal{O}(n)$ time using $Θ(n)$ working space. Kociumaka et al. showed that one can construct an $\mathcal{O}(δ\log \frac{n}δ)$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $δ$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $δ$ efficiently using sublinear working space? We address this algorithmic challenge by showing the following bounds to compute $δ$: $\mathcal{O}(\frac{n^3\log b}{b^2})$ time using $\mathcal{O}(b)$ space, for any $b\in[1,n]$, in the comparison model; or $\tilde{\mathcal{O}}(n^2/b)$ time using $\tilde{\mathcal{O}}(b)$ space, for any $b\in[\sqrt{n},n]$, in the word RAM model.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源