论文标题

索引高度重复的字符串集合

Indexing Highly Repetitive String Collections

论文作者

Navarro, Gonzalo

论文摘要

二十年前,索引字符串集合的突破使得可以在压缩空间中表示它们,同时提供索引搜索功能。随着这项新技术通过生物信息学等应用渗透了,弦乐集合经历了一种超越摩尔定律的增长,并挑战了我们即使以压缩形式处理它们的能力。事实证明,幸运的是,许多这些快速生长的弦乐集收集具有高度重复性,因此它们的信息内容比其朴素的尺寸低。但是,用于经典集合的统计压缩方法对这种重复性视而不见,因此已经开发了一套新的技术,以正确利用它。结果索引形成了新一代的数据结构,能够处理我们面临的巨大重复弦乐集。 在这项调查中,我们涵盖了导致这些数据结构的算法发展。我们描述了用于利用重复性的独特压缩范式,构成了所有现有索引的基础的基本算法思想以及提出的各种结构,并在理论和实际方面进行了比较。我们在这个迷人的领域中遇到了目前的挑战。

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源