论文标题
通过位置洞穴 - 轮式转换来教授洞穴 - 轮毂变换
Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform
论文作者
论文摘要
Burrows-Wheeler Transform(BWT)经常在算法生物信息学的本科课程中教授,因为它是FM索引的基础,因此是Bowtie和Bwa等重要工具。它的仰慕者认为BWT是一个美丽的事物,但是尽管有成千上万的页面在近三十年的时间里写了成千上万的页面,但本科生首次看到它通常仍然是魔术。后来,一些坚持不懈的人被展示了BWT的位置BWT(PBWT),该位置在BWT二十年后出版。在本文中,我们认为应该教pbwt {\ em之前} bwt。 我们首先使用PBWT与左侧的radix排序的密切关系来解释如何将其用作一组字符串上{\ em Positional Search}的快速和空间效率的索引(即给定的模式和位置,快速列出包含该模式以该位置启动的字符串)。然后,我们观察到{\ em前缀search}(列出所有以模式开头的字符串)是一个简单的位置搜索的特殊情况,并且在单个字符串的后缀上的前缀搜索等于{\ em substring search}在该字符串中(列出字符串中图案中所有开始位置的启动位置)。 存储字符串后缀的幼稚的pbwt是空间 - {\ em效率低下},但在相当小的示例中,它的大多数列几乎相同。不难证明,如果我们存储字符串的循环移位的PBWT,而不是其后缀,那么所有列都是完全相同的 - 并且等于字符串的BWT。因此,我们可以通过PBWT教授BWT和FM索引。
The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught {\em before} the BWT. We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for {\em positional search} on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that {\em prefix search} (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to {\em substring search} in that string (listing all the starting positions of occurrences of the pattern in the string). Storing naïvely a PBWT of the suffixes of a string is space-{\em inefficient} but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.