论文标题
串联重复距离问题与有限字母相比很难
The Tandem Duplication Distance Problem is hard over bounded alphabets
论文作者
论文摘要
串联重复表示插入与其原始位置相邻的DNA片段副本的过程。更正式地,可以将串联重复视为将字符串$ s = axb $转换为字符串$ t = axxb的操作,因为它们似乎参与遗传疾病,因此在计算生物学中广泛研究了串联重复。此外,最近在不同的情况下,从形式语言到信息理论到DNA存储系统的错误校正代码,已经研究了串联重复机制。 [Leupold等人,2004年]提出了确定计算两个给定字符串之间串联重复距离的复杂性的问题,最近,对于无界字母的情况,它被证明是NP-HARD [LAFOND等人,STACS2020]。在本文中,我们显着改善了这一结果,并表明,对于字母$ \ \ \ \ \ leq 5的字母而言,串联重复距离问题已经是NP-HARD。解决存在问题的多项式时间算法仅因二元字母而闻名。
A tandem duplication denotes the process of inserting a copy of a segment of DNA adjacent to its original position. More formally, a tandem duplication can be thought of as an operation that converts a string $S = AXB$ into a string $T = AXXB.$ As they appear to be involved in genetic disorders, tandem duplications are widely studied in computational biology. Also, tandem duplication mechanisms have been recently studied in different contexts, from formal languages, to information theory, to error-correcting codes for DNA storage systems. The problem of determining the complexity of computing the tandem duplication distance between two given strings was proposed by [Leupold et al., 2004] and, very recently, it was shown to be NP-hard for the case of unbounded alphabets [Lafond et al., STACS2020]. In this paper, we significantly improve this result and show that the tandem duplication distance problem is NP-hard already for the case of strings over an alphabet of size $\leq 5.$ We also study some special classes of strings were it is possible to give linear time solutions to the existence problem: given strings $S$ and $T$ over the same alphabet, decide whether there exists a sequence of duplications converting $S$ into $T$. A polynomial time algorithm that solves the existence problem was only known for the case of the binary alphabet.