Paper title
Estimation of genome size using k-mer frequencies from corrected long reads
Paper authors
Paper abstract
Third-generation long-read sequencing technologies, such as PacBio and Nanopore, have great advantages over second-generation Illumina sequencing in de novo assembly studies. However, because of their inherently low base accuracy, uncorrected third-generation sequencing data cannot be used for k-mer counting or for estimating genomic profiles from k-mer frequencies. Thus, current genome projects still require second-generation data to determine genome size and other genomic characteristics accurately. We show that corrected third-generation data can be used to count k-mer frequencies and estimate genome size reliably, in place of second-generation data. Therefore, future genome projects can rely on a single sequencing technology for both assembly and k-mer analysis, which will substantially reduce sequencing cost in both time and money. Moreover, we present a fast, lightweight tool, kmerfreq, and use it to perform all the k-mer counting tasks in this work. In summary, we have demonstrated that corrected third-generation sequencing data can be used to estimate genome size, and we have developed a new open-source C/C++ k-mer counting tool, kmerfreq, which is freely available at https://github.com/fanagislab/kmerfreq.
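To make the idea of estimating genome size from k-mer frequencies concrete, the following is a minimal Python sketch of the commonly used relation "genome size ≈ total number of k-mers / peak k-mer depth". It is not the kmerfreq implementation and not necessarily the exact procedure used in the paper. The function names, the `min_depth` cutoff for discarding low-frequency (likely erroneous) k-mers, and the toy sequence are all assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for simplicity.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map each depth to the number of distinct k-mers observed at that depth."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (peak depth).

    min_depth is a hypothetical cutoff: k-mers seen fewer times than this
    are treated as sequencing errors and excluded from the estimate.
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Peak depth: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy example: an error-free 20-base "genome" sequenced five times in full.
    genome = "ACGTTGCAAGGCTTACCGAT"
    reads = [genome] * 5
    hist = frequency_histogram(count_kmers(reads, k=5))
    print(estimate_genome_size(hist))
```

In this toy case every k-mer occurs at the same depth, so the estimate equals the number of k-mer positions in the sequence. Real data produce a spread of depths around the peak, plus a low-depth tail of erroneous k-mers that the cutoff is meant to exclude.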