Paper title
Inferring taxonomic placement from DNA barcoding allowing discovery of new taxa
Paper authors
Paper abstract
In ecology, it has become common to apply DNA barcoding to biological samples, producing datasets that contain a large number of nucleotide sequences. The focus is then on inferring the taxonomic placement of each sequence by leveraging existing databases of reference sequences from known taxa. This is highly challenging because i) sequencing is typically available only for a relatively small region of the genome, owing to cost considerations; and ii) many of the sequences come from organisms that are either unknown to science or lack reference sequences. These issues can lead to substantial classification uncertainty, particularly when inferring new taxa. To address these challenges, we propose a new class of Bayesian nonparametric taxonomic classifiers, BayesANT, which use species sampling model priors to allow new taxa to be discovered at each taxonomic rank. Using a simple product multinomial likelihood with conjugate Dirichlet priors at the lowest rank, we develop a highly efficient algorithm that provides a probabilistic prediction of the taxonomic placement of each sequence at each rank. BayesANT is shown to perform excellently on real data, including when many sequences in the test set belong to taxa unobserved in training.
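To make the two ingredients named in the abstract concrete, the following is a minimal Python sketch of a single-rank classifier that combines a product multinomial likelihood under conjugate Dirichlet priors with a species sampling prior that reserves mass for an unseen taxon. It is an illustration under stated assumptions, not the BayesANT implementation: the Pitman-Yor form of the prior, the symmetric Dirichlet parameter `alpha`, the values of `theta` and `sigma`, the requirement that sequences be aligned and of equal length, and every function name are hypothetical choices made here. The hierarchy over taxonomic ranks described in the abstract is omitted.

```python
import math
from collections import Counter

ALPHABET = "ACGT"
NEW = "<new taxon>"  # hypothetical label for a taxon unseen in training


def fit_counts(sequences, labels):
    """Collect per-taxon, per-position nucleotide counts from aligned sequences."""
    counts = {}
    sizes = Counter(labels)  # number of training sequences per taxon
    for seq, taxon in zip(sequences, labels):
        per_pos = counts.setdefault(taxon, [Counter() for _ in seq])
        for pos, base in enumerate(seq):
            per_pos[pos][base] += 1
    return counts, sizes


def log_predictive(seq, per_pos, n_taxon, alpha):
    """Log Dirichlet-multinomial predictive of seq, treating positions as independent.

    With per_pos=None and n_taxon=0 this reduces to the prior predictive,
    which is what an unseen taxon would assign to the sequence.
    """
    total = 0.0
    for pos, base in enumerate(seq):
        c = per_pos[pos][base] if per_pos is not None else 0
        total += math.log((alpha + c) / (len(ALPHABET) * alpha + n_taxon))
    return total


def classify(seq, counts, sizes, alpha=0.5, theta=1.0, sigma=0.25):
    """Posterior probabilities over observed taxa plus one new taxon.

    Assumes a Pitman-Yor species sampling prior (an assumption of this sketch):
    observed taxon t has weight (n_t - sigma) / (theta + n), and a new taxon
    has weight (theta + sigma * K) / (theta + n), where K is the number of
    observed taxa and n the number of training sequences.
    """
    n = sum(sizes.values())
    k = len(sizes)
    log_post = {}
    for taxon, n_t in sizes.items():
        log_prior = math.log((n_t - sigma) / (theta + n))
        log_post[taxon] = log_prior + log_predictive(seq, counts[taxon], n_t, alpha)
    log_prior_new = math.log((theta + sigma * k) / (theta + n))
    log_post[NEW] = log_prior_new + log_predictive(seq, None, 0, alpha)
    # Normalise in log space for numerical stability.
    m = max(log_post.values())
    z = sum(math.exp(v - m) for v in log_post.values())
    return {t: math.exp(v - m) / z for t, v in log_post.items()}


if __name__ == "__main__":
    # Toy, made-up training data: two taxa with aligned length-4 sequences.
    train_seqs = ["AAAA", "AAAA", "AAAT", "CCCC", "CCCG"]
    train_labels = ["taxonA", "taxonA", "taxonA", "taxonB", "taxonB"]
    counts, sizes = fit_counts(train_seqs, train_labels)
    for query in ("AAAA", "GTGT"):
        probs = classify(query, counts, sizes)
        print(query, max(probs, key=probs.get))
```

In this toy setting a query that matches the training sequences of one taxon is assigned to it, while a query unlike anything in training receives most of its probability under the new-taxon label. Because the Dirichlet prior is conjugate, every predictive term is a closed-form ratio of counts, so no sampling is needed; this is the kind of simplification that makes an efficient algorithm plausible, though the paper's actual algorithm may differ in its details.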