Paper Title
On Contamination of Symbolic Datasets
Paper Authors
Paper Abstract
Data taking values on discrete sample spaces are the embodiment of modern biological research. "Omics" experiments produce millions of symbolic outcomes in the form of reads (i.e., DNA sequences of a few dozen to a few hundred nucleotides). Unfortunately, these intrinsically non-numerical datasets are often highly contaminated, and the possible sources of contamination are usually poorly characterized. This contrasts with numerical datasets, where Gaussian-type noise is often well justified. To overcome this hurdle, we introduce the notion of latent weight, which measures the largest expected fraction of samples from a contaminated probabilistic source that conform to a model in a well-structured class of desired models. We examine various properties of latent weights, which we specialize to the class of exchangeable probability distributions. As proof of concept, we analyze DNA methylation data from the 22 human autosome pairs. Contrary to what is usually assumed, we provide strong evidence that highly specific methylation patterns are overrepresented at some genomic locations when contamination is taken into account.
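A minimal formalization sketch of the latent weight described above (the notation here is ours, not taken from the abstract): writing P for the contaminated source and M for the class of desired models, the latent weight can be read as the largest mixture weight that a member of M can carry inside P,

\lambda_{\mathcal{M}}(P) \;=\; \sup\bigl\{\, w \in [0,1] \;:\; P = w\,Q + (1-w)\,R,\ \ Q \in \mathcal{M},\ \ R \text{ an arbitrary probability distribution} \,\bigr\}.

Under this reading, at least a fraction \lambda_{\mathcal{M}}(P) of samples drawn from P is expected to conform to some desired model in \mathcal{M}; how the paper actually defines and computes this quantity (in particular for the class of exchangeable distributions) is specified in the body of the paper, not here.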