Paper title
Support Estimation with Sampling Artifacts and Errors
Paper authors
Paper abstract
The problem of estimating the support of a distribution is of great importance in many areas of machine learning, computer science, physics and biology. Most existing work in this domain has focused on settings that assume perfectly accurate sampling, an assumption that seldom holds in practical data science. Here we introduce the first known approach to support estimation in the presence of sampling artifacts and errors, where each sample is assumed to arise from a Poisson repeat channel that simultaneously captures repetitions and deletions of samples. The proposed estimator is based on regularized weighted Chebyshev approximations, with weights governed by evaluations of so-called Touchard (Bell) polynomials. The supports in the presence of sampling artifacts are calculated using discretized semi-infinite programming methods. The estimation approach is tested on synthetic and textual data, as well as on GISAID data collected to address a new problem in computational biology: mutational support estimation in genes of the SARS-CoV-2 virus. In the latter setting, the Poisson channel captures the fact that many individuals are tested multiple times for the presence of viral RNA, which leads to repeated samples, while other individuals' results are not recorded due to test errors. In all experiments performed, our integrated methods significantly outperformed suitably modified state-of-the-art noiseless support estimation methods.
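To make the sampling model concrete, the following is a minimal sketch of a Poisson repeat channel and of Touchard polynomial evaluation. It is not the paper's estimator. It assumes that each clean sample is independently replaced by a Poisson-distributed number of copies (zero copies acts as a deletion), and the names `poisson_repeat_channel`, `poisson_draw`, `touchard` and the rate parameter `lam` are hypothetical, chosen only for illustration. The Touchard polynomial is computed from its standard definition, T_n(x) = sum over k of S(n, k) x^k, with S(n, k) the Stirling numbers of the second kind; T_n(x) is also the n-th moment of a Poisson random variable with mean x.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poisson_repeat_channel(samples, lam, seed=0):
    """Replace each sample by Poisson(lam) copies; zero copies deletes the sample.

    Assumption for illustration: copies are drawn independently per sample
    with a single shared rate lam.
    """
    rng = random.Random(seed)
    observed = []
    for sample in samples:
        observed.extend([sample] * poisson_draw(lam, rng))
    return observed


def touchard(n, x):
    """Evaluate the Touchard (Bell) polynomial T_n(x) = sum_k S(n, k) * x**k."""
    row = [1]  # Stirling numbers of the second kind for n = 0: S(0, 0) = 1
    for m in range(1, n + 1):
        new_row = [0] * (m + 1)
        for k in range(1, m + 1):
            above = row[k] if k < len(row) else 0
            # Recurrence: S(m, k) = k * S(m-1, k) + S(m-1, k-1)
            new_row[k] = k * above + row[k - 1]
        row = new_row
    return sum(s * x**k for k, s in enumerate(row))


if __name__ == "__main__":
    clean = ["A", "B", "C", "D", "E"]
    noisy = poisson_repeat_channel(clean, lam=1.0, seed=1)
    print("observed samples:", noisy)
    print("distinct symbols seen:", len(set(noisy)), "of", len(set(clean)))
    print("Bell numbers T_n(1):", [touchard(n, 1) for n in range(6)])
```

Counting the distinct symbols in the channel output shows why a naive estimate is biased under this model: deleted samples hide part of the support, while repeated samples inflate the apparent sample size without adding information.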