Paper Title

PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research

Authors

Yuzhou Jiang, Tianxi Ji, Erman Ayday

Abstract

Although genomic research has grown increasingly popular in recent years, dataset sharing remains limited because of privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by using optimal transport to adjust the minor allele frequency (MAF) values in the noisy dataset so that they align with the published MAFs. Finally, we convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. We show that our proposed scheme outperforms all existing methods in detecting GWAS outcome errors, achieves better data utility, and provides higher privacy protection against membership inference attacks (MIAs). By adopting our method, genomic researchers will be inclined to share differentially private datasets while maintaining high data quality for the reproducibility of their findings.
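To make the pipeline in the abstract more concrete, here is a minimal Python sketch of the encode, perturb, and decode steps, plus a MAF calculation. It is an illustration under stated assumptions and not the authors' implementation. The two-bit genotype encoding, the independent per-bit flip probability `flip_prob`, and all function names are hypothetical. The paper's XOR mechanism incorporates biological characteristics, which this independent-flip simplification does not model. The optimal-transport MAF adjustment is not shown.

```python
import random

# Hypothetical encoding: genotype = number of minor alleles (0, 1, or 2),
# mapped to two bits so that the bit sum recovers the genotype.
ENCODE = {0: (0, 0), 1: (0, 1), 2: (1, 1)}


def encode(dataset):
    """Flatten each individual's genotypes into a row of bits."""
    return [[bit for g in row for bit in ENCODE[g]] for row in dataset]


def decode(bits):
    """Map each pair of bits back to a genotype by summing the pair."""
    return [[row[i] + row[i + 1] for i in range(0, len(row), 2)] for row in bits]


def xor_noise(bits, flip_prob, rng):
    """XOR every bit with an independent Bernoulli(flip_prob) noise bit.

    Assumption: independent flips stand in for the paper's XOR mechanism.
    """
    return [[b ^ (1 if rng.random() < flip_prob else 0) for b in row] for row in bits]


def maf(dataset):
    """Per-SNP minor allele frequency: minor allele count / (2 * individuals)."""
    n = len(dataset)
    return [sum(row[j] for row in dataset) / (2 * n) for j in range(len(dataset[0]))]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[0, 2], [1, 2], [0, 0]]  # 3 individuals, 2 SNPs (toy data)
    noisy = decode(xor_noise(encode(data), flip_prob=0.1, rng=rng))
    print("original MAF:", maf(data))
    print("noisy MAF:   ", maf(noisy))
```

Comparing the two printed MAF vectors shows how bit-level noise can shift allele frequencies. That shift is the gap the second stage described in the abstract aims to close by aligning the noisy dataset with the published MAFs.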
