Paper Title


Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV

Paper Authors

Kuo, Nicholas I-Hsien, Garcia, Federico, Sönnerborg, Anders, Zazzi, Maurizio, Böhm, Michael, Kaiser, Rolf, Polizzotto, Mark, Jorm, Louisa, Barbieri, Sebastiano

Paper Abstract


Clinical data usually cannot be freely distributed due to their highly confidential nature and this hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is by generating realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse thus creating outputs of low diversity. This lowers the quality of the synthetic healthcare data, and may cause it to omit patients of minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory to replay latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk, and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.
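The abstract's central mechanism is an external memory that stores latent features encoded from real samples (e.g., by the VAE encoder) and replays them to the GAN generator, steering it back toward under-represented modes. A minimal, library-free sketch of such a replay buffer is below; the class name, capacity, and vector shapes are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import deque


class LatentReplayMemory:
    """Fixed-size buffer of latent feature vectors observed from real
    samples, replayed to a generator. Hypothetical sketch only."""

    def __init__(self, capacity: int = 1000, seed: int = 0):
        # deque with maxlen evicts the oldest entries automatically
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, latent_vector):
        """Record a latent feature vector encoded from a real sample."""
        self.buffer.append(tuple(latent_vector))

    def replay(self, batch_size: int):
        """Draw a random batch of stored latent features to condition
        the generator on, exposing it to rarer modes of the data."""
        return self.rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


memory = LatentReplayMemory(capacity=4)
for i in range(6):
    memory.store([float(i), float(i) * 0.5])  # only the newest 4 are kept
batch = memory.replay(2)
```

In a full training loop, vectors sampled via `replay` would be fed to the generator alongside random noise; the bounded capacity keeps the memory focused on recently observed latent features.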
