Paper title
Sequential Models in the Synthetic Data Vault
Paper authors
Paper abstract
The goal of this paper is to describe a system for generating synthetic sequential data within the Synthetic Data Vault (SDV). To achieve this, we present the sequential model currently in SDV, an end-to-end framework that builds a generative model for multi-sequence, real-world data. This includes a novel neural network-based machine learning model, the Conditional Probabilistic Auto-Regressive (CPAR) model. The overall system and the model are available in the open-source SDV library (https://github.com/sdv-dev/SDV), along with a variety of other models for different synthetic data needs. After building the Sequential SDV, we used it to generate synthetic data and compared its quality against CTGAN, an existing non-sequential model based on generative adversarial networks. To compare the sequential synthetic data against its real counterpart, we invented a new metric called Multi-Sequence Aggregate Similarity (MSAS). Using this metric, we conclude that our Sequential SDV model learns higher-level patterns than non-sequential models without any trade-off in synthetic data quality.
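The abstract names MSAS without defining it. As a rough illustration only, one plausible reading is that each sequence is reduced to a single summary statistic and the distribution of those summaries is then compared between real and synthetic data. The sketch below follows that reading. The function names, the default choice of the mean as the statistic, and the use of one minus the Kolmogorov-Smirnov statistic as the similarity score are all assumptions made for illustration, not the paper's definition or the SDV library's API.

```python
from bisect import bisect_right
from statistics import mean


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def aggregate_similarity(real_sequences, synthetic_sequences, statistic=mean):
    """Hypothetical MSAS-style score in [0, 1]; 1 means identical summary distributions.

    Assumption: each sequence is a non-empty list of numbers for one column.
    """
    real_stats = [statistic(seq) for seq in real_sequences]
    synthetic_stats = [statistic(seq) for seq in synthetic_sequences]
    return 1.0 - ks_statistic(real_stats, synthetic_stats)


# Toy data: three real sequences and two candidate synthetic sets.
real = [[1, 2, 3], [2, 3, 4], [5, 5, 5]]
close = [[1, 2, 3], [2, 3, 4], [5, 5, 5]]
far = [[10, 11, 12], [11, 12, 13], [20, 20, 20]]

score_close = aggregate_similarity(real, close)
score_far = aggregate_similarity(real, far)
```

Under these assumptions, a synthetic set whose per-sequence means match the real ones scores 1.0, while a set whose means lie entirely outside the real range scores 0.0. A different statistic (for example `max` or `statistics.median`) can be passed to probe other sequence-level patterns.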