Paper Title

Generative Modeling of Complex Data

Authors

Luca Canale, Nicolas Grislain, Grégoire Lothe, Johan Leduc

Abstract

In recent years, several models have improved the capacity to generate synthetic tabular datasets. However, such models focus on synthesizing simple columnar tables and are not usable on real-life data with complex structures. This paper puts forward a generic framework to synthesize more complex data structures with composite and nested types. It then proposes one practical implementation, built with causal transformers, for structs (mappings of types) and lists (repeated instances of a type). The results on standard benchmark datasets show that this implementation consistently outperforms current state-of-the-art models in terms of both machine learning utility and statistical similarity. Moreover, it shows very strong results on two complex hierarchical datasets with multiple levels of nesting and sparse data, which were previously out of reach.
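
To make the terminology concrete, here is a minimal sketch of a record that combines a struct (a mapping of field names to typed values) with a list (repeated instances of one type), together with one hypothetical way to linearize it into a flat token sequence of the kind a causal, left-to-right model could consume. The example record, the `flatten` helper, and the marker tokens are all illustrative assumptions; the abstract does not describe the paper's actual encoding.

```python
from typing import Any, List


def flatten(value: Any) -> List[str]:
    """Linearize a nested value into a flat token list (illustrative only)."""
    if isinstance(value, dict):  # struct: mapping of field names to values
        tokens = ["<struct>"]
        for key in sorted(value):  # fixed field order keeps sequences comparable
            tokens.append(f"key:{key}")
            tokens.extend(flatten(value[key]))
        tokens.append("</struct>")
        return tokens
    if isinstance(value, list):  # list: repeated instances of one type
        tokens = ["<list>"]
        for item in value:
            tokens.extend(flatten(item))
        tokens.append("</list>")
        return tokens
    if value is None:  # sparse data: a missing value gets its own token
        return ["<null>"]
    return [f"val:{value}"]


# Hypothetical hierarchical record: a user struct holding a list of order structs.
record = {
    "user": "u1",
    "orders": [
        {"item": "book", "qty": 2},
        {"item": "pen", "qty": None},
    ],
}

tokens = flatten(record)
print(tokens)
```

Nesting depth and list length vary per record, which is what a fixed-width columnar table cannot represent directly; a sequence representation like this one sidesteps that limit.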
