Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

Paper title

Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

Authors

Simao Eduardo, Kai Xu, Alfredo Nazabal, Charles Sutton

Abstract

Data cleaning often comprises outlier detection and data repair. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, e.g. specific image pixels being set to default values or watermarks. Consequently, models with enough capacity easily overfit to these errors, making detection and repair difficult. Seeing as a systematic outlier is a combination of patterns of a clean instance and systematic error patterns, our main insight is that inliers can be modelled by a smaller representation (subspace) in a model than outliers. By exploiting this, we propose Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors. The main idea is to partition the latent space and model inlier and outlier patterns separately. CLSVAE is effective with much less labelled data compared to previous related models, often with less than 2% of the data. We provide experiments using three image datasets in scenarios with different levels of corruption and labelled set sizes, comparing to relevant baselines. CLSVAE provides superior repairs without human intervention, e.g. with just 0.25% of labelled data we see a relative error decrease of 58% compared to the closest baseline.
