A unified framework for dataset shift diagnostics

Paper title

A unified framework for dataset shift diagnostics

Authors

Polo, Felipe Maia; Izbicki, Rafael; Lacerda Jr, Evanildo Gomes; Ibieta-Jimenez, Juan Pablo; Vicente, Renato

Abstract

Supervised learning techniques typically assume training data originates from the target population. Yet, in reality, dataset shift frequently arises, which, if not adequately taken into account, may decrease the performance of their predictors. In this work, we propose a novel and flexible framework called DetectShift that quantifies and tests for multiple dataset shifts, encompassing shifts in the distributions of $(X, Y)$, $X$, $Y$, $X|Y$, and $Y|X$. DetectShift equips practitioners with insights into data shifts, facilitating the adaptation or retraining of predictors using both source and target data. This proves extremely valuable when labeled samples in the target domain are limited. The framework utilizes test statistics with the same nature to quantify the magnitude of the various shifts, making results more interpretable. It is versatile, suitable for regression and classification tasks, and accommodates diverse data forms - tabular, text, or image. Experimental results demonstrate the effectiveness of DetectShift in detecting dataset shifts even in higher dimensions.
