DIWIFT: Discovering Instance-wise Influential Features for Tabular Data

Paper title

DIWIFT: Discovering Instance-wise Influential Features for Tabular Data

Authors

Liu, Dugang; Cheng, Pengxiang; Zhu, Hong; Tang, Xing; Chen, Yanyu; Wang, Xiaoting; Pan, Weike; Ming, Zhong; He, Xiuqiang

Abstract

Tabular data is one of the most common data storage formats behind many real-world web applications such as retail, banking, and e-commerce. The success of these web applications largely depends on the ability of the employed machine learning model to accurately distinguish influential features from all the predetermined features in tabular data. Intuitively, in practical business scenarios, different instances should correspond to different sets of influential features, and the set of influential features of the same instance may vary across scenarios. However, most existing methods focus on global feature selection, which assumes that all instances share the same set of influential features, and the few methods that do consider instance-wise feature selection ignore the variability of influential features across scenarios. In this paper, we first introduce a new perspective on instance-wise feature selection based on the influence function and give some corresponding theoretical insights, the core of which is to use the influence function as an indicator of the importance of an instance-wise feature. We then propose a new solution for discovering instance-wise influential features in tabular data (DIWIFT), in which a self-attention network serves as the feature selection model and the value of the corresponding influence function serves as the optimization objective that guides the model. Because the computation of the influence function does not depend on a specific architecture and can also take into account the data distribution in different scenarios, DIWIFT has better flexibility and robustness. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of DIWIFT.
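
The abstract relies on the influence function as its importance signal. As background only, the sketch below computes the classical "upweighting" influence of each training instance on one test loss for an L2-regularized logistic regression. It is not DIWIFT's instance-wise feature formulation, which the abstract does not spell out. The function names (`fit_logreg`, `influence_on_test_loss`), the synthetic data, and the regularization strength `lam` are assumptions made for illustration.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def fit_logreg(X, y, lam=1e-2, steps=50):
    """Fit L2-regularized logistic regression with plain Newton steps."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / n + lam * theta
        hess = (X.T * (p * (1.0 - p))) @ X / n + lam * np.eye(d)
        theta -= np.linalg.solve(hess, grad)
    return theta


def influence_on_test_loss(X, y, theta, x_test, y_test, lam=1e-2):
    """Classical upweighting influence of every training point on one test loss.

    Entry i is -grad_test^T H^{-1} grad_i, where H is the Hessian of the
    regularized training objective at theta and the gradients are those of
    the unregularized per-sample log loss.
    """
    n, d = X.shape
    p = sigmoid(X @ theta)
    hess = (X.T * (p * (1.0 - p))) @ X / n + lam * np.eye(d)
    grad_test = (sigmoid(x_test @ theta) - y_test) * x_test  # shape (d,)
    grad_train = (p - y)[:, None] * X                        # shape (n, d)
    return -grad_train @ np.linalg.solve(hess, grad_test)    # shape (n,)


# Hypothetical synthetic data, used only to exercise the functions above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)

theta = fit_logreg(X, y)
scores = influence_on_test_loss(X, y, theta, x_test=X[0], y_test=y[0])
```

Under this convention, a negative score means that upweighting the training instance is predicted to lower the test loss, and a positive score means it is predicted to raise it. The computation needs only gradients and a Hessian of the loss, which is consistent with the abstract's remark that the influence function does not depend on a specific architecture.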
