Federated XGBoost on Sample-Wise Non-IID Data

Paper Title

Federated XGBoost on Sample-Wise Non-IID Data

Authors

Jones, Katelinh, Ong, Yuya Jeremy, Zhou, Yi, Baracaldo, Nathalie

Abstract

Federated Learning (FL) is a paradigm for jointly training machine learning algorithms in a decentralized manner. It allows parties to communicate with an aggregator to create and train a model without exposing the underlying raw data distribution of the local parties involved in the training process. Most research in FL has focused on neural network-based approaches. Tree-based methods such as XGBoost, however, have been underexplored in Federated Learning because of the challenges posed by the iterative and additive characteristics of the algorithm. Decision tree-based models, in particular XGBoost, can handle non-IID data, which is significant for algorithms used in Federated Learning frameworks, since the data is decentralized and is at risk of being non-IID by nature. In this paper, we investigate how Federated XGBoost is affected by non-IID distributions by running experiments on various sample size-based data skew scenarios and examining how the models perform under each of them. We conduct an extensive set of experiments across multiple datasets and data skew partitions. Our experimental results demonstrate that, despite the varying partition ratios, the performance of the models stayed consistent and was close to or on par with that of models trained in a centralized manner.
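To make the idea of a sample size-based data skew more concrete, the following is a minimal sketch of how a dataset's sample indices might be divided among parties according to a partition ratio. It is an illustration only and not the authors' experimental code: the function name `partition_by_sample_ratio`, the example ratio 70/20/10, and the choice of a uniform random shuffle before splitting are all assumptions introduced here.

```python
import random


def partition_by_sample_ratio(n_samples, ratios, seed=0):
    """Split sample indices among parties with unequal sizes.

    Hypothetical helper: each party receives a random subset whose size
    is proportional to its entry in `ratios`.
    """
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")

    total = float(sum(ratios))
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    parts, start, cumulative = [], 0, 0.0
    for r in ratios[:-1]:
        cumulative += r / total
        end = round(cumulative * n_samples)  # boundary of this party's slice
        parts.append(indices[start:end])
        start = end
    parts.append(indices[start:])  # last party takes whatever remains
    return parts


# Example: three parties holding roughly 70%, 20%, and 10% of 1000 samples.
parties = partition_by_sample_ratio(1000, [0.7, 0.2, 0.1])
party_sizes = [len(p) for p in parties]
```

Each party would then train on only its own slice, and varying the ratio list gives a family of skew scenarios that can be compared against a model trained on the full, centralized dataset.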
