Paper Title

FedKL: Tackling Data Heterogeneity in Federated Reinforcement Learning by Penalizing KL Divergence

Authors

Zhijie Xie, S. H. Song

Abstract

As a distributed learning paradigm, Federated Learning (FL) faces a communication bottleneck due to the many rounds of model synchronization and aggregation. Heterogeneous data further deteriorates the situation by causing slow convergence. Although the impact of data heterogeneity on supervised FL has been widely studied, the related investigation for Federated Reinforcement Learning (FRL) is still in its infancy. In this paper, we first define the type and level of data heterogeneity for policy-gradient-based FRL systems. By inspecting the connection between the global and local objective functions, we prove that local training can benefit the global objective if the local update is properly penalized by the total variation (TV) distance between the local and global policies. A necessary condition for the global policy to be learnable from the local policy is also derived, which is directly related to the heterogeneity level. Based on the theoretical results, a Kullback-Leibler (KL) divergence based penalty is proposed, which, unlike the conventional approach of penalizing model divergence in the parameter space, directly constrains the model outputs in the distribution space. A convergence proof for the proposed algorithm is also provided. By jointly penalizing the divergence of the local policy from the global policy with a global penalty and constraining each iteration of the local training with a local penalty, the proposed method achieves a better trade-off between training speed (step size) and convergence. Experimental results on two popular Reinforcement Learning (RL) platforms demonstrate the advantage of the proposed algorithm over existing methods in accelerating and stabilizing the training process with heterogeneous data.
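
To make the distribution-space penalty concrete, the following is a minimal sketch of what a KL-penalized local objective could look like. The notation is illustrative rather than taken from the paper: $J_k$ denotes client $k$'s local objective, $\pi_\theta$ the local policy being trained, $\pi_g$ the latest global policy, $\rho$ a state distribution, and $\beta$ a penalty coefficient; the direction of the KL term follows the usual trust-region convention and may differ from the paper's exact formulation.

$$
\tilde{J}_k(\theta) \;=\; J_k(\theta) \;-\; \beta\, \mathbb{E}_{s \sim \rho}\!\left[\, D_{\mathrm{KL}}\!\left( \pi_{g}(\cdot \mid s)\,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right]
$$

Here the first term rewards improvement on the client's own environment, while the KL term keeps the local policy's action distributions close to the global policy's, reflecting the trade-off between training speed and convergence described in the abstract.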
