Paper Title
Stable and Efficient Shapley Value-Based Reward Reallocation for Multi-Agent Reinforcement Learning of Autonomous Vehicles
Paper Authors
Paper Abstract
With the development of sensing and communication technologies in networked cyber-physical systems (CPSs), multi-agent reinforcement learning (MARL)-based methodologies have been integrated into the control process of physical systems and demonstrate prominent performance in a wide array of CPS domains, such as connected autonomous vehicles (CAVs). However, it remains challenging to mathematically characterize the performance improvement that CAVs gain from communication and cooperation capabilities. Since each individual autonomous vehicle is originally self-interested, we cannot assume that all agents would cooperate naturally during the training process. In this work, we propose to reallocate the system's total reward efficiently to motivate stable cooperation among autonomous vehicles. We formally define and quantify how to reallocate the system's total reward to each agent under the proposed transferable utility game, such that communication-based cooperation among the agents increases the system's total reward. We prove that the Shapley value-based reward reallocation of MARL lies in the core if the transferable utility game is a convex game. Hence, the cooperation is stable and efficient, and the agents should stay in the coalition, i.e., the cooperating group. We then propose a cooperative policy learning algorithm with Shapley value reward reallocation. In experiments, compared with several algorithms from the literature, we show an improvement in the mean episode system reward of CAV systems using our proposed algorithm.
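The abstract's key game-theoretic claim, that the Shapley value allocation of a convex transferable utility (TU) game lies in the core, so the allocation is both efficient (the grand coalition's value is fully distributed) and stable (no sub-coalition gains by leaving), can be illustrated with a small sketch. The characteristic function below is a hypothetical toy game, not the paper's CAV reward model; the Shapley value is computed by its standard definition as the average marginal contribution over all arrival orders.

```python
from itertools import combinations, permutations
from math import factorial

def shapley_values(players, v):
    """Shapley value of each player in a TU game.

    players: list of hashable player ids.
    v: characteristic function mapping a frozenset coalition to its value.
    """
    players = list(players)
    phi = {p: 0.0 for p in players}
    # Average each player's marginal contribution over all arrival orders.
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    n_fact = factorial(len(players))
    return {p: phi[p] / n_fact for p in players}

# Toy convex (supermodular) game: v(S) = |S|^2, so marginal
# contributions grow as the coalition grows.
v = lambda S: len(S) ** 2
players = [0, 1, 2]
phi = shapley_values(players, v)  # symmetric game: each player gets 3.0

# Core check: efficiency (total payoff equals v of the grand coalition)
# and stability (no coalition can do better than its allocated share).
assert abs(sum(phi.values()) - v(frozenset(players))) < 1e-9
for k in range(1, len(players) + 1):
    for S in combinations(players, k):
        assert sum(phi[p] for p in S) >= v(frozenset(S)) - 1e-9
```

Because the toy game is convex, the assertions pass, mirroring the stability argument the abstract makes for the proposed reward reallocation.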