Scalable Model-based Policy Optimization for Decentralized Networked Systems

Paper Title

Scalable Model-based Policy Optimization for Decentralized Networked Systems

Authors

Yali Du, Chengdong Ma, Yuchen Liu, Runji Lin, Hao Dong, Jun Wang, Yaodong Yang

Abstract

Reinforcement learning algorithms require a large number of samples; this often limits their real-world applications on even simple tasks. The challenge is more pronounced in multi-agent tasks, as each step of operation is more costly, requiring communication, shifting, or resources. This work aims to improve the data efficiency of multi-agent control through model-based learning. We consider networked systems where agents are cooperative and communicate only locally with their neighbors, and propose the decentralized model-based policy optimization framework (DMPO). In our method, each agent learns a dynamics model to predict future states and broadcasts its predictions by communication, and the policies are then trained under the model rollouts. To alleviate the bias of model-generated data, we restrict model usage to generating myopic rollouts, thus reducing the compounding error of model generation. To preserve the independence of policy updates, we introduce an extended value function and theoretically prove that the resulting policy gradient is a close approximation to the true policy gradient. We evaluate our algorithm on several benchmarks for intelligent transportation systems: connected autonomous vehicle control tasks (Flow and CACC) and adaptive traffic signal control (ATSC). Empirical results show that our method achieves superior data efficiency and matches the performance of model-free methods using true models.
