Paper Title

Policy Optimization for Markov Games: Unified Framework and Faster Convergence

Paper Authors

Runyu Zhang, Qinghua Liu, Huan Wang, Caiming Xiong, Na Li, Yu Bai

Paper Abstract

This paper studies policy optimization algorithms for multi-agent reinforcement learning. We begin by proposing an algorithm framework for two-player zero-sum Markov Games in the full-information setting, where each iteration consists of a policy update step at each state using a certain matrix game algorithm, and a value update step with a certain learning rate. This framework unifies many existing and new policy optimization algorithms. We show that the state-wise average policy of this algorithm converges to an approximate Nash equilibrium (NE) of the game, as long as the matrix game algorithms achieve low weighted regret at each state, with respect to weights determined by the speed of the value updates. Next, we show that this framework instantiated with the Optimistic Follow-The-Regularized-Leader (OFTRL) algorithm at each state (and smooth value updates) can find an $\widetilde{\mathcal{O}}(T^{-5/6})$ approximate NE in $T$ iterations, and a similar algorithm with slightly modified value update rule achieves a faster $\widetilde{\mathcal{O}}(T^{-1})$ convergence rate. These improve over the current best $\widetilde{\mathcal{O}}(T^{-1/2})$ rate of symmetric policy optimization type algorithms. We also extend this algorithm to multi-player general-sum Markov Games and show an $\widetilde{\mathcal{O}}(T^{-3/4})$ convergence rate to Coarse Correlated Equilibria (CCE). Finally, we provide a numerical example to verify our theory and investigate the importance of smooth value updates, and find that using "eager" value updates instead (equivalent to the independent natural policy gradient algorithm) may significantly slow down the convergence, even on a simple game with $H=2$ layers.
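To make the framework described in the abstract concrete, below is a minimal Python sketch of one way such an iteration could look on a small random tabular zero-sum Markov game: an optimistic multiplicative-weights update (a simple instance of OFTRL with entropy regularization) at every state, followed by a smooth incremental value update. The discounted formulation, the step sizes `eta` and `alpha_t`, and all names (`Q`, `V`, `x`, `y`) are illustrative assumptions, not the authors' exact algorithm or code.

```python
import numpy as np

# Illustrative sketch only: a unified policy-optimization loop on a random
# tabular two-player zero-sum Markov game.  Each iteration performs
#   (1) a per-state matrix-game policy update (optimistic MWU, an OFTRL instance),
#   (2) a "smooth" value update with learning rate alpha_t.
# The discounted setting and all constants here are assumptions for illustration.

S, A, B = 4, 3, 3                 # number of states / actions of the two players
gamma, eta, T = 0.9, 0.1, 2000    # discount, policy step size, iterations

rng = np.random.default_rng(0)
r = rng.uniform(size=(S, A, B))                   # reward to the max-player
P = rng.dirichlet(np.ones(S), size=(S, A, B))     # transitions P(s' | s, a, b)

x = np.full((S, A), 1.0 / A)      # max-player policy at each state
y = np.full((S, B), 1.0 / B)      # min-player policy at each state
V = np.zeros(S)                   # value estimates that define the Q-matrices
gx_prev = np.zeros((S, A))        # last gradients, used as the optimistic predictions
gy_prev = np.zeros((S, B))

for t in range(1, T + 1):
    Q = r + gamma * P @ V         # matrix game at each state, shape (S, A, B)

    for s in range(S):
        gx = Q[s] @ y[s]          # payoff vector seen by the max-player at s
        gy = x[s] @ Q[s]          # loss vector seen by the min-player at s
        # Optimistic MWU: x <- x * exp(eta * (2 * g_t - g_{t-1})), then normalize.
        x[s] *= np.exp(eta * (2 * gx - gx_prev[s]))
        x[s] /= x[s].sum()
        y[s] *= np.exp(-eta * (2 * gy - gy_prev[s]))
        y[s] /= y[s].sum()
        gx_prev[s], gy_prev[s] = gx, gy

    # Smooth value update: move V only partway toward the current game value
    # x(s)^T Q(s) y(s), rather than replacing it outright.
    alpha_t = 2.0 / (t + 1)
    V = (1 - alpha_t) * V + alpha_t * np.einsum('sa,sab,sb->s', x, Q, y)
```

Setting `alpha_t = 1` in this sketch would correspond to the "eager" value update discussed in the abstract, which the authors find can significantly slow down convergence.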
