Paper Title

Trust Region Bounds for Decentralized PPO Under Non-stationarity

Paper Authors

Mingfei Sun, Sam Devlin, Jacob Beck, Katja Hofmann, Shimon Whiteson

Paper Abstract

We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, which both rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and tuning the hyperparameters with regard to the number of agents, as predicted by our theoretical analysis.
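
To make the clipping mechanism described above concrete, here is a minimal Python sketch of a decentralized PPO surrogate objective in which each agent's probability ratio is clipped independently, with a per-agent clip range that shrinks as the number of agents grows. The function names and the 1/n_agents scaling of the clip range are purely illustrative assumptions for this sketch, not the paper's exact bound or implementation.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, eps):
    """Standard PPO clipped surrogate for one agent's independent ratio."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

def decentralized_ppo_objective(per_agent_ratios, advantages, n_agents, eps_joint=0.2):
    """Sum of per-agent clipped surrogates.

    Each agent's ratio comes from its own decentralized policy; the clip
    range is tightened with the number of agents so that the product of the
    independent ratios (the implicit joint ratio) stays within a trust
    region. The 1/n_agents scaling below is an illustrative choice only.
    """
    eps_i = eps_joint / n_agents  # hypothetical per-agent clip range
    return sum(clipped_surrogate(r, advantages, eps_i) for r in per_agent_ratios)

# Toy usage: 3 agents, 5 samples sharing a centralized advantage estimate.
rng = np.random.default_rng(0)
ratios = [np.exp(rng.normal(0.0, 0.1, size=5)) for _ in range(3)]
adv = rng.normal(0.0, 1.0, size=5)
print(decentralized_ppo_objective(ratios, adv, n_agents=3))
```

The point of the sketch is the interaction highlighted by the analysis: because the effective joint ratio is a product of per-agent ratios, the clip range used for each agent must be adjusted with the number of agents for the overall trust region constraint to remain enforced.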
