Paper Title

Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games

Paper Authors

Stephen McAleer, JB Lanier, Kevin Wang, Pierre Baldi, Roy Fox, Tuomas Sandholm

Paper Abstract

In competitive two-agent environments, deep reinforcement learning (RL) methods based on the \emph{Double Oracle (DO)} algorithm, such as \emph{Policy Space Response Oracles (PSRO)} and \emph{Anytime PSRO (APSRO)}, iteratively add RL best response policies to a population. Eventually, an optimal mixture of these population policies will approximate a Nash equilibrium. However, these methods might need to add all deterministic policies before converging. In this work, we introduce \emph{Self-Play PSRO (SP-PSRO)}, a method that adds an approximately optimal stochastic policy to the population in each iteration. Instead of adding only deterministic best responses to the opponent's least exploitable population mixture, SP-PSRO also learns an approximately optimal stochastic policy and adds it to the population as well. As a result, SP-PSRO empirically tends to converge much faster than APSRO and in many games converges in just a few iterations.
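
For a concrete picture of the loop the abstract describes, below is a minimal Python sketch of a PSRO-style outer loop with the SP-PSRO addition of a learned stochastic policy. The helper callables (`solve_restricted_game`, `train_best_response`, `train_new_stochastic_policy`) and the function signature are hypothetical placeholders standing in for the restricted-game meta-solver and RL oracles; they are not the authors' actual API, and this is a sketch rather than the paper's implementation.

```python
# Minimal sketch of the SP-PSRO outer loop described in the abstract.
# The oracle/meta-solver callables are hypothetical placeholders, not the
# authors' implementation: supply your own RL training code and
# restricted-game Nash solver.

def sp_psro(game, solve_restricted_game, train_best_response,
            train_new_stochastic_policy, initial_policies, num_iterations):
    """Return approximate-Nash mixtures over the grown populations.

    game                        -- a two-player zero-sum environment
    solve_restricted_game       -- meta-solver: populations -> per-player mixtures
    train_best_response         -- RL oracle: (player, opponent mixture) -> policy
    train_new_stochastic_policy -- SP-PSRO oracle: (player, opponent mixture)
                                   -> approximately optimal stochastic policy
    """
    populations = {p: [initial_policies[p]] for p in (0, 1)}

    for _ in range(num_iterations):
        # As in PSRO/APSRO: find the least-exploitable mixture over each
        # player's current population (a restricted-game equilibrium).
        mixtures = solve_restricted_game(game, populations)

        for player in (0, 1):
            opponent = 1 - player

            # Standard DO/PSRO step: a (typically deterministic) RL best
            # response to the opponent's restricted-game mixture.
            br = train_best_response(game, player, mixtures[opponent])

            # SP-PSRO addition: also learn an approximately optimal
            # stochastic policy and add it in the same iteration.
            new_policy = train_new_stochastic_policy(game, player,
                                                     mixtures[opponent])

            populations[player].extend([br, new_policy])

    # The final mixtures over the enlarged populations approximate a
    # Nash equilibrium of the full game.
    return solve_restricted_game(game, populations), populations
```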
