Paper Title

Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games

Paper Authors

Stephen McAleer, JB Lanier, Kevin Wang, Pierre Baldi, Roy Fox, Tuomas Sandholm

Paper Abstract

In competitive two-agent environments, deep reinforcement learning (RL) methods based on the \emph{Double Oracle (DO)} algorithm, such as \emph{Policy Space Response Oracles (PSRO)} and \emph{Anytime PSRO (APSRO)}, iteratively add RL best response policies to a population. Eventually, an optimal mixture of these population policies will approximate a Nash equilibrium. However, these methods might need to add all deterministic policies before converging. In this work, we introduce \emph{Self-Play PSRO (SP-PSRO)}, a method that adds an approximately optimal stochastic policy to the population in each iteration. Instead of adding only deterministic best responses to the opponent's least exploitable population mixture, SP-PSRO also learns an approximately optimal stochastic policy and adds it to the population as well. As a result, SP-PSRO empirically tends to converge much faster than APSRO and in many games converges in just a few iterations.
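
For a concrete picture of the loop the abstract describes, below is a minimal Python sketch of a PSRO-style outer loop with the SP-PSRO addition of a learned stochastic policy. The helper callables (`solve_restricted_game`, `train_best_response`, `train_new_stochastic_policy`) and the function signature are hypothetical placeholders standing in for the restricted-game meta-solver and RL oracles; they are not the authors' actual API, and this is a sketch rather than the paper's implementation.

```python
# Minimal sketch of the SP-PSRO outer loop described in the abstract.
# The oracle/meta-solver callables are hypothetical placeholders, not the
# authors' implementation: supply your own RL training code and
# restricted-game Nash solver.

def sp_psro(game, solve_restricted_game, train_best_response,
            train_new_stochastic_policy, initial_policies, num_iterations):
    """Return approximate-Nash mixtures over the grown populations.

    game                        -- a two-player zero-sum environment
    solve_restricted_game       -- meta-solver: populations -> per-player mixtures
    train_best_response         -- RL oracle: (player, opponent mixture) -> policy
    train_new_stochastic_policy -- SP-PSRO oracle: (player, opponent mixture)
                                   -> approximately optimal stochastic policy
    """
    populations = {p: [initial_policies[p]] for p in (0, 1)}

    for _ in range(num_iterations):
        # As in PSRO/APSRO: find the least-exploitable mixture over each
        # player's current population (a restricted-game equilibrium).
        mixtures = solve_restricted_game(game, populations)

        for player in (0, 1):
            opponent = 1 - player

            # Standard DO/PSRO step: a (typically deterministic) RL best
            # response to the opponent's restricted-game mixture.
            br = train_best_response(game, player, mixtures[opponent])

            # SP-PSRO addition: also learn an approximately optimal
            # stochastic policy and add it in the same iteration.
            new_policy = train_new_stochastic_policy(game, player,
                                                     mixtures[opponent])

            populations[player].extend([br, new_policy])

    # The final mixtures over the enlarged populations approximate a
    # Nash equilibrium of the full game.
    return solve_restricted_game(game, populations), populations
```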
