Paper Title

Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games

Authors

Stephen McAleer, John Lanier, Roy Fox, Pierre Baldi

Abstract

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect-information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots. Experiment code is available at https://github.com/JBLanier/pipeline-psro.
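The pipeline mechanism described in the abstract — a hierarchy of workers where each one trains only against the policies below it, and a finished policy is frozen while a new worker enters at the top — can be sketched as a toy. This is a conceptual illustration only, not the authors' implementation: `ToyPolicy`, the scalar "strength", and the plateau check are hypothetical stand-ins for a real reinforcement learning policy and best-response training.

```python
# Conceptual sketch of the pipeline idea behind P2SRO (not the authors'
# implementation): keep a hierarchy of active policies, each training only
# against fixed policies and the active policies below it; when the lowest
# active policy plateaus, freeze it and start a new policy at the top.

class ToyPolicy:
    def __init__(self):
        self.strength = 0.0  # stand-in for how well the policy plays
        self.steps = 0       # number of training updates it has received

    def train_against(self, opponents):
        # Pretend a best-response update moves strength toward slightly
        # above the strongest opponent seen so far.
        target = max((p.strength for p in opponents), default=0.0) + 1.0
        self.strength += 0.5 * (target - self.strength)
        self.steps += 1


def pipeline_psro(num_workers=3, iterations=20, plateau_steps=5):
    fixed = []                                    # frozen, converged policies
    active = [ToyPolicy() for _ in range(num_workers)]
    for _ in range(iterations):
        for i, policy in enumerate(active):
            # Each worker trains against all fixed policies plus the
            # active policies at lower levels of the hierarchy.
            policy.train_against(fixed + active[:i])
        if active[0].steps >= plateau_steps:      # crude plateau check
            fixed.append(active.pop(0))           # freeze the lowest level
            active.append(ToyPolicy())            # new worker enters on top
    return fixed, active
```

Because every worker always has a full complement of opponents below it, no worker ever sits idle waiting for a previous iteration to finish — this is the source of the speedup over sequential PSRO that the abstract claims.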
