Paper Title
The Advantage Regret-Matching Actor-Critic
Paper Authors
Paper Abstract
Regret minimization has played a key role in online learning, equilibrium computation in games, and reinforcement learning (RL). In this paper, we describe a general model-free RL method for no-regret learning based on repeated reconsideration of past behavior. We propose a model-free RL algorithm, the Advantage Regret-Matching Actor-Critic (ARMAC): rather than saving past state-action data, ARMAC saves a buffer of past policies, replaying through them to reconstruct hindsight assessments of past behavior. These retrospective value estimates are used to predict conditional advantages which, combined with regret matching, produce a new policy. In particular, ARMAC learns from sampled trajectories in a centralized training setting, without requiring the application of importance sampling commonly used in Monte Carlo counterfactual regret (CFR) minimization; hence, it does not suffer from excessive variance in large environments. In the single-agent setting, ARMAC shows an interesting form of exploration by keeping past policies intact. In the multiagent setting, ARMAC in self-play approaches Nash equilibria on some partially-observable zero-sum benchmarks. We provide exploitability estimates in the significantly larger game of betting-abstracted no-limit Texas Hold'em.
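The policy-improvement step described in the abstract maps predicted conditional advantages to a new policy via regret matching, which plays each action with probability proportional to its positive advantage and falls back to the uniform policy when no advantage is positive. The sketch below illustrates only that mapping; it is not the authors' implementation, and the function name and NumPy usage are illustrative assumptions.

```python
import numpy as np

def regret_matching_policy(advantages: np.ndarray) -> np.ndarray:
    """Map per-action advantage (regret-like) estimates to a policy.

    Probabilities are proportional to the positive part of each advantage;
    if every advantage is non-positive, return the uniform policy.
    """
    positive = np.maximum(advantages, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(advantages), 1.0 / len(advantages))

# Hypothetical advantage estimates at one information state:
print(regret_matching_policy(np.array([0.4, -0.1, 0.1])))  # -> [0.8 0.  0.2]
```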