Paper Title
Provable Self-Play Algorithms for Competitive Reinforcement Learning
Paper Authors
Paper Abstract
Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and show that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, where the regret is measured by the agent's performance against a \emph{fully adversarial} opponent who can exploit the agent's strategy at \emph{any} step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.
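To make the regret notion concrete, the following display is a minimal sketch (our own notation and assumptions, not necessarily the paper's exact definition) for an episodic zero-sum Markov game played over $K$ episodes (so $T$ scales with $K$), where $\mu_k$ denotes the learner's (max-player's) policy in episode $k$, $s_1$ the initial state, $V_1^{\mu,\nu}(s_1)$ the value of the game under policies $(\mu,\nu)$, and $V_1^{\star}(s_1)$ the Nash (minimax) value:

$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big[\,V_1^{\star}(s_1)\;-\;\inf_{\nu}\,V_1^{\mu_k,\nu}(s_1)\,\Big].$$

Taking the infimum over the opponent's policy $\nu$ within each episode models the fully adversarial opponent described above, who may exploit the agent's current strategy at any step; a $\tilde{\mathcal{O}}(\sqrt{T})$ bound then says this cumulative exploitability grows only sublinearly in the total number of steps $T$.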