Paper Title
Near-Optimal Reinforcement Learning with Self-Play
Paper Authors
Paper Abstract
This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.
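To make the gap explicit, the bounds quoted in the abstract can be collected into a single display. As in the abstract, only the dependence on $(S, A, B)$ is shown; logarithmic factors and the dependence on the episode length are suppressed.

\begin{align*}
  \text{previous best upper bound (approximate Nash equilibrium):} \quad & \tilde{\mathcal{O}}\bigl(S^2 A B\bigr) \\
  \text{optimistic Nash Q-learning (this paper):} \quad & \tilde{\mathcal{O}}\bigl(S A B\bigr) \\
  \text{Nash V-learning (this paper):} \quad & \tilde{\mathcal{O}}\bigl(S (A + B)\bigr) \\
  \text{information-theoretic lower bound:} \quad & \Omega\bigl(S (A + B)\bigr)
\end{align*}

The display shows why Nash V-learning is called near-optimal: its $S(A+B)$ scaling matches the lower bound in every problem-dependent parameter, whereas the earlier $S^2 A B$ bound is larger by a factor of $S$ and by the product (rather than sum) of the two players' action-space sizes.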