Paper Title

Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Paper Authors

Zhang, Kaiqing, Kakade, Sham M., Başar, Tamer, Yang, Lin F.

Paper Abstract

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases, and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S$, $A$, and $B$ denote the state space and the action spaces of the two agents, respectively. We further show that this sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, i.e., it queries state-transition samples without knowledge of the rewards, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, which admits a $\tilde{\Omega}(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound and in which this model-based approach is near-optimal, with a gap only in the $|A|,|B|$ dependence. Our results not only demonstrate the sample efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and its limitation (being less adaptive and suboptimal in the $|A|,|B|$ dependence), a tradeoff that particularly arises in the multi-agent context.
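
To make the setting concrete, below is a minimal sketch (not the paper's exact algorithm or code) of the basic model-based pipeline the abstract refers to: draw a fixed number of next-state samples from the generative model for every triple $(s,a,b)$, form the empirical transition model, and then run minimax (Shapley) value iteration on it, solving a zero-sum matrix game at each state via linear programming. The function names (`empirical_model`, `minimax_value_iteration`, `matrix_game_value`), the `sampler` callable, and the per-triple budget `n_samples` are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(M):
    """Value of the zero-sum matrix game M (row player maximizes) via linear programming."""
    # Shift the payoffs so the value is strictly positive; this does not change the strategies.
    shift = M.min()
    Mp = M - shift + 1.0
    n_rows = Mp.shape[0]
    # Standard LP: min 1^T y  s.t.  Mp^T y >= 1, y >= 0;  the game value of Mp is 1 / sum(y).
    res = linprog(c=np.ones(n_rows),
                  A_ub=-Mp.T, b_ub=-np.ones(Mp.shape[1]),
                  bounds=[(0, None)] * n_rows, method="highs")
    return 1.0 / res.x.sum() + shift - 1.0


def empirical_model(sampler, S, A, B, n_samples):
    """Estimate P(s' | s, a, b) by drawing n_samples next states per (s, a, b) from a generative model.

    `sampler(s, a, b)` is a hypothetical callable returning a next-state index.
    """
    P_hat = np.zeros((S, A, B, S))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                for _ in range(n_samples):
                    P_hat[s, a, b, sampler(s, a, b)] += 1.0
    return P_hat / n_samples


def minimax_value_iteration(P_hat, R, gamma, n_iters=500):
    """Shapley (minimax) value iteration on the empirical zero-sum Markov game."""
    S = P_hat.shape[0]
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R + gamma * P_hat @ V                      # shape (S, A, B)
        V = np.array([matrix_game_value(Q[s]) for s in range(S)])
    return V
```

Under the paper's bound, a per-triple budget of roughly $\tilde{O}((1-\gamma)^{-3}\epsilon^{-2})$ samples suffices, which over the $|S||A||B|$ triples gives the total sample complexity $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ quoted above.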
