Paper Title
Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions
Paper Authors
Paper Abstract
While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transitions and single-controller transitions. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.
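To illustrate the fictitious-play component referenced in the abstract, here is a minimal sketch of classical fictitious play on a zero-sum matrix game (Matching Pennies), where each player best-responds to the opponent's empirical action frequencies. This is not the paper's algorithm: it omits Markov transitions, the UCB-type optimism bonuses, and the episodic structure entirely; the payoff matrix and iteration count are illustrative choices.

```python
import numpy as np

# Row player's payoff matrix for Matching Pennies (zero-sum:
# the column player's payoff is the negation).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

K = 5000                   # number of rounds (illustrative)
row_counts = np.zeros(2)   # empirical action counts of the row player
col_counts = np.zeros(2)   # empirical action counts of the column player
row_counts[0] = col_counts[0] = 1.0  # arbitrary initial play

for _ in range(K):
    # Each player best-responds to the opponent's empirical mixture.
    col_mix = col_counts / col_counts.sum()
    row_mix = row_counts / row_counts.sum()
    row_a = int(np.argmax(A @ col_mix))   # row player maximizes
    col_a = int(np.argmin(row_mix @ A))   # column player minimizes
    row_counts[row_a] += 1
    col_counts[col_a] += 1

# In zero-sum games, the empirical mixtures of fictitious play
# converge to a Nash equilibrium -- here, (0.5, 0.5) for each player.
print(row_counts / row_counts.sum(), col_counts / col_counts.sum())
```

The paper's setting replaces the known payoff matrix with unknown structured transitions, which is where the UCB-style optimistic estimates enter.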