Paper Title

Competitive Policy Optimization

Paper Authors

Manish Prajapat, Kamyar Azizzadenesheli, Alexander Liniger, Yisong Yue, Anima Anandkumar

Paper Abstract

A core challenge in policy optimization in competitive Markov decision processes is the design of efficient optimization methods with desirable convergence and stability properties. To tackle this, we propose competitive policy optimization (CoPO), a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates. Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective. In contrast, off-the-shelf policy gradient methods utilize only linear approximations, and hence do not capture interactions among the players. We instantiate CoPO in two ways: (i) competitive policy gradient, and (ii) trust-region competitive policy optimization. We theoretically study these methods, and empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games. We observe that they provide stable optimization, convergence to sophisticated strategies, and higher scores when played against baseline policy gradient methods.
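For intuition about the bilinear approximation mentioned in the abstract, below is a minimal sketch of the competitive gradient update that CoPO builds on, applied to a toy zero-sum bilinear game f(x, y) = x^T A y rather than to actual policies. The payoff matrix A, the step size eta, and the iteration count are illustrative assumptions, not values from the paper; the point is that each player's step solves the Nash equilibrium of a local bilinear (rather than linear) model of the objective, which is where the mixed second-derivative terms D_xy f and D_yx f enter.

```python
# Minimal sketch (not the paper's code): the competitive gradient update
# on a toy zero-sum bilinear game
#   f(x, y) = x^T A y,  where x minimizes and y maximizes.
# The matrix A, step size eta, and iteration count are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))   # payoff matrix of the toy game
x = rng.standard_normal(n)        # minimizing player's parameters
y = rng.standard_normal(n)        # maximizing player's parameters
eta = 0.2                         # step size
I = np.eye(n)

for _ in range(500):
    # Linear (first-order) terms: all that vanilla policy gradient sees.
    grad_x = A @ y                # gradient of f w.r.t. x
    grad_y = A.T @ x              # gradient of f w.r.t. y
    # Bilinear interaction terms: mixed second derivatives of f.
    D_xy, D_yx = A, A.T
    # Each step is the Nash equilibrium of a regularized local bilinear
    # game, yielding the competitive gradient descent update:
    dx = -eta * np.linalg.solve(I + eta**2 * D_xy @ D_yx,
                                grad_x + eta * D_xy @ grad_y)
    dy = eta * np.linalg.solve(I + eta**2 * D_yx @ D_xy,
                               grad_y - eta * D_yx @ grad_x)
    x, y = x + dx, y + dy

print(np.linalg.norm(x), np.linalg.norm(y))  # both shrink toward 0
```

On this game, simultaneous gradient descent-ascent is known to diverge, while the interaction-aware step above contracts toward the equilibrium (0, 0) in every singular direction of A; this stability under player interaction is the property the abstract claims CoPO brings to policy optimization.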
