Paper Title
A Deeper Understanding of State-Based Critics in Multi-Agent Reinforcement Learning
Paper Authors
Paper Abstract
Centralized Training for Decentralized Execution, where training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one which has had limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even if the state-based critics do not introduce any bias, they can still result in a larger gradient variance, contrary to the common intuition. Finally, we show the effects of the theories in practice by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties are related to the effectiveness of different types of critics.
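To make the comparison in the abstract concrete, the sketch below writes out the two centralized-critic gradient estimators being contrasted, using standard Dec-POMDP notation (s: true state, h^i: agent i's action-observation history, h: joint history, a: joint action). The notation and the exact form are illustrative assumptions, not quoted from the paper.

\[
\nabla_\theta J(\theta) \;=\; \mathbb{E}\Big[\sum_i \nabla_\theta \log \pi^i_\theta(a^i \mid h^i)\, Q_\pi(h, a)\Big] \qquad \text{(history-based centralized critic)}
\]
\[
\nabla_\theta \hat{J}(\theta) \;=\; \mathbb{E}\Big[\sum_i \nabla_\theta \log \pi^i_\theta(a^i \mid h^i)\, Q_\pi(s, a)\Big] \qquad \text{(state-based centralized critic)}
\]

The first form conditions the critic on the same information the decentralized policies can, in principle, recover; the second exploits the true state that is available only during centralized training. The abstract's bias result concerns when replacing Q_\pi(h, a) with Q_\pi(s, a) changes the expectation of this estimator, and its variance result states that even when the expectation is unchanged, the state-based substitution can still increase the estimator's variance.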