Paper Title
FACMAC: Factored Multi-Agent Centralised Policy Gradients
Paper Authors
Paper Abstract
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces. Like MADDPG, a popular multi-agent actor-critic method, our approach uses deep deterministic policy gradients to learn policies. However, FACMAC learns a centralised but factored critic, which combines per-agent utilities into the joint action-value function via a non-linear monotonic function, as in QMIX, a popular multi-agent Q-learning algorithm. Unlike QMIX, however, there are no inherent constraints on factoring the critic. We thus also employ a non-monotonic factorisation and empirically demonstrate that its increased representational capacity allows it to solve some tasks that cannot be solved with monolithic or monotonically factored critics. In addition, FACMAC uses a centralised policy gradient estimator that optimises over the entire joint action space, rather than optimising over each agent's action space separately as in MADDPG. This allows for more coordinated policy changes and fully reaps the benefits of a centralised critic. We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks. Empirical results demonstrate FACMAC's superior performance over MADDPG and other baselines in all three domains.
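To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch (not the authors' reference implementation) of (i) a centralised but factored critic that mixes per-agent utilities into a joint value Q_tot with a monotonic, non-negative-weight mixing network in the style of QMIX, and (ii) a centralised policy gradient that backpropagates through Q_tot evaluated at the joint action of all agents' current policies, rather than treating each agent's action separately. All module names and sizes (AgentCritic, MonotonicMixer, hidden widths) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a factored critic with monotonic mixing and a
# centralised policy gradient; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AgentCritic(nn.Module):
    """Per-agent utility Q_i(o_i, a_i)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # (batch, 1)


class MonotonicMixer(nn.Module):
    """State-conditioned mixing of per-agent utilities into Q_tot.
    Non-negative weights keep Q_tot monotonic in each Q_i (QMIX-style)."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(b, 1, self.embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        hidden = F.elu(agent_qs.unsqueeze(1) @ w1 + b1)
        return (hidden @ w2 + b2).view(b, 1)  # Q_tot: (batch, 1)


def centralised_policy_loss(actors, critics, mixer, obs_per_agent, state):
    """Centralised policy gradient: evaluate Q_tot at the joint action sampled
    from all agents' current deterministic policies and ascend on it, so the
    gradient flows through every policy at once."""
    agent_qs = []
    for actor, critic, obs in zip(actors, critics, obs_per_agent):
        action = actor(obs)                      # differentiable w.r.t. policy params
        agent_qs.append(critic(obs, action))
    agent_qs = torch.cat(agent_qs, dim=-1)       # (batch, n_agents)
    q_tot = mixer(agent_qs, state)
    return -q_tot.mean()                         # minimise negative Q_tot
```

In this sketch the monotonicity constraint lives entirely in the mixer's absolute-value weights; dropping that constraint (e.g. using unconstrained weights) would correspond to the non-monotonic factorisation the abstract mentions, at the cost of losing the QMIX-style guarantee.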