Paper Title

Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Paper Authors

Tabish Rashid, Gregory Farquhar, Bei Peng, Shimon Whiteson

Paper Abstract

QMIX is a popular $Q$-learning algorithm for cooperative MARL in the centralised training and decentralised execution paradigm. In order to enable easy decentralisation, QMIX restricts the joint action $Q$-values it can represent to be a monotonic mixing of each agent's utilities. However, this restriction prevents it from representing value functions in which an agent's ordering over its actions can depend on other agents' actions. To analyse this representational limitation, we first formalise the objective QMIX optimises, which allows us to view QMIX as an operator that first computes the $Q$-learning targets and then projects them into the space representable by QMIX. This projection returns a representable $Q$-value that minimises the unweighted squared error across all joint actions. We show in particular that this projection can fail to recover the optimal policy even with access to $Q^*$, which primarily stems from the equal weighting placed on each joint action. We rectify this by introducing a weighting into the projection, in order to place more importance on the better joint actions. We propose two weighting schemes and prove that they recover the correct maximal action for any joint action $Q$-values, and therefore for $Q^*$ as well. Based on our analysis and results in the tabular setting, we introduce two scalable versions of our algorithm, Centrally-Weighted (CW) QMIX and Optimistically-Weighted (OW) QMIX, and demonstrate improved performance on both predator-prey and challenging multi-agent StarCraft benchmark tasks.
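Based on the abstract's description, the projection and its weighted variant can be sketched as follows; the notation (the monotonically factorisable set $\mathcal{Q}^{\mathrm{mono}}$ and the weighting function $w$) is assumed for illustration rather than quoted from the paper. QMIX's projection maps joint action values onto the monotonic set by minimising an unweighted squared error over joint actions $\mathbf{u}$,

$$\Pi Q \;:=\; \operatorname*{arg\,min}_{q \,\in\, \mathcal{Q}^{\mathrm{mono}}} \;\sum_{\mathbf{u}} \big(Q(s, \mathbf{u}) - q(s, \mathbf{u})\big)^2,$$

whereas Weighted QMIX inserts a weighting $w(s, \mathbf{u})$ into the same objective so that better joint actions dominate the fit:

$$\Pi_w Q \;:=\; \operatorname*{arg\,min}_{q \,\in\, \mathcal{Q}^{\mathrm{mono}}} \;\sum_{\mathbf{u}} w(s, \mathbf{u})\, \big(Q(s, \mathbf{u}) - q(s, \mathbf{u})\big)^2.$$

The CW and OW variants named in the abstract correspond to two concrete choices of $w$; their exact definitions are given in the paper itself.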
