Paper Title

Multi-Agent Reinforcement Learning with Reward Delays

Paper Authors

Yuyang Zhang, Runyu Zhang, Yuantao Gu, Na Li

Paper Abstract

This paper considers multi-agent reinforcement learning (MARL) where rewards are received after delays, and the delay time varies across agents and across time steps. Based on the V-learning framework, the paper proposes MARL algorithms that efficiently handle reward delays. When the delays are finite, the algorithm reaches a coarse correlated equilibrium (CCE) at rate $\tilde{\mathcal{O}}(\frac{H^3\sqrt{S\mathcal{T}_K}}{K}+\frac{H^3\sqrt{SA}}{\sqrt{K}})$, where $K$ is the number of episodes, $H$ is the planning horizon, $S$ is the size of the state space, $A$ is the size of the largest action space, and $\mathcal{T}_K$ is a measure of total delay formally defined in the paper. Moreover, the algorithm is extended to the case of infinite delays through a reward-skipping scheme, and it achieves a convergence rate similar to that of the finite-delay case.
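For intuition about the feedback model in the abstract, the following is a minimal Python sketch of an episodic multi-agent setting in which each agent's step reward becomes observable only after a delay that varies across agents and time steps. It is an illustrative assumption, not the paper's V-learning algorithm: the names `sample_delay` and `total_delay`, the uniform delay distribution, and the use of the summed delays as a proxy for $\mathcal{T}_K$ are all hypothetical, and the paper's formal definition of $\mathcal{T}_K$ may differ.

```python
import random
from collections import defaultdict

# Illustrative sketch only (not the paper's algorithm): simulate an episodic
# multi-agent setting where agent i's reward from step h of episode k becomes
# observable only after an agent- and step-dependent delay.

H = 5            # planning horizon
K = 10           # number of episodes
NUM_AGENTS = 2
MAX_DELAY = 3    # finite-delay regime

def sample_delay():
    # Delays differ per agent and per time step; drawn uniformly here for illustration.
    return random.randint(0, MAX_DELAY)

# pending[t] holds rewards that first become observable at episode t
pending = defaultdict(list)
total_delay = 0  # simple proxy for a total-delay measure like T_K (formal definition differs)

for k in range(K):
    # Deliver any rewards whose delays have elapsed by episode k.
    for (agent, ep, step, r) in pending.pop(k, []):
        print(f"episode {k}: agent {agent} observes reward {r:.2f} from (episode {ep}, step {step})")

    # Generate this episode's rewards; each is revealed only after its own delay.
    for h in range(H):
        for agent in range(NUM_AGENTS):
            r = random.random()
            d = sample_delay()
            total_delay += d
            pending[k + d].append((agent, k, h, r))

# In this toy sketch, rewards delayed past the last episode are simply never observed,
# which loosely mirrors why an infinite-delay setting needs a reward-skipping scheme.
print("total accumulated delay (proxy for T_K):", total_delay)
```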
