Paper Title

Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic

Paper Authors

Qijun Luo, Xiao Li

Paper Abstract

Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Despite their empirical success, the theoretical convergence properties of decentralized AC algorithms remain largely unexplored. Most existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step-size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the \emph{single-timescale} update is widely used, where the actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm. Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity of a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When reduced to the single-agent setting, our result yields a new sample complexity for centralized AC under a single-timescale update scheme. Central to establishing our complexity result is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local-action-privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over existing decentralized AC algorithms.
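To illustrate the single-timescale update pattern the abstract describes (actor and critic updated alternately each step, with step sizes of the same order, rather than running an inner critic loop to convergence), here is a minimal single-agent sketch on a toy two-state MDP. The environment, hyperparameters, and all names are illustrative assumptions for exposition; this is not the paper's decentralized algorithm.

```python
import numpy as np

# Toy 2-state, 2-action MDP: P[s, a] is the next-state distribution,
# R[s, a] the reward. Both are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.4, 0.6]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
gamma = 0.95

theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
w = np.zeros(n_states)                   # critic: tabular (linear) state values

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

# Single-timescale: alpha and beta are of the same order.
alpha, beta = 0.05, 0.05
s = 0
for t in range(20000):
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # TD error, shared by both updates.
    delta = r + gamma * w[s_next] - w[s]
    # Critic: one gradient step per sample (no inner loop to convergence).
    w[s] += beta * delta
    # Actor: log-policy-gradient step immediately after, same-order step size.
    grad = -pi
    grad[a] += 1.0
    theta[s] += alpha * delta * grad
    s = s_next
```

The key point of the sketch is the alternation: each sample triggers exactly one critic step and one actor step with step sizes of the same order, which is the regime whose finite-time analysis the paper provides.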
