Paper Title

Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic

Paper Authors

Qijun Luo, Xiao Li

Paper Abstract

Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Despite their empirical success, the theoretical convergence properties of decentralized AC algorithms remain largely unexplored. Most existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step-size rule, and this is the case even for the centralized AC algorithm in the single-agent setting. In practice, the \emph{single-timescale} update is widely used, where the actor and critic are updated in an alternating manner with step sizes of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm. Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has a sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity of a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When reduced to the single-agent setting, our result yields a new sample complexity for centralized AC under a single-timescale update scheme. Central to establishing our complexity result is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local-action-privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over existing decentralized AC algorithms.
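To illustrate the single-timescale update pattern the abstract describes (actor and critic updated alternately each step, with step sizes of the same order, rather than running an inner critic loop to convergence), here is a minimal single-agent sketch on a toy two-state MDP. The environment, hyperparameters, and all names are illustrative assumptions for exposition; this is not the paper's decentralized algorithm.

```python
import numpy as np

# Toy 2-state, 2-action MDP: P[s, a] is the next-state distribution,
# R[s, a] the reward. Both are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.4, 0.6]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
gamma = 0.95

theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters
w = np.zeros(n_states)                   # critic: tabular (linear) state values

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

# Single-timescale: alpha and beta are of the same order.
alpha, beta = 0.05, 0.05
s = 0
for t in range(20000):
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # TD error, shared by both updates.
    delta = r + gamma * w[s_next] - w[s]
    # Critic: one gradient step per sample (no inner loop to convergence).
    w[s] += beta * delta
    # Actor: log-policy-gradient step immediately after, same-order step size.
    grad = -pi
    grad[a] += 1.0
    theta[s] += alpha * delta * grad
    s = s_next
```

The key point of the sketch is the alternation: each sample triggers exactly one critic step and one actor step with step sizes of the same order, which is the regime whose finite-time analysis the paper provides.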
