Paper Title
Block Policy Mirror Descent
Paper Authors
Paper Abstract
In this paper, we present a new policy gradient (PG) method, namely the block policy mirror descent (BPMD) method, for solving a class of regularized reinforcement learning (RL) problems with (strongly) convex regularizers. Compared to traditional PG methods with a batch update rule, which visits and updates the policy at every state, the BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update only at a sampled state. Despite the nonconvex nature of the problem and the partial update rule, we provide a unified analysis for several sampling schemes and show that BPMD achieves fast linear convergence to global optimality. In particular, uniform sampling leads to a worst-case total computational complexity comparable to that of batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the state space and consequently outperforming batch PG methods. We then extend BPMD to the stochastic setting by utilizing stochastic first-order information constructed from samples. With a generative model, $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A}\right\rvert /\epsilon)$ (resp. $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A} \right\rvert /\epsilon^2)$) sample complexities are established for strongly convex (resp. non-strongly-convex) regularizers, where $\epsilon$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
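To make the partial update rule concrete, below is a minimal Python sketch of a BPMD-style iteration for a tabular MDP, assuming KL divergence as the Bregman distance and negative entropy (weight `tau`) as the strongly convex regularizer; the closed-form proximal step follows the standard policy mirror descent update. The functions `evaluate_q` and `bpmd_step`, the cost-minimization convention, and all parameter choices are illustrative assumptions, not the paper's exact algorithm (e.g., the regularizer enters only through the proximal step here, and the paper's step sizes, regularized evaluation, and sampling schemes may differ).

```python
# A minimal BPMD-style sketch: tabular MDP, costs to be minimized,
# KL-proximal (mirror descent) update applied only at one sampled state.
import numpy as np

def evaluate_q(P, c, policy, gamma=0.9):
    """Exact (unregularized) Q^pi for a tabular MDP.
    P: (S, A, S) transition tensor; c: (S, A) per-stage costs."""
    S, A = c.shape
    P_pi = np.einsum("sap,sa->sp", P, policy)   # state kernel under pi
    c_pi = np.einsum("sa,sa->s", c, policy)     # expected cost under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return c + gamma * P @ v                    # Q[s,a] = c[s,a] + gamma * E[v(s')]

def bpmd_step(policy, q, state, eta=1.0, tau=0.05):
    """Partial update: KL-prox step with entropy regularization at one state.
    Closed form: pi_{k+1}(a|s) ∝ exp((log pi_k(a|s) - eta * Q(s,a)) / (1 + eta*tau))."""
    logits = (np.log(policy[state]) - eta * q[state]) / (1.0 + eta * tau)
    logits -= logits.max()                      # numerical stability
    new_row = np.exp(logits)
    policy[state] = new_row / new_row.sum()
    return policy

# Usage: uniform state sampling with exact policy evaluation on a random MDP.
rng = np.random.default_rng(0)
S, A, gamma, eta, tau = 6, 3, 0.9, 1.0, 0.05
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
c = rng.random((S, A))
policy = np.full((S, A), 1.0 / A)
for k in range(200):
    q = evaluate_q(P, c, policy, gamma)
    s = rng.integers(S)                         # uniform sampling over states
    policy = bpmd_step(policy, q, s, eta, tau)
```

Note that each iteration touches a single row of the policy table, which is what gives BPMD its cheap per-iteration cost relative to a batch update over all states.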