Paper Title

Mixed Reinforcement Learning with Additive Stochastic Uncertainty

Paper Authors

Yao Mu, Shengbo Eben Li, Chang Liu, Qi Sun, Bingbing Nie, Bo Cheng, Baiyu Peng

Paper Abstract

Reinforcement learning (RL) methods often rely on massive exploration data to search for optimal policies and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm that simultaneously uses dual representations of the environmental dynamics to search for the optimal policy, with the purpose of improving both learning accuracy and training speed. The dual representations are the environmental model and the state-action data: the former can accelerate the learning process of RL, while its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the framework design of mixed RL, compensation of the additive stochastic model uncertainty is embedded inside the policy-iteration RL framework by using explored state-action data via an iterative Bayesian estimator (IBE). The optimal policy is then computed iteratively by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of mixed RL is proved using Bellman's principle of optimality, and the recursive stability of the generated policy is proved via Lyapunov's direct method. The effectiveness of mixed RL is demonstrated on a typical optimal control problem for stochastic non-affine nonlinear systems (i.e., a double lane change task with an automated vehicle).
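
To make the abstract's algorithmic skeleton concrete, below is a minimal, self-contained sketch on a toy one-dimensional system. It assumes a known nominal linear model with an unknown additive Gaussian disturbance, a discounted quadratic cost, and a recursive conjugate-Gaussian mean update as a stand-in for the paper's iterative Bayesian estimator (IBE). All names (`step_env`, `IterativeBayesEstimator`, `policy_evaluation`, `policy_improvement`), the toy dynamics, and the cost parameters are illustrative assumptions, not the authors' implementation; the paper itself targets stochastic non-affine nonlinear systems.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" environment: x' = a*x + b*u + w, with an unknown additive disturbance w.
# The deterministic part (a, b) plays the role of the nominal environmental model;
# the additive stochastic term w is the uncertainty to be compensated from data.
A_NOM, B_NOM = 0.9, 0.5
W_MEAN_TRUE, W_STD = 0.2, 0.05

def step_env(x, u):
    """One step of the real plant; only (x, u, x') is observed, not w itself."""
    w = rng.normal(W_MEAN_TRUE, W_STD)
    return A_NOM * x + B_NOM * u + w

class IterativeBayesEstimator:
    """Stand-in for the IBE: recursive conjugate-Gaussian update of the disturbance
    mean from model residuals, assuming the residual noise variance is known."""
    def __init__(self, prior_mean=0.0, prior_var=1.0, noise_var=W_STD**2):
        self.mean, self.var, self.noise_var = prior_mean, prior_var, noise_var

    def update(self, residual):
        # residual = x_next - (a*x + b*u), i.e. one noisy observation of w
        gain = self.var / (self.var + self.noise_var)
        self.mean += gain * (residual - self.mean)
        self.var *= (1.0 - gain)
        return self.mean

# Policy iteration on the compensated model, with affine policy u = -k*x - d
# and discounted quadratic cost Q*x^2 + R*u^2.
Q_COST, R_COST, GAMMA = 1.0, 0.1, 0.95

def policy_evaluation(k, d, w_hat, n_iter=200):
    """PEV: fixed-point iteration for V(x) = p*x^2 + q*x + c under u = -k*x - d.
    The constant variance contribution to c is omitted; it does not affect PIM."""
    p = q = c = 0.0
    a_cl = A_NOM - B_NOM * k          # closed-loop state coefficient
    m = w_hat - B_NOM * d             # mean additive offset seen by the closed loop
    for _ in range(n_iter):
        p_new = Q_COST + R_COST * k**2 + GAMMA * p * a_cl**2
        q_new = 2.0 * R_COST * k * d + GAMMA * (2.0 * p * a_cl * m + q * a_cl)
        c_new = R_COST * d**2 + GAMMA * (p * m**2 + q * m + c)
        p, q, c = p_new, q_new, c_new
    return p, q, c

def policy_improvement(p, q, w_hat):
    """PIM: greedy affine policy from the u-dependent part of the Bellman backup,
    i.e. minimizing R*u^2 + GAMMA*E[V(a*x + b*u + w_hat)] over u for each x."""
    denom = R_COST + GAMMA * p * B_NOM**2
    k = GAMMA * p * A_NOM * B_NOM / denom
    d = GAMMA * B_NOM * (2.0 * p * w_hat + q) / (2.0 * denom)
    return k, d

# Mixed RL loop: explore, refine the uncertainty estimate, alternate PEV/PIM.
ibe = IterativeBayesEstimator()
k, d, x = 0.0, 0.0, 1.0
for _ in range(30):
    u = -k * x - d
    x_next = step_env(x, u)
    ibe.update(x_next - (A_NOM * x + B_NOM * u))   # data-driven compensation of w
    p, q, _ = policy_evaluation(k, d, ibe.mean)    # PEV on the compensated model
    k, d = policy_improvement(p, q, ibe.mean)      # PIM
    x = x_next

print(f"estimated disturbance mean: {ibe.mean:.3f} (true value {W_MEAN_TRUE})")
print(f"final policy u = -{k:.3f}*x - {d:.3f}")
```

The sketch illustrates the trade-off the abstract describes: the nominal model makes each PEV/PIM backup cheap and fast, while the Bayesian estimator folds explored state-action data back in as a compensation of the additive uncertainty, pulling the resulting policy toward the accuracy of a purely data-driven approach.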
