Paper Title
Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees
Paper Authors
Paper Abstract
This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Processes (RMDPs) with large state spaces. The goal of the RMDP framework is to find a policy that is robust against the parameter uncertainty caused by the mismatch between the simulator model and the real-world setting. We first propose the Robust Least Squares Policy Evaluation algorithm, a multi-step, online, model-free learning algorithm for policy evaluation. We prove the convergence of this algorithm using stochastic approximation techniques. We then propose the Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy. We also give a general weighted Euclidean norm bound on the error (closeness to optimality) of the resulting policy. Finally, we demonstrate the performance of our RLSPI algorithm on some standard benchmark problems.
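To make the flavor of the approach concrete, below is a minimal sketch of robust policy evaluation with linear function approximation, in the spirit of the robust policy evaluation step described in the abstract. The R-contamination uncertainty set, the feature matrix, and every function name and hyperparameter are illustrative assumptions, not the paper's exact algorithm; in particular, the paper's method is online and model-free, whereas this sketch uses a known nominal model for brevity.

```python
# A minimal sketch of robust policy evaluation with linear function
# approximation. Assumptions (not from the paper): an R-contamination
# uncertainty set, a known nominal model, and synthetic features.
import numpy as np

def robust_ls_policy_evaluation(P, R, Phi, gamma=0.95, r=0.1, n_iters=200):
    """Projected robust Bellman iteration with linear features.

    P:   (S, S) nominal transition matrix under the fixed policy
    R:   (S,)   expected one-step reward under the policy
    Phi: (S, d) feature matrix
    r:   contamination level of the assumed uncertainty set
    """
    S, d = Phi.shape
    theta = np.zeros(d)
    # Least-squares projection onto the span of the features.
    proj = np.linalg.pinv(Phi.T @ Phi) @ Phi.T
    for _ in range(n_iters):
        V = Phi @ theta
        # Robust backup: an adversary in the R-contamination set moves
        # probability mass r to the worst successor state.
        target = R + gamma * ((1.0 - r) * (P @ V) + r * V.min())
        theta = proj @ target  # project the backup back onto span(Phi)
    return theta

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
S, d = 20, 4
P = rng.dirichlet(np.ones(S), size=S)   # nominal policy-induced transitions
R = rng.uniform(size=S)                 # rewards under the policy
Phi = rng.standard_normal((S, d))       # illustrative features
theta = robust_ls_policy_evaluation(P, R, Phi)
print("fitted weights:", theta)
```

An RLSPI-style outer loop would alternate such a robust evaluation step with greedy policy improvement; the weighted Euclidean norm bound mentioned in the abstract controls how far the resulting policy can be from the optimal robust policy.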