Paper Title

Online Policy Optimization for Robust MDP

Paper Authors

Jing Dong, Jingwei Li, Baoxiang Wang, Jingzhao Zhang

Paper Abstract

Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is less common, as RL models can be very sensitive to slight perturbations of the environment. The robust Markov decision process (MDP) framework -- in which the transition probabilities belong to an uncertainty set around a nominal model -- provides one way to develop robust models. While previous analysis shows that RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient in the more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDPs, in which the agent learns by interacting with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret bound for online robust MDPs.
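To make the robust MDP framework in the abstract concrete, here is a minimal sketch of the robust Bellman backup with an (s, a)-rectangular total-variation uncertainty set of radius rho around a given nominal kernel. This is an illustration only, not the paper's algorithm: the paper treats the online setting where the nominal model is unknown, whereas this sketch assumes it is given. The names (`worst_case_value`, `robust_value_iteration`) and the greedy solution of the inner minimization are assumptions made for the example.

```python
import numpy as np

def worst_case_value(p_hat, v, rho):
    # Inner minimization of the robust Bellman backup:
    #   min_p  p @ v   s.t.  ||p - p_hat||_1 <= rho,  p in the simplex.
    # Greedy optimum for a TV ball: move up to rho/2 probability mass
    # from the highest-value states onto the single lowest-value state.
    p = p_hat.copy()
    budget = rho / 2.0
    receiver = np.argmin(v)
    for donor in np.argsort(v)[::-1]:   # donate mass from highest value first
        if budget <= 0.0:
            break
        if donor == receiver:
            continue
        move = min(p[donor], budget)
        p[donor] -= move
        p[receiver] += move
        budget -= move
    return float(p @ v)

def robust_value_iteration(p_hat, r, gamma=0.9, rho=0.1, n_iter=500):
    # p_hat: (S, A, S) nominal transition kernel; r: (S, A) reward table.
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = np.array([
            [r[s, a] + gamma * worst_case_value(p_hat[s, a], v, rho)
             for a in range(n_actions)]
            for s in range(n_states)
        ])
        v = q.max(axis=1)   # greedy policy against the worst-case value
    return v
```

In the paper's online setting the nominal kernel is not available, so the backup above cannot be computed directly; per the abstract, the proposed algorithm instead uses an optimistic update rule derived via Fenchel conjugates to handle that additional uncertainty.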
