Paper Title

Online Policy Optimization for Robust MDP

Paper Authors

Jing Dong, Jingwei Li, Baoxiang Wang, Jingzhao Zhang

Paper Abstract

Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go. However, real-world deployment of end-to-end RL models is less common, as RL models can be very sensitive to slight perturbations of the environment. The robust Markov decision process (MDP) framework -- in which the transition probabilities belong to an uncertainty set around a nominal model -- provides one way to develop robust models. While previous analysis shows that RL algorithms are effective assuming access to a generative model, it remains unclear whether RL can be efficient in the more realistic online setting, which requires a careful balance between exploration and exploitation. In this work, we consider online robust MDPs, in which the agent learns by interacting with an unknown nominal system. We propose a robust optimistic policy optimization algorithm that is provably efficient. To address the additional uncertainty caused by an adversarial environment, our model features a new optimistic update rule derived via Fenchel conjugates. Our analysis establishes the first regret bound for online robust MDPs.
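To make the robust MDP framework in the abstract concrete, here is a minimal sketch of the robust Bellman backup with an (s, a)-rectangular total-variation uncertainty set of radius rho around a given nominal kernel. This is an illustration only, not the paper's algorithm: the paper treats the online setting where the nominal model is unknown, whereas this sketch assumes it is given. The names (`worst_case_value`, `robust_value_iteration`) and the greedy solution of the inner minimization are assumptions made for the example.

```python
import numpy as np

def worst_case_value(p_hat, v, rho):
    # Inner minimization of the robust Bellman backup:
    #   min_p  p @ v   s.t.  ||p - p_hat||_1 <= rho,  p in the simplex.
    # Greedy optimum for a TV ball: move up to rho/2 probability mass
    # from the highest-value states onto the single lowest-value state.
    p = p_hat.copy()
    budget = rho / 2.0
    receiver = np.argmin(v)
    for donor in np.argsort(v)[::-1]:   # donate mass from highest value first
        if budget <= 0.0:
            break
        if donor == receiver:
            continue
        move = min(p[donor], budget)
        p[donor] -= move
        p[receiver] += move
        budget -= move
    return float(p @ v)

def robust_value_iteration(p_hat, r, gamma=0.9, rho=0.1, n_iter=500):
    # p_hat: (S, A, S) nominal transition kernel; r: (S, A) reward table.
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    for _ in range(n_iter):
        q = np.array([
            [r[s, a] + gamma * worst_case_value(p_hat[s, a], v, rho)
             for a in range(n_actions)]
            for s in range(n_states)
        ])
        v = q.max(axis=1)   # greedy policy against the worst-case value
    return v
```

In the paper's online setting the nominal kernel is not available, so the backup above cannot be computed directly; per the abstract, the proposed algorithm instead uses an optimistic update rule derived via Fenchel conjugates to handle that additional uncertainty.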
