Paper Title

Near-optimal Policy Identification in Active Reinforcement Learning

Paper Authors

Xiang Li, Viraj Mehta, Johannes Kirschner, Ian Char, Willie Neiswanger, Jeff Schneider, Andreas Krause, Ilija Bogunovic

Paper Abstract

Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a generative model. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy uniformly over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
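
To make the optimism/pessimism idea from the abstract concrete, here is a minimal, hypothetical sketch (not the authors' implementation): a kernel ridge regression model provides upper- and lower-confidence Q-estimates, and the generative model is queried at the candidate state-action pair where the gap between the two estimates is largest. The function names, the RBF kernel, and the confidence coefficient `beta` are illustrative assumptions only.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def confidence_q(X, y, X_query, beta=2.0, reg=1e-3):
    """Kernel ridge posterior mean and width at query points.
    Returns optimistic (mean + beta*width) and pessimistic (mean - beta*width) Q-values."""
    K = rbf_kernel(X, X)
    K_inv = np.linalg.inv(K + reg * np.eye(len(X)))
    k_q = rbf_kernel(X_query, X)
    mean = k_q @ K_inv @ y
    # Predictive variance of the kernel model (clipped for numerical safety).
    var = np.clip(1.0 - np.einsum("ij,jk,ik->i", k_q, K_inv, k_q), 0.0, None)
    width = np.sqrt(var)
    return mean + beta * width, mean - beta * width

# Active exploration with a generative model: any state-action pair may be queried,
# so pick the candidate where the optimistic/pessimistic gap (uncertainty) is largest.
rng = np.random.default_rng(0)
X_data = rng.uniform(-1, 1, size=(20, 2))   # observed (state, action) pairs (toy data)
y_data = np.sin(X_data.sum(axis=1))         # stand-in Bellman targets
X_cand = rng.uniform(-1, 1, size=(100, 2))  # candidate queries
q_up, q_lo = confidence_q(X_data, y_data, X_cand)
next_query = X_cand[np.argmax(q_up - q_lo)]
print("next state-action to query:", next_query)
```

In this toy setting, repeating the query-and-refit loop shrinks the gap between the optimistic and pessimistic estimates over the whole candidate set, which mirrors the abstract's goal of identifying a near-optimal policy uniformly over the state space rather than only along visited trajectories.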
