Paper Title

Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

Authors

Uehara, Masatoshi, Sekhari, Ayush, Lee, Jason D., Kallus, Nathan, Sun, Wen

Abstract

We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework}, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model Hilbert Space Embeddings of POMDPs and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory based policies (that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.
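The policy class described above consists of memory-based policies that act on a fixed-length window of recent observations. The following is a minimal illustrative sketch of that idea, not an implementation of the paper's actor-critic algorithm; the class name, tabular preference table, and uniform fallback rule are all assumptions made for the example.

```python
from collections import deque
import random

class MemoryBasedPolicy:
    """Sketch of a memory-based policy: it selects actions using only a
    fixed-length window of the most recent observations (the "memory").
    All names and the tabular representation are illustrative only."""

    def __init__(self, window_len, actions, seed=0):
        self.window_len = window_len
        self.actions = actions
        # deque with maxlen keeps exactly the last `window_len` observations
        self.memory = deque(maxlen=window_len)
        self.rng = random.Random(seed)
        # hypothetical tabular critic output: memory tuple -> action preferences
        self.preferences = {}

    def observe(self, obs):
        """Append a new observation; older ones beyond the window are dropped."""
        self.memory.append(obs)

    def act(self):
        """Greedy action for the current memory; uniform if memory is unseen."""
        key = tuple(self.memory)
        prefs = self.preferences.get(key)
        if prefs is None:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: prefs.get(a, 0.0))

policy = MemoryBasedPolicy(window_len=3, actions=["left", "right"])
for obs in ["o1", "o2", "o3", "o4"]:
    policy.observe(obs)
# only the last window_len observations remain in memory
print(list(policy.memory))  # ['o2', 'o3', 'o4']
```

The point of the sketch is the fixed-length memory: the policy's input is a bounded summary of history rather than the full observation sequence, which is what makes agnostic learning over such a policy class tractable in the framework.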
