Paper Title

Stable Reinforcement Learning with Unbounded State Space

Paper Authors

Devavrat Shah, Qiaomin Xie, Zhi Xu

Paper Abstract

We consider the problem of reinforcement learning (RL) with an unbounded state space, motivated by the classical problem of scheduling in a queueing network. Traditional policies, as well as error metrics designed for finite, bounded, or compact state spaces, require infinitely many samples to provide any meaningful performance guarantee (e.g., $\ell_\infty$ error) for an unbounded state space. That is, we need a new notion of performance metric. As the main contribution of this work, inspired by the literature on queueing systems and control theory, we propose stability as the notion of "goodness": the state dynamics under the policy should remain in a bounded region with high probability. As a proof of concept, we propose an RL policy using a Sparse-Sampling-based Monte Carlo Oracle and argue that it satisfies the stability property as long as the system dynamics under the optimal policy respect a Lyapunov function. The assumption that a Lyapunov function exists is not restrictive, as it is equivalent to the positive recurrence or stability property of any Markov chain; that is, if there is any policy that can stabilize the system, then it must possess a Lyapunov function. Moreover, our policy does not utilize knowledge of the specific Lyapunov function. To make our method sample efficient, we provide an improved, sample-efficient Sparse-Sampling-based Monte Carlo Oracle with a Lipschitz value function, which may be of interest in its own right. Furthermore, we design an adaptive version of the algorithm, based on carefully constructed statistical tests, which finds the correct tuning parameter automatically.
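
The Sparse-Sampling-based Monte Carlo Oracle mentioned in the abstract is, at its core, a look-ahead planner that queries a generative model of the system. Below is a minimal Python sketch of such a sparse-sampling oracle in the style of Kearns, Mansour, and Ng, applied to a toy single-server queue; the function names, the simulator `queue_sim`, and all numerical parameters are illustrative assumptions on our part, not the algorithm analyzed in the paper.

```python
import random

def sparse_sampling_value(sim, state, actions, depth, width, gamma):
    """Estimate the optimal value of `state` via sparse sampling:
    for each action, draw `width` next-state samples from the
    generative model `sim` and recurse for `depth` levels.
    (Illustrative sketch, not the paper's implementation.)"""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            next_state, reward = sim(state, a)  # one generative-model sample
            total += reward + gamma * sparse_sampling_value(
                sim, next_state, actions, depth - 1, width, gamma)
        best = max(best, total / width)
    return best

def sparse_sampling_policy(sim, state, actions, depth, width, gamma):
    """Greedy action with respect to the sparse-sampling Q-value estimates."""
    def q_estimate(a):
        total = 0.0
        for _ in range(width):
            next_state, reward = sim(state, a)
            total += reward + gamma * sparse_sampling_value(
                sim, next_state, actions, depth - 1, width, gamma)
        return total / width
    return max(actions, key=q_estimate)

# Toy single-server queue: the state is the (unbounded) queue length,
# action 1 serves one job, action 0 idles; reward is the negative queue length.
def queue_sim(state, action, p_arrival=0.3):
    arrival = 1 if random.random() < p_arrival else 0
    departure = 1 if (action == 1 and state > 0) else 0
    next_state = state + arrival - departure
    return next_state, -float(next_state)

if __name__ == "__main__":
    action = sparse_sampling_policy(queue_sim, state=5, actions=[0, 1],
                                    depth=3, width=4, gamma=0.9)
    print("chosen action:", action)
```

The look-ahead depth and sampling width here stand in for the kind of tuning parameters that the paper's adaptive, statistical-test-based version of the algorithm is designed to select automatically.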
