Paper Title
Adaptive Sampling for Best Policy Identification in Markov Decision Processes
Paper Authors
Paper Abstract
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound on the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound on the sample-complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP, such as the sub-optimality gaps and the variance of the next-state value function, and thus captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS over state-of-the-art algorithms are discussed and illustrated numerically.
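For readers unfamiliar with this line of work, problem-specific lower bounds of the kind described above are usually stated in a characteristic-time form: the bound is obtained by optimizing a sampling allocation over state-action pairs against the closest confusing MDP. The display below is only a generic sketch of that structure, using assumed notation ($\omega$ for the allocation, $\mathrm{Alt}(\mathcal{M})$ for the set of confusing instances, $T^*$ for the characteristic time); it is not the paper's exact statement.

\[
\mathbb{E}[\tau_\delta] \;\ge\; T^*(\mathcal{M})\,\mathrm{kl}(\delta,\,1-\delta),
\qquad
\frac{1}{T^*(\mathcal{M})} \;=\; \sup_{\omega \in \Sigma_{S \times A}} \;\inf_{\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})} \;\sum_{(s,a)} \omega(s,a)\,\mathrm{KL}\!\big(\mathcal{M}(s,a)\,\|\,\mathcal{M}'(s,a)\big).
\]

Here $\tau_\delta$ is the stopping time of any algorithm that returns the best policy with probability at least $1-\delta$, $\Sigma_{S \times A}$ is the simplex over state-action pairs, and $\mathrm{kl}$ denotes the binary relative entropy. The maximizing allocation $\omega^*$ plays the role of the optimal sample allocation mentioned in the abstract; since the sup-inf program is non-convex in general, an explicit nearly-optimal allocation of the kind the abstract describes is what makes a tracking algorithm such as KLB-TS practical.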