Paper Title
Adaptive Sampling for Best Policy Identification in Markov Decision Processes
Paper Authors
Paper Abstract
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound on the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound on the sample-complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP, such as the sub-optimality gaps and the variance of the next-state value function, and thus captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS over state-of-the-art algorithms are discussed and illustrated numerically.
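For readers unfamiliar with this line of work, problem-specific lower bounds of the kind described above are usually stated in a characteristic-time form: the bound is obtained by optimizing a sampling allocation over state-action pairs against the closest confusing MDP. The display below is only a generic sketch of that structure, using assumed notation ($\omega$ for the allocation, $\mathrm{Alt}(\mathcal{M})$ for the set of confusing instances, $T^*$ for the characteristic time); it is not the paper's exact statement.

\[
\mathbb{E}[\tau_\delta] \;\ge\; T^*(\mathcal{M})\,\mathrm{kl}(\delta,\,1-\delta),
\qquad
\frac{1}{T^*(\mathcal{M})} \;=\; \sup_{\omega \in \Sigma_{S \times A}} \;\inf_{\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})} \;\sum_{(s,a)} \omega(s,a)\,\mathrm{KL}\!\big(\mathcal{M}(s,a)\,\|\,\mathcal{M}'(s,a)\big).
\]

Here $\tau_\delta$ is the stopping time of any algorithm that returns the best policy with probability at least $1-\delta$, $\Sigma_{S \times A}$ is the simplex over state-action pairs, and $\mathrm{kl}$ denotes the binary relative entropy. The maximizing allocation $\omega^*$ plays the role of the optimal sample allocation mentioned in the abstract; since the sup-inf program is non-convex in general, an explicit nearly-optimal allocation of the kind the abstract describes is what makes a tracking algorithm such as KLB-TS practical.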