Paper Title
Best Policy Identification in Linear MDPs
Paper Authors
Paper Abstract
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed-confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}\big(\frac{d}{(\varepsilon+\Delta)^2}(\log(\frac{1}{\delta})+d)\big)$, where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
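To make the setting concrete, below is a minimal, hypothetical sketch (not taken from the paper) of a discounted linear MDP with generative-model access, using the standard linear-MDP structure in which transitions and rewards are linear in a $d$-dimensional feature map, $P(s'\mid s,a)=\phi(s,a)^\top\mu(s')$ and $r(s,a)=\phi(s,a)^\top\theta$. All sizes, the random feature construction, and the illustrative values of $\varepsilon$, $\Delta$, and $\delta$ are assumptions, and absolute constants in the bound are omitted.

import numpy as np

rng = np.random.default_rng(0)
S, A, d = 5, 3, 4        # hypothetical numbers of states, actions, and feature dimension
gamma = 0.9              # discount factor of the setting (not needed by the sampling oracle)

phi = rng.random((S, A, d))          # feature map phi(s, a) in R^d (hypothetical construction)
theta = rng.random(d)                # linear reward parameter
mu = rng.random((S, d))              # transition measures mu(s') in R^d

reward = phi @ theta                 # r(s, a) = phi(s, a)^T theta, shape (S, A)
P = phi @ mu.T                       # unnormalized P(s' | s, a), shape (S, A, S)
P /= P.sum(axis=2, keepdims=True)    # normalize each P(. | s, a) into a distribution

def generative_model(s, a):
    """Generative-model oracle: sample a reward and a next state for any queried
    (state, action) pair, independently of any trajectory."""
    s_next = rng.choice(S, p=P[s, a])
    return reward[s, a], s_next

r, s_next = generative_model(0, 1)
print(f"sampled reward {r:.3f}, next state {s_next}")

# Order of the abstract's sample-complexity bound, d/(eps + Delta)^2 * (log(1/delta) + d),
# evaluated at illustrative values of eps, Delta, delta (constants omitted).
eps, Delta, delta = 0.1, 0.05, 0.01
print(f"bound order: {d / (eps + Delta)**2 * (np.log(1 / delta) + d):.1f}")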