Paper Title

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

Authors

Daniil Tiapkin, Denis Belomestny, Daniele Calandriello, Eric Moulines, Remi Munos, Alexey Naumov, Mark Rowland, Michal Valko, Pierre Menard

Abstract

We consider reinforcement learning in an environment modeled by an episodic, finite, stage-dependent Markov decision process of horizon $H$ with $S$ states, and $A$ actions. The performance of an agent is measured by the regret after interacting with the environment for $T$ episodes. We propose an optimistic posterior sampling algorithm for reinforcement learning (OPSRL), a simple variant of posterior sampling that only needs a number of posterior samples logarithmic in $H$, $S$, $A$, and $T$ per state-action pair. For OPSRL we guarantee a high-probability regret bound of order at most $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$ ignoring $\text{poly}\log(HSAT)$ terms. The key novel technical ingredient is a new sharp anti-concentration inequality for linear forms which may be of independent interest. Specifically, we extend the normal approximation-based lower bound for Beta distributions by Alfers and Dinges [1984] to Dirichlet distributions. Our bound matches the lower bound of order $\Omega(\sqrt{H^3SAT})$, thereby answering the open problems raised by Agrawal and Jia [2017b] for the episodic setting.
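The core algorithmic idea described in the abstract, drawing a small batch of posterior samples per state-action pair and acting on the most optimistic one, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact OPSRL pseudocode: the helper name `sample_optimistic_q`, the sample count `J`, and the uniform Dirichlet prior `prior_count` are assumptions made for illustration.

```python
import numpy as np

def sample_optimistic_q(counts, rewards, J=5, prior_count=1.0, rng=None):
    """Sketch of optimistic posterior sampling (hypothetical helper, not the
    paper's exact OPSRL): backward induction in which each (h, s, a)
    transition model is drawn J times from a Dirichlet posterior and the
    most optimistic sampled Q-value is kept.

    counts:  (H, S, A, S) next-state visit counts
    rewards: (H, S, A) estimated mean rewards
    """
    rng = np.random.default_rng() if rng is None else rng
    H, S, A, _ = counts.shape
    q_opt = np.zeros((H, S, A))
    v_next = np.zeros(S)  # value at stage h + 1; zero beyond the horizon
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                # Dirichlet posterior over the next-state distribution
                alpha = counts[h, s, a] + prior_count
                p = rng.dirichlet(alpha, size=J)           # (J, S) sampled models
                q_samples = rewards[h, s, a] + p @ v_next  # (J,) sampled Q-values
                # Optimism: keep the best of the J posterior samples; the
                # paper shows J logarithmic in H, S, A, T already suffices.
                q_opt[h, s, a] = q_samples.max()
        v_next = q_opt[h].max(axis=1)  # greedy value backup
    return q_opt  # act greedily: a_h = argmax_a q_opt[h, s, a]
```

A single-sample variant ($J = 1$) would recover plain posterior sampling; taking the maximum over $J$ samples is what supplies the optimism behind the $\widetilde{\mathcal{O}}(\sqrt{H^3SAT})$ regret guarantee.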
