Paper Title
Improved Worst-Case Regret Bounds for Randomized Least-Squares Value Iteration
Paper Authors
Paper Abstract
This paper studies regret minimization with randomized value functions in reinforcement learning. In tabular finite-horizon Markov Decision Processes, we introduce a clipping variant of a classical Thompson Sampling (TS)-like algorithm, randomized least-squares value iteration (RLSVI). Our $\tilde{\mathrm{O}}(H^2S\sqrt{AT})$ high-probability worst-case regret bound improves the previous sharpest worst-case regret bounds for RLSVI and matches the existing state-of-the-art worst-case TS-based regret bounds.
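To make the object of study more concrete, below is a minimal Python sketch of tabular randomized least-squares value iteration with a clipping step, in the spirit of the algorithm family the abstract describes. The noise schedule (`noise_scale / sqrt(counts)`), the smoothed transition estimate, and the clipping range $[0, H-h]$ are illustrative assumptions for this sketch, not the exact quantities the paper analyzes.

```python
import numpy as np

def rlsvi_clipped(P, R, S, A, H, T, noise_scale=1.0, seed=0):
    """Illustrative tabular RLSVI with a clipping step (a sketch, not the paper's exact algorithm).

    P: true transitions, shape (S, A, S) -- used only to simulate episodes.
    R: true mean rewards in [0, 1], shape (S, A).
    H: horizon; T: total number of environment steps (K = T // H episodes).
    noise_scale: stand-in for the algorithm's noise variance schedule (assumed here).
    """
    rng = np.random.default_rng(seed)
    K = T // H
    counts = np.ones((S, A))             # visit counts with a +1 pseudo-count
    reward_sum = np.zeros((S, A))
    trans_count = np.zeros((S, A, S))
    total_reward = 0.0

    for _ in range(K):
        # Empirical (regularized) model built from the data gathered so far.
        r_hat = reward_sum / counts
        p_hat = (trans_count + 1.0 / S) / counts[:, :, None]

        # Randomized value iteration: perturb the Bellman backup with Gaussian
        # noise whose scale shrinks with visit counts, then clip the Q-values
        # to the trivial range attainable from step h onward.
        Q = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            V_next = Q[h + 1].max(axis=1)
            noise = noise_scale * rng.standard_normal((S, A)) / np.sqrt(counts)
            Q[h] = np.clip(r_hat + p_hat @ V_next + noise, 0.0, H - h)

        # Execute the greedy policy for one episode and log the transitions.
        s = 0
        for h in range(H):
            a = int(Q[h, s].argmax())
            s_next = rng.choice(S, p=P[s, a])
            counts[s, a] += 1
            reward_sum[s, a] += R[s, a]
            trans_count[s, a, s_next] += 1
            total_reward += R[s, a]
            s = s_next
    return total_reward

# Toy usage on a random 3-state, 2-action MDP with horizon 5.
S, A, H = 3, 2, 5
gen = np.random.default_rng(1)
P = gen.dirichlet(np.ones(S), size=(S, A))
R = gen.uniform(size=(S, A))
print(rlsvi_clipped(P, R, S, A, H, T=10_000))
```

The clipping line is the point of contact with the abstract: unlike plain RLSVI, the perturbed backup is truncated so that the randomized value estimates can never leave the range a length-$(H-h)$ suffix of the episode could actually achieve.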