Paper Title
Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition
Paper Authors
Paper Abstract
This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds'' guarantee: it achieves $\mathcal{O}(\log T)$ regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with $\tilde{\mathcal{O}}(\sqrt{T})$ regret even when the losses are adversarial, where $T$ is the number of episodes. More generally, it achieves $\tilde{\mathcal{O}}(\sqrt{C})$ regret in an intermediate setting where the losses are corrupted by a total amount of $C$. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.
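To make the abstract's ingredients concrete, the following is a rough sketch, with placeholder symbols rather than the paper's exact quantities: (i) the generic Follow-the-Regularized-Leader template over occupancy measures that the algorithm builds on, and (ii) the standard self-bounding argument by which a $\sqrt{T}$-type bound collapses to $\mathcal{O}(\log T)$ under stochastic losses. Here $\Omega$, $\hat{\ell}_\tau$, $\phi_t$, $x_t$, $c$, and $\Delta$ are illustrative placeholders; the paper's specific hybrid regularizer and the exact self-bounding quantity are not reproduced here.

(i) FTRL over the occupancy-measure polytope $\Omega$ induced by the known transition, with loss estimators $\hat{\ell}_\tau$ and a (possibly time-varying) regularizer $\phi_t$:
\[
\hat{q}_t \;=\; \operatorname*{arg\,min}_{q \in \Omega}\ \Big\langle q,\ \textstyle\sum_{\tau < t}\hat{\ell}_\tau \Big\rangle \;+\; \phi_t(q).
\]

(ii) Self-bounding sketch: suppose the analysis yields, for some nonnegative quantities $x_t$ and a constant $c$,
\[
\mathrm{Reg}_T \;\le\; \sum_{t=1}^{T}\sqrt{\frac{c\,x_t}{t}},
\qquad\text{while stochastic losses with gap } \Delta \text{ imply}\qquad
\mathrm{Reg}_T \;\ge\; \Delta\sum_{t=1}^{T} x_t .
\]
By AM--GM, $\sqrt{c\,x_t/t} \le \tfrac{1}{2}\Delta x_t + \tfrac{c}{2\Delta t}$, so summing over $t$ and using the second inequality,
\[
\mathrm{Reg}_T \;\le\; \tfrac{1}{2}\,\mathrm{Reg}_T \;+\; \frac{c\,(1+\ln T)}{2\Delta}
\quad\Longrightarrow\quad
\mathrm{Reg}_T \;\le\; \frac{c\,(1+\ln T)}{\Delta} \;=\; \mathcal{O}(\log T),
\]
whereas the first inequality alone already gives the worst-case $\tilde{\mathcal{O}}(\sqrt{T})$ bound when the $x_t$ are bounded.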