Paper Title

MOTS: Minimax Optimal Thompson Sampling

Authors

Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, Quanquan Gu

Abstract

Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity in implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can match the minimax lower bound $\Omega(\sqrt{KT})$ for $K$-armed bandit problems, where $T$ is the total time horizon. In this paper, we solve this long-standing open problem by proposing a variant of Thompson sampling called MOTS that adaptively clips the sampling instance of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$, as well as the asymptotically optimal regret bound for Gaussian rewards as $T$ approaches infinity. To our knowledge, MOTS is the first Thompson sampling type algorithm that achieves minimax optimality for multi-armed bandit problems.
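The abstract only describes the key mechanism at a high level: each arm's posterior sample is clipped before the arms are compared. The sketch below illustrates one plausible reading of that idea for Gaussian rewards; the `bandit_pull` interface, the posterior variance scaling `rho`, the bonus constant `alpha`, and the exact form of the clipping threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mots_sketch(bandit_pull, K, T, rho=2.0, alpha=4.0, rng=None):
    """Hypothetical sketch of clipped (Thompson-sampling-style) arm selection
    for a K-armed bandit over horizon T.

    bandit_pull(i) should return a stochastic reward for arm i.
    rho and alpha are illustrative tuning constants, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.zeros(K)   # number of pulls per arm
    means = np.zeros(K)    # empirical mean reward per arm
    total_reward = 0.0

    # Pull every arm once so the empirical means are defined.
    for i in range(K):
        r = bandit_pull(i)
        counts[i] += 1
        means[i] = r
        total_reward += r

    for t in range(K, T):
        # Gaussian posterior-style sample for each arm.
        theta = rng.normal(means, np.sqrt(1.0 / (rho * counts)))
        # Assumed clipping threshold: empirical mean plus an exploration bonus
        # that shrinks as an arm is pulled more often (log^+(x) = max(log x, 0)).
        bonus = np.sqrt(alpha / counts * np.maximum(np.log(T / (K * counts)), 0.0))
        theta_clipped = np.minimum(theta, means + bonus)

        i = int(np.argmax(theta_clipped))
        r = bandit_pull(i)
        total_reward += r
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update

    return total_reward

# Example usage with a toy 3-armed Bernoulli bandit (hypothetical):
# rng = np.random.default_rng(0)
# total = mots_sketch(lambda i: rng.binomial(1, [0.1, 0.5, 0.9][i]), K=3, T=10_000)
```

Intuitively, clipping caps how far an arm's sample can exceed its empirical mean, which limits over-optimistic exploration; this is the kind of worst-case control the minimax analysis in the paper is concerned with.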
