Paper Title

Residual Bootstrap Exploration for Bandit Algorithms

Authors

Chi-Hua Wang, Yang Yu, Botao Hao, Guang Cheng

Abstract

In this paper, we propose a novel perturbation-based exploration method in bandit algorithms with bounded or unbounded rewards, called residual bootstrap exploration (\texttt{ReBoot}). The \texttt{ReBoot} enforces exploration by injecting data-driven randomness through a residual-based perturbation mechanism. This novel mechanism captures the underlying distributional properties of fitting errors, and more importantly boosts exploration to escape from suboptimal solutions (for small sample sizes) by inflating the variance level in an \textit{unconventional} way. In theory, with an appropriate variance inflation level, \texttt{ReBoot} provably secures instance-dependent logarithmic regret in Gaussian multi-armed bandits. We evaluate \texttt{ReBoot} in different synthetic multi-armed bandit problems and observe that \texttt{ReBoot} performs better for unbounded rewards and more robustly than \texttt{Giro} \cite{kveton2018garbage} and \texttt{PHE} \cite{kveton2019perturbed}, with computational efficiency comparable to the Thompson sampling method.
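To make the perturbation mechanism described in the abstract concrete, below is a minimal Python sketch of residual-bootstrap exploration for a multi-armed bandit. The helper names (reboot_index, choose_arm), the Gaussian bootstrap weights, and the exact pseudo-residual form used for variance inflation are illustrative assumptions for this sketch, not the paper's precise algorithm.

```python
import numpy as np


def reboot_index(rewards, sigma_a=1.0, rng=None):
    """Perturbed-mean index for one arm via residual bootstrap (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    s = rewards.size
    mean = rewards.mean()
    residuals = rewards - mean  # fitting errors of the sample-mean model
    # Variance inflation: append two pseudo-residuals of magnitude sigma_a * sqrt(s)
    # so the perturbation stays wide enough at small sample sizes (assumed form).
    pseudo = np.array([sigma_a * np.sqrt(s), -sigma_a * np.sqrt(s)])
    all_res = np.concatenate([residuals, pseudo])
    # Residual bootstrap: reweight the residuals with random Gaussian multipliers.
    weights = rng.standard_normal(all_res.size)
    return mean + (weights @ all_res) / s


def choose_arm(history, sigma_a=1.0, rng=None):
    """Play each arm once, then pull the arm with the largest perturbed index."""
    rng = np.random.default_rng() if rng is None else rng
    indices = []
    for a, obs in enumerate(history):  # history[a] = list of rewards seen for arm a
        if len(obs) == 0:
            return a  # unpulled arm: explore it first
        indices.append(reboot_index(obs, sigma_a, rng))
    return int(np.argmax(indices))
```

For example, with history = [[0.3, 0.7], [1.2], []], choose_arm(history) returns 2 since arm 2 has not been pulled yet; once every arm has data, the agent greedily pulls the arm with the largest residual-bootstrap-perturbed mean.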
