Paper Title

Adapting to Delays and Data in Adversarial Multi-Armed Bandits

Paper Authors

András György, Pooria Joulani

Paper Abstract

We consider the adversarial multi-armed bandit problem under delayed feedback. We analyze variants of the Exp3 algorithm that tune their step-size using only information (about the losses and delays) available at the time of the decisions, and obtain regret guarantees that adapt to the observed (rather than the worst-case) sequences of delays and/or losses. First, through a remarkably simple proof technique, we show that with proper tuning of the step size, the algorithm achieves an optimal (up to logarithmic factors) regret of order $\sqrt{\log(K)(TK + D)}$ both in expectation and in high probability, where $K$ is the number of arms, $T$ is the time horizon, and $D$ is the cumulative delay. The high-probability version of the bound, which is the first high-probability delay-adaptive bound in the literature, crucially depends on the use of implicit exploration in estimating the losses. Then, following Zimmert and Seldin [2019], we extend these results so that the algorithm can "skip" rounds with large delays, resulting in regret bounds of order $\sqrt{TK\log(K)} + |R| + \sqrt{D_{\bar{R}}\log(K)}$, where $R$ is an arbitrary set of rounds (which are skipped) and $D_{\bar{R}}$ is the cumulative delay of the feedback for other rounds. Finally, we present another, data-adaptive (AdaGrad-style) version of the algorithm for which the regret adapts to the observed (delayed) losses instead of only adapting to the cumulative delay (this algorithm requires an a priori upper bound on the maximum delay, or the advance knowledge of the delay for each decision when it is made). The resulting bound can be orders of magnitude smaller on benign problems, and it can be shown that the delay only affects the regret through the loss of the best arm.
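
To make the setting concrete, below is a minimal sketch of a delayed-feedback Exp3 variant with implicit exploration and a step size computed only from the feedback observed so far, in the spirit of the algorithm the abstract describes. The function name `delayed_exp3_ix`, the environment interface (`losses` and `delays` arrays), and the particular step-size schedule are illustrative assumptions, not the paper's exact tuning; the skipping and AdaGrad-style variants are not sketched here.

```python
# A hedged sketch of delay-adaptive Exp3 with implicit exploration.
# The exact step-size schedule below is an assumption for illustration.

import numpy as np


def delayed_exp3_ix(losses, delays, K, gamma=0.05, rng=None):
    """Run a delay-adaptive Exp3 variant with implicit exploration.

    losses : (T, K) array of losses in [0, 1] chosen by the adversary.
    delays : length-T array; feedback for round t arrives delays[t] rounds later.
    K      : number of arms.
    gamma  : implicit-exploration parameter (assumed constant here).
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(delays)
    cum_loss_est = np.zeros(K)   # cumulative loss estimates received so far
    pending = {}                 # arrival round -> list of (round, arm, prob, loss)
    obs_delay = 0                # cumulative delay of feedback observed so far
    n_obs = 0                    # number of feedbacks observed so far
    total_loss = 0.0

    for t in range(T):
        # Step size uses only information available at decision time:
        # the number of observed feedbacks and their cumulative delay
        # (assumed schedule, mirroring the sqrt(log(K)/(tK + D)) scaling).
        eta = np.sqrt(np.log(K) / (K * (n_obs + 1) + obs_delay + 1.0))

        # Exponential-weights distribution over the received loss estimates.
        w = np.exp(-eta * (cum_loss_est - cum_loss_est.min()))
        p = w / w.sum()

        arm = rng.choice(K, p=p)
        total_loss += losses[t, arm]

        # Schedule the (delayed) feedback for this round.
        arrival = t + int(delays[t])
        pending.setdefault(arrival, []).append((t, arm, p[arm], losses[t, arm]))

        # Process all feedback arriving at the end of this round.
        for (s, a, prob, loss) in pending.pop(t, []):
            # Implicit exploration: bias the importance-weighted estimate
            # downward by adding gamma to the denominator.
            cum_loss_est[a] += loss / (prob + gamma)
            obs_delay += t - s
            n_obs += 1

    return total_loss
```

The `gamma > 0` term in the loss estimates implements implicit exploration, which the abstract identifies as crucial for the high-probability delay-adaptive bound; setting `gamma = 0` recovers the standard importance-weighted estimator.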
