Paper Title


Versatile Dueling Bandits: Best-of-both-World Analyses for Online Learning from Preferences

Paper Authors

Aadirupa Saha, Pierre Gaillard

Paper Abstract


We study the problem of $K$-armed dueling bandits in both stochastic and adversarial environments, where the goal of the learner is to aggregate information through relative preferences over pairs of decision points queried in an online sequential manner. We first propose a novel reduction from any (general) dueling bandit problem to multi-armed bandits which, despite its simplicity, allows us to improve many existing results for dueling bandits. In particular, \emph{we give the first best-of-both-worlds result for the dueling bandits regret minimization problem} -- a unified framework that is guaranteed to perform optimally for both stochastic and adversarial preferences simultaneously. Moreover, our algorithm is also the first to achieve an optimal $O(\sum_{i = 1}^K \frac{\log T}{\Delta_i})$ regret bound against the Condorcet-winner benchmark, which scales optimally both in the number of arms $K$ and the instance-specific suboptimality gaps $\{\Delta_i\}_{i = 1}^K$. This resolves the long-standing open problem of designing an instance-wise, gap-dependent, order-optimal regret algorithm for dueling bandits (with matching lower bounds up to small constant factors). We further justify the robustness of our proposed algorithm by proving its optimal regret rate under adversarially corrupted preferences -- this outperforms the existing state-of-the-art corrupted dueling bandit results by a large margin. In summary, we believe our reduction idea will find broader scope in solving a diverse class of dueling bandit settings, which are otherwise studied separately from multi-armed bandits, often with more complex solutions and weaker guarantees. The efficacy of our proposed algorithms is empirically corroborated against existing dueling bandit methods.
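The abstract does not spell out the reduction itself, so the toy sketch below only illustrates the general flavor of driving a single multi-armed bandit learner (here a plain EXP3) with dueling feedback: both duel arms are sampled from the learner's distribution, and each sampled arm is fed an importance-weighted 0/1 loss derived from the duel outcome. The function name `exp3_dueling`, the learning rate, and all other details are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

def exp3_dueling(P, T=2000, seed=0):
    """Toy sketch: drive one EXP3 learner with dueling feedback.

    NOTE: illustrative assumption, not the paper's exact reduction.
    P[i, j] is the probability that arm i beats arm j (P[i, i] = 0.5).
    Each round both duel arms are drawn i.i.d. from EXP3's distribution,
    and each receives an importance-weighted loss (losing a duel = loss 1).
    """
    rng = np.random.default_rng(seed)
    K = P.shape[0]
    eta = np.sqrt(np.log(K) / (K * T))    # standard EXP3 learning rate
    L = np.zeros(K)                       # cumulative importance-weighted losses
    counts = np.zeros(K, dtype=int)       # how often each arm was dueled
    for _ in range(T):
        w = np.exp(-eta * (L - L.min()))  # shift by min for numerical stability
        p = w / w.sum()
        i, j = rng.choice(K, size=2, p=p)  # sample both duel arms from p
        o = float(rng.random() < P[i, j])  # o = 1 iff arm i wins the duel
        L[i] += (1.0 - o) / p[i]           # unbiased loss estimate for arm i
        L[j] += o / p[j]                   # unbiased loss estimate for arm j
        counts[i] += 1
        counts[j] += 1
    return counts, p
```

On an instance with a clear Condorcet winner (say arm 0 beats every other arm with probability 0.9), this single EXP3 copy quickly concentrates its play on arm 0, which is the qualitative behavior the reduction-based analysis formalizes.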
