具有复合和匿名反馈的多臂匪徒的自适应算法

论文标题

具有复合和匿名反馈的多臂匪徒的自适应算法

Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback

论文作者

Wang, Siwei, Wang, Haoyun, Huang, Longbo

论文摘要

我们研究了复合和匿名反馈的多臂强盗（MAB）问题。在此模型中，拉动手臂的奖励在一段时间内（我们称这段时间为奖励间隔），并且玩家获得了动作的部分奖励，并依次从其他手臂拉动了其他手臂的奖励。该模型的现有结果需要有关奖励间隔大小的事先了解，以作为其算法的输入。在本文中，我们建议对随机案例和对抗性案例提出自适应算法，而无需任何有关奖励间隔的事先信息。对于随机情况，我们证明我们的算法保证了与下限（按顺序）相匹配的遗憾。对于对抗情况，我们提出了第一种算法，以共同处理非合理的对手和未知的奖励间隔大小。我们还基于现实世界数据集进行了模拟。结果表明，我们的算法优于现有基准。

We study the multi-armed bandit (MAB) problem with composite and anonymous feedback. In this model, the reward of pulling an arm spreads over a period of time (we call this period as reward interval) and the player receives partial rewards of the action, convoluted with rewards from pulling other arms, successively. Existing results on this model require prior knowledge about the reward interval size as an input to their algorithms. In this paper, we propose adaptive algorithms for both the stochastic and the adversarial cases, without requiring any prior information about the reward interval. For the stochastic case, we prove that our algorithm guarantees a regret that matches the lower bounds (in order). For the adversarial case, we propose the first algorithm to jointly handle non-oblivious adversary and unknown reward interval size. We also conduct simulations based on real-world dataset. The results show that our algorithms outperform existing benchmarks.

下载PDF全文

下载文献需遵守相关版权规定

论文标题