Title

Learning by Repetition: Stochastic Multi-armed Bandits under Priming Effect

Authors

Priyank Agrawal, Theja Tulabandhula

Abstract

We study the effect of persistence of engagement on learning in a stochastic multi-armed bandit setting. In advertising and recommendation systems, the repetition effect includes a wear-in period, where the user's propensity to reward the platform via a click or purchase depends on how frequently they have seen the recommendation in the recent past. It also includes a counteracting wear-out period, where the user's propensity to respond positively is dampened if the recommendation was shown too many times recently. The priming effect can be naturally modeled as a temporal constraint on the strategy space, since the reward for the current action depends on historical actions taken by the platform. We provide novel algorithms that achieve regret sublinear in time and in the relevant wear-in/wear-out parameters. The effect of priming on the regret upper bound is also additive, and when there is no priming effect we recover guarantees matching popular algorithms such as UCB1 and Thompson sampling. Our work complements recent work on modeling time-varying rewards, delays, and corruptions in bandits, and extends the use of rich behavioral models in sequential decision-making settings.
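The wear-in/wear-out gating described in the abstract can be made concrete with a toy simulation. The sketch below is an illustrative assumption, not the paper's exact model: an arm pays out its stochastic reward only when its pull count over a sliding window of the last `m` rounds lies in a band `[d_in, d_out]`. The class `PrimedBernoulliArm` and the parameters `m`, `d_in`, and `d_out` are hypothetical names introduced here for illustration.

```python
import random
from collections import deque


class PrimedBernoulliArm:
    """Bernoulli arm gated by a toy priming window.

    Illustrative assumption (not the paper's exact model): the arm
    responds only if it was pulled between d_in (wear-in) and
    d_out (wear-out) times within the last m rounds.
    """

    def __init__(self, mean, m=10, d_in=2, d_out=6):
        self.mean = mean                 # base click/purchase probability
        self.d_in = d_in                 # minimum recent pulls to "warm up"
        self.d_out = d_out               # maximum recent pulls before fatigue
        self.history = deque(maxlen=m)   # 1 if pulled in that round, else 0

    def record_round(self, pulled):
        """Append this round's outcome to the sliding window."""
        self.history.append(1 if pulled else 0)

    def pull(self):
        """Return a stochastic reward, dampened to 0 outside the priming band."""
        recent = sum(self.history)
        if not (self.d_in <= recent <= self.d_out):
            return 0.0                   # wear-in not met, or worn out
        return 1.0 if random.random() < self.mean else 0.0


# Minimal usage: a uniformly random policy, just to exercise the model.
arms = [PrimedBernoulliArm(0.3), PrimedBernoulliArm(0.7)]
total = 0.0
for t in range(1000):
    k = random.randrange(len(arms))
    total += arms[k].pull()
    for i, arm in enumerate(arms):
        arm.record_round(i == k)
print(f"average reward: {total / 1000:.3f}")
```

Under this toy gating, a policy that spreads its pulls too thinly never clears the wear-in threshold, while one that hammers a single arm trips wear-out, which is why priming acts as a temporal constraint on the strategy space rather than a change to the arms' base means.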
