Paper Title
Contextual Blocking Bandits
Paper Authors
Paper Abstract
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed and known number of future time steps. The above contextual setting, which captures important scenarios such as recommendation systems or ad placement with diverse users, invalidates greedy solution techniques that are effective for its non-contextual counterpart (Basu et al., NeurIPS19). Assuming knowledge of the context distribution and the mean reward of each arm-context pair, we cast the problem as an online bipartite matching problem, where the right-vertices (contexts) arrive stochastically and the left-vertices (arms) are blocked for a finite number of rounds each time they are matched. This problem has recently been studied in the full-information case, where competitive ratio bounds have been derived. We focus on the bandit setting, where the reward distributions are initially unknown; we propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$ regret w.r.t. an $\alpha$-optimal strategy in $T$ time steps, matching the $\Omega(\log T)$ regret lower bound in this setting. Due to the time correlations caused by blocking, existing techniques for upper bounding regret fail. For proving our regret bounds, we introduce the novel concepts of delayed exploitation and opportunistic subsampling and combine them with ideas from combinatorial bandits and coupling of non-stationary Markov chains.
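To make the setting concrete, the following is a minimal Python sketch of a contextual blocking bandit environment: contexts are sampled i.i.d. from a known distribution, playing an arm blocks it for a fixed known number of rounds, and rewards for each arm-context pair are drawn from unknown distributions. The policy shown is a simple per-context UCB index rule over the currently available arms, intended only to illustrate the dynamics; it is not the paper's matching-based, $\alpha$-optimal algorithm, and all instance values (numbers of arms and contexts, delays, Bernoulli rewards) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance (not from the paper):
# K arms, C contexts, known blocking delays, Bernoulli(mu[i, j]) rewards.
K, C, T = 4, 3, 10_000
mu = rng.uniform(0.1, 0.9, size=(K, C))        # unknown mean reward of arm i under context j
delays = rng.integers(2, 5, size=K)            # known blocking delay of each arm
context_dist = rng.dirichlet(np.ones(C))       # known context distribution

counts = np.zeros((K, C))                      # number of plays of each arm-context pair
sums = np.zeros((K, C))                        # cumulative observed rewards
free_at = np.zeros(K, dtype=int)               # first round at which each arm becomes available again
total_reward = 0.0

for t in range(1, T + 1):
    j = rng.choice(C, p=context_dist)          # context arrives i.i.d. from the known distribution
    available = np.flatnonzero(free_at <= t)   # arms not currently blocked
    if available.size == 0:
        continue                               # skip the round if every arm is blocked
    # UCB index for the observed context; unplayed pairs get an infinite index.
    n = counts[available, j]
    means = np.divide(sums[available, j], n, out=np.zeros_like(n), where=n > 0)
    ucb = np.where(n > 0, means + np.sqrt(2 * np.log(t) / np.maximum(n, 1)), np.inf)
    i = available[np.argmax(ucb)]
    reward = float(rng.random() < mu[i, j])    # Bernoulli reward draw
    counts[i, j] += 1
    sums[i, j] += reward
    free_at[i] = t + delays[i]                 # playing arm i blocks it for delays[i] rounds
    total_reward += reward

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```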