Paper Title
Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms
Paper Authors
Paper Abstract
We generalize the multiple-play multi-armed bandits (MP-MAB) problem to a shareable arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it equals the "per-load" reward multiplied by either the number of plays pulling the arm or, when the number of plays exceeds the capacity limit, the arm's reward capacity. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound for learning the capacity from load-dependent rewards, as well as a regret lower bound for this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound is the same as that of the regret lower bound, and its second and third terms also clearly correspond to terms in the lower bound. Extensive experiments validate our algorithm's performance and its gain in 5G & 4G base station selection.
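To make the load-dependent reward model concrete, here is a minimal sketch (not from the paper) of how a reward sample from one shareable arm could be simulated: the total reward is the "per-load" Gaussian draw scaled by the number of plays, capped at the arm's capacity. All function names, parameters, and numeric values are illustrative assumptions.

```python
import numpy as np

def shareable_arm_reward(n_plays, capacity, mu, sigma, rng):
    """Sample one load-dependent reward from a shareable arm.

    The per-load reward is Gaussian with mean `mu` and std `sigma`;
    the total reward scales with the effective load, i.e. the number
    of plays capped at the arm's capacity (unknown to the learner).
    """
    effective_load = min(n_plays, capacity)   # reward saturates at the capacity limit
    per_load_reward = rng.normal(mu, sigma)   # one "per-load" Gaussian draw
    return effective_load * per_load_reward

# Example: an arm with capacity 3 yields no extra reward beyond 3 plays.
rng = np.random.default_rng(0)
for n in (1, 2, 3, 5):
    r = shareable_arm_reward(n_plays=n, capacity=3, mu=0.8, sigma=0.1, rng=rng)
    print(f"{n} plays -> reward {r:.3f}")
```

In this toy model, the saturation at the capacity is exactly what makes the capacity learnable from load-dependent rewards but also what drives the sample complexity lower bound discussed in the abstract.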