Paper Title
Improved Algorithms for Conservative Exploration in Bandits
Paper Authors
Paper Abstract
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
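For concreteness, the conservative constraint described in the abstract is commonly formalized in the conservative bandit literature roughly as follows; this is a sketch in standard notation, and the specific symbols ($\theta^\star$, $x_{s,a}$, $\alpha$) are assumptions for illustration rather than notation taken from this listing:

\[
\sum_{s=1}^{t} \langle x_{s, a_s}, \theta^\star \rangle \;\ge\; (1-\alpha) \sum_{s=1}^{t} \langle x_{s, b_s}, \theta^\star \rangle
\qquad \text{for all } t = 1, \dots, T,
\]

where $a_s$ is the action chosen by the learner at step $s$, $b_s$ is the action the baseline policy would have taken, $\theta^\star$ is the unknown reward parameter of the linear bandit, and $\alpha \in (0,1)$ specifies the fraction of baseline performance the learner is allowed to lose. Under the checkpoint relaxation mentioned at the end of the abstract, the inequality would only need to hold at a predefined set of steps rather than at every step $t$.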