Title
Continuous-Time Multi-Armed Bandits with Controlled Restarts
Authors
Abstract
Time-constrained decision processes are ubiquitous in many fundamental applications in physics, biology, and computer science. Recently, restart strategies have gained significant attention for boosting the efficiency of time-constrained processes by expediting completion times. In this work, we investigate the bandit problem with controlled restarts for time-constrained decision processes, and develop provably good learning algorithms. In particular, we consider a bandit setting where each decision takes a random completion time and yields a random, correlated reward at the end, with unknown values at the time of decision. The goal of the decision-maker is to maximize the expected total reward subject to a time constraint $\tau$. As an additional control, we allow the decision-maker to interrupt an ongoing task and forgo its reward in favor of a potentially more rewarding alternative. For this problem, we develop efficient online learning algorithms with $O(\log(\tau))$ and $O(\sqrt{\tau\log(\tau)})$ regret for finite and continuous action spaces of restart strategies, respectively. We demonstrate the applicability of our algorithm by using it to boost the performance of SAT solvers.
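To give intuition for why controlled restarts can pay off, the following is a minimal sketch (not the paper's algorithm): when completion times are heavy-tailed, abandoning a run that exceeds a cutoff and restarting can increase the number of tasks completed within a fixed budget $\tau$. The duration distribution, cutoff value, and function names here are illustrative assumptions.

```python
import random

random.seed(0)

def run_time():
    # Hypothetical heavy-tailed task duration: usually fast, occasionally very slow.
    return 1.0 if random.random() < 0.9 else 100.0

def completions(budget, cutoff=float("inf")):
    """Count tasks completed within the time budget when any run
    exceeding `cutoff` is abandoned (its reward forgone) and restarted."""
    t, done = 0.0, 0
    while t < budget:
        d = run_time()
        if d <= cutoff:
            t += d      # task completed
            done += 1
        else:
            t += cutoff  # interrupt the ongoing task and restart
    return done

no_restart = completions(10_000)             # never interrupt
with_restart = completions(10_000, cutoff=2.0)  # restart slow runs early
```

Here the restart policy completes far more tasks per unit time, since the expected time per completion drops from $0.9 \cdot 1 + 0.1 \cdot 100 = 10.9$ to roughly $(0.9 \cdot 1 + 0.1 \cdot 2)/0.9 \approx 1.22$; the learning problem studied in the paper is to find such a cutoff online when the distributions are unknown.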