Paper Title
Regret Bounds for Safe Gaussian Process Bandit Optimization
Paper Authors
Paper Abstract
Many applications require a learner to make sequential decisions given uncertainty regarding both the system's payoff function and its safety constraints. In safety-critical systems, it is paramount that the learner's actions do not violate the safety constraints at any stage of the learning process. In this paper, we study a stochastic bandit optimization problem where the unknown payoff and constraint functions are sampled from Gaussian Processes (GPs), a setting first considered in [Srinivas et al., 2010]. We develop a safe variant of GP-UCB called SGP-UCB, with the modifications necessary to respect safety constraints at every round. The algorithm has two distinct phases. The first phase seeks to estimate the set of safe actions in the decision set, while the second phase follows the GP-UCB decision rule. Our main contribution is to derive the first sub-linear regret bounds for this problem. We numerically compare SGP-UCB against existing safe Bayesian GP optimization algorithms.
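To make the two-phase structure concrete, below is a minimal numerical sketch in Python of a safe GP bandit loop in the spirit of SGP-UCB. Everything specific here is an illustrative assumption rather than the paper's exact algorithm: the RBF kernel, the safety condition g(x) >= 0, the confidence width beta, the phase lengths T1 and T, and the example payoff/constraint functions are all hypothetical choices made for the sketch.

```python
# A minimal sketch of a two-phase safe GP bandit loop (SGP-UCB-style).
# All kernel, threshold, and schedule choices below are illustrative
# assumptions, not the paper's exact settings.
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ls=0.2):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_posterior(X_obs, y_obs, X_query, noise=1e-2):
    """Standard GP posterior mean and standard deviation at X_query."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_query, X_obs)
    alpha = np.linalg.solve(K, y_obs)
    mu = Ks @ alpha
    V = np.linalg.solve(K, Ks.T)
    # k(x, x) = 1 for the RBF kernel, so the prior variance is 1.
    var = np.clip(1.0 - np.einsum("ij,ji->i", Ks, V), 1e-12, None)
    return mu, np.sqrt(var)

# Finite decision set on [0, 1]; f is the unknown payoff, g the constraint.
# An action x is deemed safe iff g(x) >= 0 (threshold is an assumption).
X = np.linspace(0, 1, 200).reshape(-1, 1)
f = lambda x: np.sin(6 * x).ravel()            # hypothetical payoff
g = lambda x: 0.5 - np.abs(x - 0.4).ravel()    # hypothetical constraint

seed = np.array([[0.4]])                       # one action known a priori to be safe
X_obs = [seed[0]]
y_f = [f(seed).item()]
y_g = [g(seed).item()]

beta = 2.0        # confidence width; a tuning choice, not the paper's schedule
T, T1 = 60, 30    # horizon and (assumed) length of the safe-set-estimation phase

for t in range(T):
    A = np.vstack(X_obs)
    mu_g, sd_g = gp_posterior(A, np.array(y_g), X)
    safe = mu_g - beta * sd_g >= 0             # pessimistically certified safe set
    if t < T1:
        # Phase 1: grow the certified safe set by querying the safe action
        # whose constraint value is currently most uncertain.
        idx = np.where(safe, sd_g, -np.inf).argmax()
    else:
        # Phase 2: GP-UCB on the payoff, restricted to certified-safe actions.
        mu_f, sd_f = gp_posterior(A, np.array(y_f), X)
        idx = np.where(safe, mu_f + beta * sd_f, -np.inf).argmax()
    x = X[idx : idx + 1]
    X_obs.append(x[0])
    y_f.append(f(x).item() + 0.01 * rng.standard_normal())  # noisy payoff feedback
    y_g.append(g(x).item() + 0.01 * rng.standard_normal())  # noisy constraint feedback

print("final certified-safe set size:", int(safe.sum()))
print("best certified-safe action:", X[np.where(safe, f(X), -np.inf).argmax()].item())
```

The pessimistic lower confidence bound mu_g - beta * sd_g is what keeps every played action safe with high probability, while the optimistic upper bound mu_f + beta * sd_f drives GP-UCB exploration of the payoff in the second phase.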