舞台上的保守线性匪徒

论文标题

舞台上的保守线性匪徒

Stage-wise Conservative Linear Bandits

论文作者

Moradipari, Ahmadreza, Thrampoulidis, Christos, Alizadeh, Mahnoosh

论文摘要

我们研究阶段的保守线性随机匪徒：土匪优化的实例，该实例是（未知）安全限制，这些安全限制出现在在线广告和医学试验等应用中。在每个阶段，学习者都必须选择不仅在整个时间范围内最大化累积奖励的动作，而且还可以进一步满足线性基线约束，从而在瞬时奖励上采用下限的形式。对于这个问题，我们提出了两种新颖的算法，即阶段的保守线性汤普森采样（SCLT）和阶段的保守线性UCB（SCLUCB），它们尊重基线约束并享受O（\ sqrt {t} {t} {t} \ log^{3/2} t）和O（分别。值得注意的是，所提出的算法只能通过较小的修改来调整，以解决不同的问题变化，例如用匪徒反馈的约束或基线动作的未知顺序。我们讨论了这些和其他对最新的改进。例如，与现有解决方案相比，我们表明SCLT在O（\ log {t}）时间最多扮演（非最佳）基线动作（与O（\ sqrt {t}相比））。最后，我们与另一种研究形式的安全限制形式建立联系，该形式采用了瞬时奖励的上限形式。尽管不能保证在每轮比赛中属于安全设置，但这对学习过程产生了额外的复杂性，但我们表明SCLUCB可以通过简单的修改在此设置中正确调整。

We study stage-wise conservative linear stochastic bandits: an instance of bandit optimization, which accounts for (unknown) safety constraints that appear in applications such as online advertising and medical trials. At each stage, the learner must choose actions that not only maximize cumulative reward across the entire time horizon but further satisfy a linear baseline constraint that takes the form of a lower bound on the instantaneous reward. For this problem, we present two novel algorithms, stage-wise conservative linear Thompson Sampling (SCLTS) and stage-wise conservative linear UCB (SCLUCB), that respect the baseline constraints and enjoy probabilistic regret bounds of order O(\sqrt{T} \log^{3/2}T) and O(\sqrt{T} \log T), respectively. Notably, the proposed algorithms can be adjusted with only minor modifications to tackle different problem variations, such as constraints with bandit-feedback, or an unknown sequence of baseline actions. We discuss these and other improvements over the state-of-the-art. For instance, compared to existing solutions, we show that SCLTS plays the (non-optimal) baseline action at most O(\log{T}) times (compared to O(\sqrt{T})). Finally, we make connections to another studied form of safety constraints that takes the form of an upper bound on the instantaneous reward. While this incurs additional complexity to the learning process as the optimal action is not guaranteed to belong to the safe set at each round, we show that SCLUCB can properly adjust in this setting via a simple modification.

下载PDF全文

下载文献需遵守相关版权规定

论文标题