Paper Title
Stochastic Linear Bandits with Protected Subspace
Paper Authors
Paper Abstract
We study a variant of the stochastic linear bandit problem wherein we optimize a linear objective function, but rewards are accrued only orthogonal to an unknown subspace (which we interpret as a \textit{protected space}), given only zero-order stochastic oracle access to both the objective itself and the protected subspace. In particular, at each round, the learner must choose whether to query the objective or the protected subspace alongside choosing an action. Our algorithm, derived from the OFUL principle, uses some of the queries to get an estimate of the protected space, and (in almost all rounds) plays optimistically with respect to a confidence set for this space. We provide an $\tilde{O}(sd\sqrt{T})$ regret upper bound in the case where the action space is the complete unit ball in $\mathbb{R}^d$, $s < d$ is the dimension of the protected subspace, and $T$ is the time horizon. Moreover, we demonstrate that a discrete action space can lead to linear regret with an optimistic algorithm, reinforcing the sub-optimality of optimism in certain settings. We also show that protection constraints imply that, for certain settings, no consistent algorithm can have a regret smaller than $\Omega(T^{3/4})$. Finally, we empirically validate our results with synthetic and real datasets.
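To make the setup concrete, below is a minimal simulation sketch (in Python) of the kind of query-splitting strategy the abstract describes: a phase of protected-subspace queries followed by play in the estimated orthogonal complement. All names, constants, and the query split are illustrative assumptions, and the sketch plays greedily, whereas the paper's actual algorithm is OFUL-based and plays optimistically with respect to a confidence set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem instance (dimensions, noise level, etc. are assumptions).
d, s, T, noise = 8, 2, 5000, 0.1
theta = rng.normal(size=d)                    # unknown objective direction
U = np.linalg.qr(rng.normal(size=(d, s)))[0]  # orthonormal basis of the protected subspace
P_perp = np.eye(d) - U @ U.T                  # projector onto its orthogonal complement

def oracle(x, protected):
    """Zero-order stochastic oracle: either a noisy protected-subspace
    coordinate of x, or the noisy reward (accrued only orthogonal to U)."""
    if protected:
        j = rng.integers(s)
        return j, x @ U[:, j] + noise * rng.normal()
    return x @ (P_perp @ theta) + noise * rng.normal()

# Ridge-regression statistics for the objective and each protected direction.
lam = 1.0
V_obj, b_obj = lam * np.eye(d), np.zeros(d)
V_pro = [lam * np.eye(d) for _ in range(s)]
b_pro = [np.zeros(d) for _ in range(s)]

rewards = []
n_explore = s * int(np.sqrt(T))  # schematic query split; the paper's differs
for t in range(1, T + 1):
    if t <= n_explore:
        # Query the protected subspace with a random unit-ball action.
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        j, y = oracle(x, protected=True)
        V_pro[j] += np.outer(x, x)
        b_pro[j] += y * x
    else:
        # Estimate the protected subspace, project the objective estimate
        # onto its complement, and play greedily (the paper uses optimism).
        U_hat = np.column_stack([np.linalg.solve(V_pro[j], b_pro[j]) for j in range(s)])
        U_hat = np.linalg.qr(U_hat)[0]
        g = (np.eye(d) - U_hat @ U_hat.T) @ np.linalg.solve(V_obj, b_obj)
        x = g if np.linalg.norm(g) > 1e-9 else rng.normal(size=d)
        x /= np.linalg.norm(x)
        y = oracle(x, protected=False)
        V_obj += np.outer(x, x)
        b_obj += y * x
        rewards.append(y)

opt = np.linalg.norm(P_perp @ theta)  # best mean reward over the unit ball
print(f"mean reward {np.mean(rewards):.3f} vs optimum {opt:.3f}")
```

The mean reward of the exploit rounds should approach $\|P_\perp\theta\|$, the best achievable per-round reward over the unit ball; the gap between them over $T$ rounds is the regret that the paper's bounds control.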