Paper Title
Problem-Complexity Adaptive Model Selection for Stochastic Linear Bandits
Paper Authors
Paper Abstract
We consider the problem of model selection for two popular stochastic linear bandit settings, and propose algorithms that adapt to the unknown problem complexity. In the first setting, we consider the $K$-armed mixture bandits, where the mean reward of arm $i \in [K]$ is $\mu_i + \langle \alpha_{i,t}, \theta^* \rangle$, with $\alpha_{i,t} \in \mathbb{R}^d$ being the known context vector, and $\mu_i \in [-1,1]$ and $\theta^*$ being unknown parameters. We define $\|\theta^*\|$ as the problem complexity and consider a sequence of nested hypothesis classes, each positing a different upper bound on $\|\theta^*\|$. Exploiting this, we propose Adaptive Linear Bandit (ALB), a novel phase-based algorithm that adapts to the true problem complexity, $\|\theta^*\|$. We show that ALB achieves regret scaling of $O(\|\theta^*\|\sqrt{T})$, where $\|\theta^*\|$ is a priori unknown. As a corollary, when $\theta^* = 0$, ALB recovers the minimax regret of the simple bandit algorithm without such knowledge of $\theta^*$. ALB is the first algorithm that uses the parameter norm as the model selection criterion for linear bandits. Prior state-of-the-art algorithms \cite{osom} achieve a regret of $O(L\sqrt{T})$, where $L$ is an upper bound on $\|\theta^*\|$ supplied as an input to the algorithm. In the second setting, we consider the standard linear bandit problem (with possibly an infinite number of arms), where the sparsity of $\theta^*$, denoted by $d^* \leq d$, is unknown to the algorithm. Defining $d^*$ as the problem complexity, we show that ALB achieves $O(d^*\sqrt{T})$ regret, matching that of an oracle that knows the true sparsity level. This methodology is then extended to the case of finitely many arms, and similar results are proven. ALB is the first algorithm to achieve such model selection guarantees. We further validate our results via synthetic and real-data experiments.
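To make the phase-based adaptation concrete, below is a minimal sketch of the norm-adaptation idea the abstract describes: run the base linear-bandit routine in successively longer phases, and after each phase re-estimate $\|\theta^*\|$ from a least-squares fit, shrinking the hypothesis-class bound passed to the next phase. The phase schedule, confidence slack, and the `pull_arm` interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def alb_norm_adaptive(T, d, pull_arm, initial_bound=1.0, delta=0.1):
    """Sketch of phase-based norm adaptation (ALB-style).

    pull_arm(bound) is assumed to run one step of a base linear-bandit
    algorithm (e.g., an OFUL-style routine) under the given norm bound,
    returning the (context_vector, observed_reward) pair for that step.
    """
    t, phase = 0, 0
    bound = initial_bound          # current hypothesis-class bound on ||theta*||
    X, y = [], []                  # contexts and rewards observed so far
    while t < T:
        # Doubling phase lengths (assumed schedule, for illustration only)
        phase_len = min(2 ** phase * int(np.sqrt(T)), T - t)
        for _ in range(phase_len):
            context, reward = pull_arm(bound)
            X.append(context)
            y.append(reward)
            t += 1
        # Least-squares estimate of theta* from all data collected so far
        A, b = np.array(X), np.array(y)
        theta_hat = np.linalg.lstsq(A, b, rcond=None)[0]
        # Refine the bound: norm estimate plus a confidence slack that
        # decays with sample size (constants chosen only for illustration)
        slack = np.sqrt(d * np.log(1.0 / delta) / len(y))
        bound = np.linalg.norm(theta_hat) + slack
        phase += 1
    return bound
```

Because the slack term shrinks with the number of samples, the bound fed to the base algorithm converges toward $\|\theta^*\|$, which is the mechanism by which the regret scales with the true complexity $\|\theta^*\|$ rather than a worst-case input bound $L$.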