Paper Title
Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms
Paper Authors
Paper Abstract
Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction $α < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining fraction of good users share the same instance of contextual bandits with $S$ contexts and $A$ actions (items). Naturally, whether a user is good or adversarial is not known in advance. The goal is to robustly learn the policy that maximizes rewards for the good users with as few user interactions as possible. Without adversarial users, established results in collaborative filtering show that $O(1/ε^2)$ per-user interactions suffice to learn a good policy, precisely because information can be shared across users. This parallelization gain is fundamentally altered by the presence of adversarial users: unless there is a super-polynomial number of users, we show a lower bound of $\tilde{Ω}(\min(S,A) \cdot α^2 / ε^2)$ {\it per-user} interactions to learn an $ε$-optimal policy for the good users. We then show that we can achieve an $\tilde{O}(\min(S,A) \cdot α / ε^2)$ upper bound by employing efficient robust mean estimators for both univariate and high-dimensional random variables. We also show that this can be improved depending on the distributions of contexts.
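The upper bound described in the abstract hinges on aggregating rewards across users with a robust mean estimator, so that a small adversarial fraction cannot skew the estimates. The sketch below is only an illustrative toy, not the paper's algorithm: it assumes a standard univariate trimmed-mean estimator and hypothetical helper names (`trimmed_mean`, `estimate_rewards`), and aggregates per-(context, action) rewards reported by many users, an $α$ fraction of whom may report arbitrary values.

```python
import numpy as np

def trimmed_mean(samples, alpha):
    """Univariate robust mean (illustrative, not the paper's estimator):
    discard the alpha-fraction largest and smallest samples, average the rest.
    With at most an alpha fraction of arbitrarily corrupted samples, the
    corrupted points fall inside the trimmed tails and cannot move the estimate far."""
    samples = np.sort(np.asarray(samples, dtype=float))
    n = len(samples)
    k = int(np.ceil(alpha * n))           # points trimmed on each side
    if 2 * k >= n:
        return float(np.median(samples))  # fall back to the median for tiny n
    return float(samples[k:n - k].mean())

def estimate_rewards(user_rewards, alpha):
    """Aggregate per-user reward reports for every (context, action) pair.
    `user_rewards` has shape (num_users, S, A): good users report noisy rewards
    from the shared bandit instance, adversarial users may report anything."""
    num_users, S, A = user_rewards.shape
    est = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            est[s, a] = trimmed_mean(user_rewards[:, s, a], alpha)
    return est

# Toy usage: 100 users, S=5 contexts, A=4 actions, 10% adversarial users.
rng = np.random.default_rng(0)
true_means = rng.uniform(size=(5, 4))
rewards = true_means + 0.1 * rng.standard_normal((100, 5, 4))
rewards[:10] = 10.0                       # adversarial users report corrupted values
greedy_policy = estimate_rewards(rewards, alpha=0.1).argmax(axis=1)
```

The paper additionally uses robust estimators for high-dimensional random variables and exploits the structure of the context distribution; this toy only conveys why per-user interactions must compensate for the adversarial fraction $α$ when information is pooled across users.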