Paper Title

Nested bandits

Authors

Matthieu Martin, Panayotis Mertikopoulos, Thibaud Rahier, Houssam Zenati

Abstract

In many online decision processes, the optimizing agent is called to choose between large numbers of alternatives with many inherent similarities; in turn, these similarities imply closely correlated losses that may confound standard discrete choice models and bandit algorithms. We study this question in the context of nested bandits, a class of adversarial multi-armed bandit problems where the learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. In this setting, optimal algorithms based on the exponential weights blueprint (like Hedge, EXP3, and their variants) may incur significant regret because they tend to spend excessive amounts of time exploring irrelevant alternatives with similar, suboptimal costs. To account for this, we propose a nested exponential weights (NEW) algorithm that performs a layered exploration of the learner's set of alternatives based on a nested, step-by-step selection method. In so doing, we obtain a series of tight bounds for the learner's regret showing that online learning problems with a high degree of similarity between alternatives can be resolved efficiently, without a red bus / blue bus paradox occurring.
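The abstract's central mechanism — choosing an alternative through a nested, step-by-step traversal of a hierarchy of similar options — can be illustrated with a toy two-level sketch. The code below is a simplified, full-information (Hedge-style) illustration only, not the paper's actual NEW algorithm: the category and alternative names, the learning rate `eta`, and the loss format are all assumptions made for the example, and this simplified version does not attempt the weight corrections that NEW uses to avoid the red bus / blue bus paradox.

```python
import math
import random

def nested_hedge(categories, losses, eta=0.5, rng=None):
    """Toy two-level step-by-step selection with exponential weights.

    categories: dict mapping category name -> list of alternative names
    losses: list of dicts mapping alternative name -> loss in [0, 1]
            (full-information feedback, a simplifying assumption)
    Returns the sequence of chosen alternatives, one per round.
    """
    rng = rng or random.Random(0)
    # One multiplicative weight per alternative; category weights
    # are obtained by aggregating the weights of their members.
    w = {a: 1.0 for alts in categories.values() for a in alts}
    chosen = []
    for loss_t in losses:
        # Step 1: sample a category with probability proportional to
        # the total weight of the alternatives it contains.
        cat_w = {c: sum(w[a] for a in alts) for c, alts in categories.items()}
        total = sum(cat_w.values())
        r, acc = rng.random() * total, 0.0
        for c, cw in cat_w.items():
            acc += cw
            if r <= acc:
                cat = c
                break
        # Step 2: sample an alternative within the chosen category,
        # again proportionally to the weights.
        alts = categories[cat]
        sub_total = sum(w[a] for a in alts)
        r, acc = rng.random() * sub_total, 0.0
        for a in alts:
            acc += w[a]
            if r <= acc:
                arm = a
                break
        chosen.append(arm)
        # Hedge-style multiplicative update on every alternative.
        for a in w:
            w[a] *= math.exp(-eta * loss_t[a])
    return chosen
```

In this toy run, two near-identical "bus" alternatives initially pool their weight at the category level (the crowding effect the paper's hierarchy is designed to handle); because the "car" alternative has zero loss, the learner concentrates on it over time.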
