Paper Title
Multiarmed Bandits Problem Under the Mean-Variance Setting
Paper Authors
Paper Abstract
The classical multi-armed bandit (MAB) problem involves a learner and a collection of K independent arms, each with its own ex ante unknown reward distribution. At each of a finite number of rounds, the learner selects one arm and receives new information. The learner often faces an exploration-exploitation dilemma: exploiting the current information by playing the arm with the highest estimated reward versus exploring all arms to gather more reward information. The design objective is to maximize the expected cumulative reward over all rounds. However, such an objective does not account for a risk-reward tradeoff, which is often a fundamental precept in many application areas, most notably finance and economics. In this paper, we build upon Sani et al. (2012) and extend the classical MAB problem to a mean-variance setting. Specifically, we relax the assumptions of independent arms and bounded rewards made in Sani et al. (2012) by considering sub-Gaussian arms. We introduce the Risk-Aware Lower Confidence Bound (RALCB) algorithm to solve the problem, and study some of its properties. Finally, we perform a number of numerical simulations to demonstrate that, in both independent and dependent scenarios, our suggested approach performs better than the algorithm suggested by Sani et al. (2012).
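For concreteness, the display below sketches the mean-variance criterion in the form used by Sani et al. (2012), where a smaller value is better and the parameter \(\rho \ge 0\) encodes risk tolerance. The LCB-style selection rule shown alongside it is only an illustrative assumption of how a risk-aware lower-confidence-bound index of this kind might be formed; the exact confidence width used by RALCB is not specified in the abstract.

\[
\mathrm{MV}_i \;=\; \sigma_i^2 \;-\; \rho\,\mu_i ,
\qquad
\widehat{\mathrm{MV}}_{i,t} \;=\; \widehat{\sigma}_{i,t}^{\,2} \;-\; \rho\,\widehat{\mu}_{i,t},
\]
\[
% Illustrative (assumed) index: play the arm with the smallest empirical
% mean-variance after subtracting a confidence width c_{i,t}.
I_t \;\in\; \argmin_{i \in \{1,\dots,K\}} \Big( \widehat{\mathrm{MV}}_{i,t} \;-\; c_{i,t} \Big).
\]

Here \(\widehat{\mu}_{i,t}\) and \(\widehat{\sigma}_{i,t}^{\,2}\) denote the empirical mean and variance of arm \(i\) after \(t\) rounds, and \(c_{i,t}\) is a hypothetical confidence width that, under the paper's sub-Gaussian assumption, would typically shrink as arm \(i\) is pulled more often.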