Paper Title

Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee for Improving Bandits

Paper Authors

Vishakha Patil, Vineet Nair, Ganesh Ghalme, Arindam Khan

Paper Abstract

We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward obtained from an arm increases with the number of pulls it receives. This model provides an elegant abstraction for many real-world problems in domains such as education and employment, where decisions about the distribution of opportunities can affect the future capabilities of communities and the disparity between them. A decision-maker in such settings must consider the impact of her decisions on future rewards in addition to the standard objective of maximizing her cumulative reward at any time. In many of these applications, the time horizon is unknown to the decision-maker beforehand, which motivates the study of the IMAB problem in the technically more challenging horizon-unaware setting. We study the tension that arises between two seemingly conflicting objectives in the horizon-unaware setting: a) maximizing the cumulative reward at any time based on current rewards of the arms, and b) ensuring that arms with better long-term rewards get sufficient opportunities even if they initially have low rewards. We show that, surprisingly, the two objectives are aligned with each other in this setting. Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward while ensuring that the arms reach their true potential given sufficient time. Our algorithm mitigates the initial disparity due to lack of opportunity and continues pulling an arm till it stops improving. We prove the optimality of our algorithm by showing that a) any algorithm for the IMAB problem, no matter how utilitarian, must suffer $Ω(T)$ policy regret and $Ω(k)$ competitive ratio with respect to the optimal offline policy, and b) the competitive ratio of our algorithm is $O(k)$.
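To make the setting concrete, the sketch below simulates a toy IMAB environment in Python: each arm's expected reward is a non-decreasing function of the number of pulls it has received. The `ImprovingArm` class, the saturating reward curve, and the round-robin-then-greedy policy are illustrative assumptions only; they are not the algorithm or reward model proposed in the paper.

```python
# A minimal, hypothetical sketch of the IMAB setting described in the abstract.
# The reward curves and the policy below are for illustration only and are NOT
# the paper's algorithm.
import math

class ImprovingArm:
    """An arm whose expected reward is non-decreasing in its pull count."""
    def __init__(self, ceiling, rate):
        self.ceiling = ceiling   # long-term reward the arm approaches
        self.rate = rate         # how quickly the arm improves with pulls
        self.pulls = 0

    def pull(self):
        # Reward grows with the number of pulls received, saturating at `ceiling`.
        reward = self.ceiling * (1 - math.exp(-self.rate * self.pulls))
        self.pulls += 1
        return reward

def run_illustrative_policy(arms, horizon, explore_rounds=20):
    """A naive round-robin-then-greedy heuristic used only to exercise the model.

    It never uses `horizon` to plan (horizon-unaware in spirit), but it makes no
    claim to the optimality or fairness guarantees established in the paper.
    """
    total = 0.0
    last_reward = [0.0] * len(arms)
    for t in range(horizon):
        if t < explore_rounds * len(arms):
            i = t % len(arms)  # give every arm some initial opportunities
        else:
            i = max(range(len(arms)), key=lambda j: last_reward[j])  # myopic choice
        last_reward[i] = arms[i].pull()
        total += last_reward[i]
    return total

if __name__ == "__main__":
    # Arm 0 starts weak but has the higher long-term reward; arm 1 is the opposite.
    arms = [ImprovingArm(ceiling=1.0, rate=0.02), ImprovingArm(ceiling=0.6, rate=0.2)]
    print("cumulative reward:", round(run_illustrative_policy(arms, horizon=2000), 2))
```

With these example parameters, arm 0 improves slowly toward the higher ceiling, so the myopic greedy phase locks onto arm 1 and never lets arm 0 reach its true potential; this is the kind of lack-of-opportunity disparity the abstract refers to.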
