Lifelong Learning in Multi-Armed Bandits

Paper title

Lifelong Learning in Multi-Armed Bandits

Authors

Matthieu Jedor, Jonathan Louëdec, Vianney Perchet

Abstract

Continuously learning and leveraging the knowledge accumulated from prior tasks in order to improve future performance is a long-standing machine learning problem. In this paper, we study the problem in the multi-armed bandit framework with the objective of minimizing the total regret incurred over a series of tasks. While most bandit algorithms are designed to have a low worst-case regret, we examine here the average regret over bandit instances drawn from some prior distribution which may change over time. We specifically focus on confidence interval tuning of UCB algorithms. We propose a bandit-over-bandit approach with greedy algorithms and we perform extensive experimental evaluations in both stationary and non-stationary environments. We further apply our solution to the mortal bandit problem, showing empirical improvement over previous work.
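To make the idea of tuning a UCB confidence interval across a series of tasks more concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes Bernoulli arms, a uniform prior over arm means, a UCB index of the form mean + sqrt(alpha * ln t / n), and a simple meta-level greedy rule that, after trying each candidate `alpha` once, keeps picking the one with the highest average reward per task so far. The names `run_ucb`, `lifelong_greedy`, and `alphas` are hypothetical and chosen only for illustration.

```python
import math
import random


def run_ucb(means, horizon, alpha, rng):
    """Play one Bernoulli bandit task with index mean + sqrt(alpha * ln t / n).

    Returns (total_reward, pseudo_regret) for the task.
    """
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    total_reward = 0.0
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(alpha * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
        regret += best - means[arm]
    return total_reward, regret


def lifelong_greedy(alphas, n_tasks, n_arms, horizon, seed=0):
    """Meta-level greedy selection of alpha over a sequence of tasks (assumed setup)."""
    rng = random.Random(seed)
    plays = [0] * len(alphas)
    rewards = [0.0] * len(alphas)
    history = []
    for task in range(n_tasks):
        if task < len(alphas):
            i = task  # try every candidate width once
        else:
            # Greedy step: candidate with the best average task reward so far.
            i = max(range(len(alphas)), key=lambda j: rewards[j] / plays[j])
        # Assumption: each task's arm means are drawn from a uniform prior.
        means = [rng.random() for _ in range(n_arms)]
        total_reward, regret = run_ucb(means, horizon, alphas[i], rng)
        plays[i] += 1
        rewards[i] += total_reward
        history.append((alphas[i], total_reward, regret))
    return plays, rewards, history


if __name__ == "__main__":
    plays, rewards, history = lifelong_greedy(
        alphas=[0.0, 0.5, 2.0], n_tasks=30, n_arms=5, horizon=200
    )
    for a, p, r in zip([0.0, 0.5, 2.0], plays, rewards):
        print(f"alpha={a}: chosen {p} times, mean task reward {r / p:.1f}")
```

The meta-learner only uses observed rewards as feedback; the pseudo-regret is tracked for reporting, since in practice the true arm means are unknown.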
