Paper Title

Differentiable Bandit Exploration

Paper Authors

Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer

Paper Abstract

Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution $\mathcal{P}$. In this work, we learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$. Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is general and easy to implement. We derive effective gradient estimators and introduce novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also observe that neural network policies can learn implicit biases expressed only through the sampled instances.
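
The abstract describes differentiable bandit policies that are meta-trained by policy gradients on problem instances sampled from $\mathcal{P}$, with variance reduction and a softmax policy class. Below is a minimal illustrative sketch, not the paper's algorithm: a one-parameter softmax policy (a learnable inverse temperature over empirical arm means) meta-trained with a REINFORCE-style score-function gradient and a running baseline for variance reduction. The choice of prior $\mathcal{P}$ (uniform Bernoulli means), horizon, learning rate, and initialization are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_softmax_policy(theta, mu, horizon=200):
    """Run a softmax exploration policy (inverse temperature theta) on one
    Bernoulli bandit instance with arm means mu. Returns the total reward and
    the accumulated score d/dtheta sum_t log pi_theta(a_t), which the
    REINFORCE-style gradient estimator uses."""
    k = len(mu)
    counts = np.ones(k)   # one optimistic pseudo-pull per arm
    sums = np.ones(k)     # with pseudo-reward 1, so every arm gets tried early
    total_reward, score = 0.0, 0.0
    for _ in range(horizon):
        means = sums / counts
        logits = theta * means
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        arm = rng.choice(k, p=probs)
        reward = float(rng.random() < mu[arm])
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
        # d log pi(arm) / d theta for a softmax over theta * means
        score += means[arm] - probs @ means
    return total_reward, score

# Meta-train theta by policy gradients on instances sampled from an assumed
# prior P; here P draws 3 Bernoulli arm means uniformly from [0, 1].
theta, lr, baseline = 1.0, 0.05, 0.0
for step in range(1000):
    mu = rng.random(3)                      # sample a problem instance from P
    ret, score = run_softmax_policy(theta, mu)
    baseline = 0.9 * baseline + 0.1 * ret   # running baseline for variance reduction
    theta += lr * (ret - baseline) * score  # REINFORCE ascent on expected reward
print("learned inverse temperature:", theta)
```

In this sketch the only learnable parameter is the inverse temperature; the paper's policy classes (e.g., neural network policies) would replace the scalar theta with richer parameterizations, but the score-function gradient with a baseline follows the same pattern.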
