Paper Title

Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation

Paper Authors

Mengyue Yang, Qingyang Li, Zhiwei Qin, Jieping Ye

Paper Abstract

The contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems. When it comes to real-world scenarios such as recommendation systems and online advertising, however, it is essential to consider the resource consumption of exploration. In practice, there is typically a non-zero cost associated with executing a recommendation (arm) in the environment, and hence the policy should be learned under a fixed exploration cost constraint. Learning a global optimal policy directly is challenging, since it is an NP-hard problem and significantly complicates the exploration-exploitation trade-off of bandit algorithms. Existing approaches address the problem with a greedy policy that estimates the expected rewards and costs from historical observations and selects, at each step, the arm with the best expected reward/cost ratio until the exploration resource is exhausted. However, such methods are hard to extend to an infinite time horizon, because the learning process terminates once there is no resource left. In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) for policy learning of contextual bandits under a budget constraint. HATCH adopts an adaptive method to allocate the exploration resource based on the remaining resource/time and the estimated reward distribution across different user contexts. In addition, we utilize the full contextual feature information to find the best personalized recommendation. Finally, to provide a theoretical guarantee, we present a regret analysis and prove that HATCH achieves a regret bound as low as $O(\sqrt{T})$. Experimental results demonstrate the effectiveness and efficiency of the proposed method on both synthetic data sets and real-world applications.
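For illustration, the sketch below shows the budget-constrained greedy baseline the abstract describes: per-arm expected rewards and costs are estimated from historical observations and the arm with the highest estimated reward/cost ratio is played until the exploration budget runs out. This is a minimal sketch, not the paper's HATCH implementation; the ridge-regression estimators and all class and parameter names (e.g. BudgetedRatioGreedyBandit) are assumptions made for this example.

```python
import numpy as np

class BudgetedRatioGreedyBandit:
    """Sketch of the greedy baseline from the abstract: play the arm with the
    highest estimated reward/cost ratio until the exploration budget is spent.
    The linear (ridge-regression) reward/cost estimators are an assumption of
    this sketch, not the paper's specification."""

    def __init__(self, n_arms, dim, budget, reg=1.0):
        self.budget = budget
        # Per-arm ridge-regression sufficient statistics for reward and cost.
        self.A = [reg * np.eye(dim) for _ in range(n_arms)]
        self.b_reward = [np.zeros(dim) for _ in range(n_arms)]
        self.b_cost = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, context):
        """Return the greedy arm index, or None once the budget is exhausted."""
        if self.budget <= 0:
            return None  # learning terminates when no resource remains
        ratios = []
        for A, br, bc in zip(self.A, self.b_reward, self.b_cost):
            A_inv = np.linalg.inv(A)
            r_hat = float(context @ (A_inv @ br))              # estimated expected reward
            c_hat = max(float(context @ (A_inv @ bc)), 1e-6)   # estimated cost, kept positive
            ratios.append(r_hat / c_hat)
        return int(np.argmax(ratios))

    def update(self, arm, context, reward, cost):
        """Record the observed reward/cost and deduct the cost from the budget."""
        self.A[arm] += np.outer(context, context)
        self.b_reward[arm] += reward * context
        self.b_cost[arm] += cost * context
        self.budget -= cost
```

As the abstract notes, such a policy simply stops once the budget is exhausted; HATCH instead allocates the exploration resource adaptively based on the remaining resource/time and the estimated reward distribution across user contexts, which is what allows it to extend to a longer time horizon.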
