Paper Title
Non-monotonic Resource Utilization in the Bandits with Knapsacks Problem
Paper Authors
Paper Abstract
Bandits with knapsacks (BwK) is an influential model of sequential decision-making under uncertainty that incorporates resource consumption constraints. In each round, the decision-maker observes an outcome consisting of a reward and a vector of nonnegative resource consumptions, and the budget of each resource is decremented by its consumption. In this paper we introduce a natural generalization of the stochastic BwK problem that allows non-monotonic resource utilization. In each round, the decision-maker observes an outcome consisting of a reward and a vector of resource drifts that can be positive, negative or zero, and the budget of each resource is incremented by its drift. Our main result is a Markov decision process (MDP) policy that has constant regret against a linear programming (LP) relaxation when the decision-maker knows the true outcome distributions. We build upon this to develop a learning algorithm that has logarithmic regret against the same LP relaxation when the decision-maker does not know the true outcome distributions. We also present a reduction from BwK to our model that shows our regret bound matches existing results.
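To make the model dynamics concrete, here is a minimal simulation sketch of one round of the drift-based process described above. Only the interface is taken from the abstract (play an arm, observe a reward and a vector of drifts, increment each budget by its drift); the class name DriftBwKEnv, the Gaussian outcome distributions, and the stopping check are illustrative assumptions, not the paper's construction.

```python
import numpy as np

class DriftBwKEnv:
    """One round of the generalized BwK model with resource drifts.

    Illustrative sketch: K arms and d resources; each arm has a fixed
    (hypothetical) Gaussian outcome distribution over rewards and drifts.
    """

    def __init__(self, rewards_mean, drifts_mean, budgets, seed=0):
        self.rewards_mean = np.asarray(rewards_mean, dtype=float)  # shape (K,)
        self.drifts_mean = np.asarray(drifts_mean, dtype=float)    # shape (K, d)
        self.budgets = np.asarray(budgets, dtype=float)            # shape (d,)
        self.rng = np.random.default_rng(seed)

    def step(self, arm):
        """Play `arm`: draw an outcome, credit the reward, apply the drifts."""
        reward = self.rng.normal(self.rewards_mean[arm], 0.1)
        drift = self.rng.normal(self.drifts_mean[arm], 0.1)
        # Unlike classical BwK, a drift may be positive, negative, or zero,
        # so each budget is *incremented* by its drift and can grow.
        self.budgets += drift
        # The process ends once some resource budget is exhausted.
        done = bool(np.any(self.budgets <= 0.0))
        return reward, drift, done

# Example: arm 0 replenishes the single resource on average, arm 1 depletes it.
env = DriftBwKEnv(rewards_mean=[0.3, 0.9],
                  drifts_mean=[[+0.2], [-0.4]],
                  budgets=[5.0])
reward, drift, done = env.step(arm=1)
```

Classical BwK is recovered, roughly, by taking each drift to be the negated resource consumption, which is the direction of the reduction mentioned in the abstract's final sentence.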