Paper Title
Non-monotonic Resource Utilization in the Bandits with Knapsacks Problem
Paper Authors
Paper Abstract
Bandits with knapsacks (BwK) is an influential model of sequential decision-making under uncertainty that incorporates resource consumption constraints. In each round, the decision-maker observes an outcome consisting of a reward and a vector of nonnegative resource consumptions, and the budget of each resource is decremented by its consumption. In this paper we introduce a natural generalization of the stochastic BwK problem that allows non-monotonic resource utilization. In each round, the decision-maker observes an outcome consisting of a reward and a vector of resource drifts that can be positive, negative or zero, and the budget of each resource is incremented by its drift. Our main result is a Markov decision process (MDP) policy that has constant regret against a linear programming (LP) relaxation when the decision-maker knows the true outcome distributions. We build upon this to develop a learning algorithm that has logarithmic regret against the same LP relaxation when the decision-maker does not know the true outcome distributions. We also present a reduction from BwK to our model that shows our regret bound matches existing results.
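To make the model dynamics concrete, here is a minimal simulation sketch of one round of the drift-based process described above. Only the interface is taken from the abstract (play an arm, observe a reward and a vector of drifts, increment each budget by its drift); the class name DriftBwKEnv, the Gaussian outcome distributions, and the stopping check are illustrative assumptions, not the paper's construction.

```python
import numpy as np

class DriftBwKEnv:
    """One round of the generalized BwK model with resource drifts.

    Illustrative sketch: K arms and d resources; each arm has a fixed
    (hypothetical) Gaussian outcome distribution over rewards and drifts.
    """

    def __init__(self, rewards_mean, drifts_mean, budgets, seed=0):
        self.rewards_mean = np.asarray(rewards_mean, dtype=float)  # shape (K,)
        self.drifts_mean = np.asarray(drifts_mean, dtype=float)    # shape (K, d)
        self.budgets = np.asarray(budgets, dtype=float)            # shape (d,)
        self.rng = np.random.default_rng(seed)

    def step(self, arm):
        """Play `arm`: draw an outcome, credit the reward, apply the drifts."""
        reward = self.rng.normal(self.rewards_mean[arm], 0.1)
        drift = self.rng.normal(self.drifts_mean[arm], 0.1)
        # Unlike classical BwK, a drift may be positive, negative, or zero,
        # so each budget is *incremented* by its drift and can grow.
        self.budgets += drift
        # The process ends once some resource budget is exhausted.
        done = bool(np.any(self.budgets <= 0.0))
        return reward, drift, done

# Example: arm 0 replenishes the single resource on average, arm 1 depletes it.
env = DriftBwKEnv(rewards_mean=[0.3, 0.9],
                  drifts_mean=[[+0.2], [-0.4]],
                  budgets=[5.0])
reward, drift, done = env.step(arm=1)
```

Classical BwK is recovered, roughly, by taking each drift to be the negated resource consumption, which is the direction of the reduction mentioned in the abstract's final sentence.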