Paper Title

On learning Whittle index policy for restless bandits with scalable regret

Paper Authors

Akbarzadeh, Nima, Mahajan, Aditya

Paper Abstract

Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system model is unknown. However, the cumulative regret of most RL algorithms scales as $\tilde O(\mathsf{S} \sqrt{\mathsf{A} T})$, where $\mathsf{S}$ is the size of the state space, $\mathsf{A}$ is the size of the action space, $T$ is the horizon, and the $\tilde{O}(\cdot)$ notation hides logarithmic terms. Due to the linear dependence on the size of the state space, these regret bounds are prohibitively large for resource allocation and scheduling problems. In this paper, we present a model-based RL algorithm for such problems which has scalable regret. In particular, we consider a restless bandit model, and propose a Thompson-sampling based learning algorithm which is tuned to the underlying structure of the model. We present two characterizations of the regret of the proposed algorithm with respect to the Whittle index policy. First, we show that for a restless bandit with $n$ arms and at most $m$ activations at each time, the regret scales either as $\tilde{O}(mn\sqrt{T})$ or $\tilde{O}(n^2 \sqrt{T})$ depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as $\tilde{O}(n^{1.5} \sqrt{T})$ or $\tilde{O}(\max\{m\sqrt{n}, n\} \sqrt{T})$. We present numerical examples to illustrate the salient features of the algorithm.
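
The abstract describes a Thompson-sampling learner that follows a Whittle-index-style activation rule. For context on why scalability matters: with $n$ arms of $k$ states each, the joint state space has $k^n$ elements, so a generic $\tilde O(\mathsf{S}\sqrt{\mathsf{A}T})$ bound grows exponentially in $n$. The snippet below is a minimal Python sketch of that overall loop under stated assumptions, not the authors' algorithm: it assumes each arm is an independent two-action Markov chain with Dirichlet priors on its unknown transition rows, uses a hypothetical `whittle_proxy` value-gap heuristic in place of a true Whittle-index solver, and the growing-episode resampling schedule is likewise illustrative.

```python
# Minimal sketch (assumptions noted above): episodic Thompson sampling for a
# restless bandit with an index-based activation rule. Not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

n_arms, n_states, m_activate, T = 5, 3, 2, 2000

# Unknown ground-truth transition kernels P_true[arm, action, s, s'] and rewards R[arm, s].
P_true = rng.dirichlet(np.ones(n_states), size=(n_arms, 2, n_states))
R = rng.uniform(size=(n_arms, n_states))

# Dirichlet posterior counts over each arm's transition rows (uniform prior).
counts = np.ones((n_arms, 2, n_states, n_states))

def whittle_proxy(P_arm, r_arm, gamma=0.95):
    """Hypothetical placeholder index: per-state gap between the discounted values
    of the 'always active' and 'always passive' chains. A real Whittle-index solver
    would instead find, for each state, the passivity subsidy that makes the two
    actions equally attractive."""
    v_active = np.linalg.solve(np.eye(n_states) - gamma * P_arm[1], r_arm)
    v_passive = np.linalg.solve(np.eye(n_states) - gamma * P_arm[0], r_arm)
    return v_active - v_passive

states = rng.integers(n_states, size=n_arms)
t, episode_len = 0, 1
while t < T:
    # Thompson sampling: draw one plausible model per arm from the posterior.
    P_sample = np.array([[[rng.dirichlet(counts[i, a, s]) for s in range(n_states)]
                          for a in range(2)] for i in range(n_arms)])
    # Precompute an index table for each arm under its sampled model.
    idx_tables = [whittle_proxy(P_sample[i], R[i]) for i in range(n_arms)]
    for _ in range(episode_len):
        if t >= T:
            break
        # Activate the m arms whose current states carry the largest indices.
        priorities = np.array([idx_tables[i][states[i]] for i in range(n_arms)])
        active = set(np.argsort(priorities)[-m_activate:].tolist())
        for i in range(n_arms):
            a = int(i in active)
            s_next = rng.choice(n_states, p=P_true[i, a, states[i]])
            counts[i, a, states[i], s_next] += 1  # Bayesian posterior update
            states[i] = s_next
        t += 1
    episode_len += 1  # episodes grow, so the sampled models are refreshed less often
```

Because each arm's model is learned separately, the posterior state scales with the number of arms rather than with the exponential joint state space, which is the structural property the per-arm regret bounds in the abstract exploit.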
