Paper title
DeepTOP: Deep Threshold-Optimal Policy for MDPs and RMABs
Paper authors
Paper abstract
We consider the problem of learning the optimal threshold policy for control problems. Threshold policies make control decisions by evaluating whether an element of the system state exceeds a certain threshold, whose value is determined by other elements of the system state. By leveraging the monotone property of threshold policies, we prove that their policy gradients have a surprisingly simple expression. We use this simple expression to build an off-policy actor-critic algorithm for learning the optimal threshold policy. Simulation results show that our policy significantly outperforms other reinforcement learning algorithms due to its ability to exploit the monotone property. In addition, we show that the Whittle index, a powerful tool for restless multi-armed bandit problems, is equivalent to the optimal threshold policy for an alternative problem. This observation leads to a simple algorithm that finds the Whittle index by learning the optimal threshold policy in the alternative problem. Simulation results show that our algorithm learns the Whittle index much faster than several recent studies that learn the Whittle index through indirect means.
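To make the abstract's definition of a threshold policy concrete, the following is a minimal sketch, not the authors' implementation. It assumes the state splits into one scalar element and a vector of other elements, and that the action is binary. The names `threshold_policy` and `linear_threshold`, along with the weights and bias, are hypothetical and chosen only for illustration; in the paper the threshold function is learned, whereas here it is fixed by hand.

```python
from typing import Callable, Sequence


def threshold_policy(
    scalar_state: float,
    other_state: Sequence[float],
    threshold_fn: Callable[[Sequence[float]], float],
) -> int:
    """Return 1 if the scalar state element exceeds the threshold, else 0.

    The threshold itself depends only on the other elements of the state.
    """
    return 1 if scalar_state > threshold_fn(other_state) else 0


def linear_threshold(other_state: Sequence[float]) -> float:
    """Hypothetical threshold function (illustrative weights, not learned)."""
    weights = (0.5, -1.0)
    bias = 2.0
    return bias + sum(w * v for w, v in zip(weights, other_state))


if __name__ == "__main__":
    other = (1.0, 0.5)  # threshold = 2.0 + 0.5 * 1.0 - 1.0 * 0.5 = 2.0
    for s in (1.0, 2.0, 3.0):
        print(s, threshold_policy(s, other, linear_threshold))
```

For a fixed `other_state`, the action never decreases as `scalar_state` grows. This is the monotone property that the abstract says the policy gradient derivation exploits.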