论文标题
使用Voronoi树进行自适应离散化,以进行连续行动POMDPS
Adaptive Discretization using Voronoi Trees for Continuous-Action POMDPs
论文作者
论文摘要
通过连续行动解决部分可观察到的马尔可夫决策过程(POMDP)是具有挑战性的,尤其是对于高维操作空间。为了减轻这一困难,我们提出了一种新的基于采样的在线POMDP求解器,称为使用Voronoi树(Advt)的自适应离散化。它结合了蒙特卡洛树搜索与动作空间的自适应离散化以及乐观的优化,以有效地采样高维连续的动作空间并计算最佳行动。具体而言,我们使用称为Voronoi树的分层分区来适应每个采样信念的动作空间。 Voronoi树是一种二进制空间分区(BSP),它隐式地将单元格的分区保留为从单元中采样的两个点的伏诺图图。这种分区策略可以保持分区和估计每个细胞的大小的成本,即使在高维空间中,需要许多采样点才能很好地覆盖空间。 Advt使用单元格的估计尺寸形成细胞动作值的上限结合,进而使用上等信心来指导蒙特卡洛树搜索扩展并进一步离散动作空间。该策略使Advt能够更好地利用动作空间中的本地信息,从而导致动作空间离散化,从而更适合自适应,因此与现有求解器相比,计算良好的POMDP解决方案的效率更高。对四种基准问题的模拟实验表明,与最新的连续动作POMDP求解器相比,ADVT优于高维连续作用空间的表现要好于高维连续的动作空间。
Solving Partially Observable Markov Decision Processes (POMDPs) with continuous actions is challenging, particularly for high-dimensional action spaces. To alleviate this difficulty, we propose a new sampling-based online POMDP solver, called Adaptive Discretization using Voronoi Trees (ADVT). It uses Monte Carlo Tree Search in combination with an adaptive discretization of the action space as well as optimistic optimization to efficiently sample high-dimensional continuous action spaces and compute the best action to perform. Specifically, we adaptively discretize the action space for each sampled belief using a hierarchical partition which we call a Voronoi tree. A Voronoi tree is a Binary Space Partitioning (BSP) that implicitly maintains the partition of a cell as the Voronoi diagram of two points sampled from the cell. This partitioning strategy keeps the cost of partitioning and estimating the size of each cell low, even in high-dimensional spaces where many sampled points are required to cover the space well. ADVT uses the estimated sizes of the cells to form an upper-confidence bound of the action values of the cell, and in turn uses the upper-confidence bound to guide the Monte Carlo Tree Search expansion and further discretization of the action space. This strategy enables ADVT to better exploit local information in the action space, leading to an action space discretization that is more adaptive, and hence more efficient in computing good POMDP solutions, compared to existing solvers. Experiments on simulations of four types of benchmark problems indicate that ADVT outperforms and scales substantially better to high-dimensional continuous action spaces, compared to state-of-the-art continuous action POMDP solvers.