PlaNet of the Bayesians: Reconsidering and Improving Deep Planning Network by Incorporating Bayesian Inference

Paper title

PlaNet of the Bayesians: Reconsidering and Improving Deep Planning Network by Incorporating Bayesian Inference

Authors

Masashi Okada, Norio Kosaka, Tadahiro Taniguchi

Abstract

In the present paper, we propose an extension of the Deep Planning Network (PlaNet), referred to as PlaNet of the Bayesians (PlaNet-Bayes). There is a growing demand for model predictive control (MPC) in partially observable environments, in which complete information is unavailable because of, for example, a lack of expensive sensors. PlaNet is a promising solution for realizing such latent MPC, as it trains state-space models via model-based reinforcement learning (MBRL) and conducts planning in the latent space. However, recent state-of-the-art strategies from the MBRL literature, such as incorporating uncertainty into training and planning, have not been considered in PlaNet, which significantly suppresses its training performance. The proposed extension makes PlaNet uncertainty-aware on the basis of Bayesian inference, incorporating both model and action uncertainty. Uncertainty in the latent models is represented using a neural network ensemble that approximately infers the model posteriors. An ensemble of optimal action candidates is also employed to capture multimodal uncertainty in optimality. The concept of the action ensemble relies on a general variational inference MPC (VI-MPC) framework and its instance, probabilistic action ensemble with trajectory sampling (PaETS). In this paper, we extend VI-MPC and PaETS, which were originally introduced in previous literature, to address partially observable cases. We experimentally compare performance on continuous control tasks and conclude that our method consistently improves asymptotic performance compared with PlaNet.
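The two ensembles described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a scalar, fully observed state in place of a learned latent state, replaces the neural network ensemble with a handful of hypothetical linear models, and refines each action-ensemble component with a simple cross-entropy-style elite update rather than the PaETS update from the paper. All function names and parameters below are illustrative assumptions.

```python
import random
import statistics


def make_model_ensemble(num_models, rng):
    # Hypothetical stand-in for a learned dynamics ensemble: each member is
    # s' = a * s + b * u with slightly perturbed (a, b), mimicking model uncertainty.
    return [(1.0 + rng.gauss(0, 0.05), 0.5 + rng.gauss(0, 0.05))
            for _ in range(num_models)]


def rollout_return(model, state, actions, goal=1.0):
    # Sum of rewards along one imagined trajectory; reward is the negative
    # squared distance to an assumed goal state.
    a, b = model
    total = 0.0
    for u in actions:
        state = a * state + b * u
        total -= (state - goal) ** 2
    return total


def expected_return(models, state, actions, rng, num_samples=5):
    # Trajectory sampling: each imagined rollout uses a randomly drawn ensemble member.
    returns = [rollout_return(rng.choice(models), state, actions)
               for _ in range(num_samples)]
    return statistics.fmean(returns)


def plan(models, state, horizon=5, num_components=3, num_candidates=50,
         num_elites=5, iterations=5, seed=0):
    rng = random.Random(seed)
    # Action ensemble: several Gaussian components over action sequences,
    # each with its own mean, so distinct modes can be kept in parallel.
    means = [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
             for _ in range(num_components)]
    std = 1.0
    best_actions, best_score = None, float("-inf")
    for _ in range(iterations):
        for m in range(num_components):
            candidates = [[mu + rng.gauss(0, std) for mu in means[m]]
                          for _ in range(num_candidates)]
            scored = sorted(
                ((expected_return(models, state, c, rng), c) for c in candidates),
                key=lambda pair: pair[0], reverse=True)
            elites = [c for _, c in scored[:num_elites]]
            # Move this component's mean toward its own elite candidates.
            means[m] = [statistics.fmean([e[t] for e in elites])
                        for t in range(horizon)]
            if scored[0][0] > best_score:
                best_score, best_actions = scored[0]
        std *= 0.7  # shrink the search width each iteration
    return best_actions, best_score


models = make_model_ensemble(5, random.Random(0))
actions, score = plan(models, state=0.0)
```

In an MPC loop, only the first element of `actions` would be applied before replanning from the next state. Keeping several components instead of a single Gaussian is what lets the planner retain more than one candidate mode of the action distribution.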
