A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning

Paper title

A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning

Authors

Jixian Guo, Mingming Gong, Dacheng Tao

Abstract

The generalization of model-based reinforcement learning (MBRL) methods to environments with unseen transition dynamics is an important yet challenging problem. Existing methods try to extract environment-specific information $Z$ from past transition segments to make the dynamics prediction model generalizable to different dynamics. However, because environments are not labelled, the extracted information inevitably contains redundant information unrelated to the dynamics in transition segments and thus fails to maintain a crucial property of $Z$: $Z$ should be similar in the same environment and dissimilar in different ones. As a result, the learned dynamics prediction function will deviate from the true one, which undermines the generalization ability. To tackle this problem, we introduce an interventional prediction module to estimate the probability that two estimates $\hat{z}_i$ and $\hat{z}_j$ belong to the same environment. Furthermore, by utilizing the invariance of $Z$ within a single environment, a relational head is proposed to enforce the similarity between $\hat{Z}$ from the same environment. As a result, the redundant information in $\hat{Z}$ is reduced. We empirically show that $\hat{Z}$ estimated by our method contains less redundant information than that of previous methods, and that such $\hat{Z}$ can significantly reduce dynamics prediction errors and improve the zero-shot performance of model-based RL methods on new environments with unseen dynamics. The code for this method is available at https://github.com/CR-Gjx/RIA.
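
To make the relational head idea more concrete, the following is a minimal illustrative sketch in Python, not the authors' implementation (see the repository above for that). It assumes a head that scores a pair of context estimates and a binary cross-entropy loss against a soft "same environment" target, such as one that an interventional prediction module might supply. The names `relational_head` and `relation_loss`, the element-wise product features, and the fixed weights are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relational_head(z_i, z_j, weights, bias):
    """Hypothetical relational head: score a pair of context estimates.

    Returns a value in (0, 1), read here as the predicted probability
    that z_i and z_j were extracted from the same environment.
    """
    # Element-wise product gives symmetric pair features, so the score
    # does not depend on the order of the two inputs (an assumption of
    # this sketch, not something stated in the abstract).
    features = [a * b for a, b in zip(z_i, z_j)]
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)


def relation_loss(p_pred, p_same, eps=1e-12):
    """Binary cross-entropy against a soft target p_same in [0, 1]."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(p_same * math.log(p) + (1.0 - p_same) * math.log(1.0 - p))


# Toy context estimates: z_a and z_b are close, z_c points the other way.
z_a = [1.0, 0.5, -0.2]
z_b = [0.9, 0.6, -0.1]
z_c = [-1.0, -0.4, 0.3]

# Fixed illustrative parameters; in practice these would be learned.
w = [1.0, 1.0, 1.0]
b = 0.0

p_ab = relational_head(z_a, z_b, w, b)  # similar pair, higher score
p_ac = relational_head(z_a, z_c, w, b)  # dissimilar pair, lower score
```

Minimizing `relation_loss` over pairs with a high same-environment target pushes their estimates toward each other, which is one plausible way to read "enforce the similarity between $\hat{Z}$ from the same environment".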
