Paper Title
Active Model Estimation in Markov Decision Processes
Paper Authors
Paper Abstract
We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample complexity analysis. While this algorithm enjoys strong guarantees in the large-sample regime, it tends to perform poorly in the early stages of exploration. To address this issue, we propose an algorithm based on maximum weighted entropy, a heuristic that stems from common sense and our theoretical analysis. The main idea here is to cover the entire state-action space with weights proportional to the noise in the transitions. Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small-sample regime, while achieving asymptotic performance similar to that of the original algorithm.
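The weighted-entropy idea can be made concrete with a short sketch. The Python snippet below is a minimal, hypothetical illustration, not the paper's exact construction: the toy counts `N`, the clipping constant `eps`, and the choice of next-state entropy as the noise proxy are all assumptions. It weights each state-action pair by the entropy of its empirical next-state distribution and evaluates the weighted entropy $H_w(\lambda) = -\sum_{s,a} w(s,a)\,\lambda(s,a)\log\lambda(s,a)$ of a candidate state-action visitation distribution $\lambda$, ignoring the constraint that $\lambda$ must be realizable by a policy in the MDP.

```python
import numpy as np

# Toy tabular MDP: S states, A actions, hypothetical empirical
# transition counts N[s, a, s'] (stand-ins for collected samples).
rng = np.random.default_rng(0)
S, A = 4, 2
N = rng.integers(1, 20, size=(S, A, S)).astype(float)

def noise_weights(counts, eps=1e-12):
    """Weight each (s, a) by the entropy of its empirical next-state
    distribution, one possible proxy for transition noise."""
    p = counts / counts.sum(axis=-1, keepdims=True)
    p = np.clip(p, eps, None)
    return -np.sum(p * np.log(p), axis=-1)  # shape (S, A)

def weighted_entropy(lam, w, eps=1e-12):
    """Weighted entropy H_w(lam) = -sum_{s,a} w(s,a) lam(s,a) log lam(s,a)
    of a state-action visitation distribution lam."""
    lam = np.clip(lam, eps, None)
    return -np.sum(w * lam * np.log(lam))

w = noise_weights(N)

# Compare a uniform visitation distribution with one that allocates
# visits in proportion to the noise weights: the noisier a transition,
# the more samples it receives.
uniform = np.full((S, A), 1.0 / (S * A))
proportional = w / w.sum()
print("H_w(uniform)      =", weighted_entropy(uniform, w))
print("H_w(proportional) =", weighted_entropy(proportional, w))
```

Note that the distribution maximizing $H_w$ is not, in general, exactly proportional to $w$; the comparison above only illustrates how noise-aware allocations are scored differently from uniform coverage under the weighted-entropy objective.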