Paper Title

Reward Biased Maximum Likelihood Estimation for Reinforcement Learning

Authors

Akshay Mete, Rahul Singh, Xi Liu, P. R. Kumar

Abstract

The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed to overcome the central obstacle of what is variously called the fundamental "closed-loop identifiability problem" of adaptive control, the "dual control problem", or, contemporaneously, the "exploration vs. exploitation problem". It exploited the key observation that since the maximum likelihood parameter estimator can asymptotically identify only the closed-loop transition probabilities under a certainty equivalent approach, the limiting parameter estimates must necessarily have an optimal reward that is less than the optimal reward attainable for the true but unknown system. Hence it proposed a counteracting reverse bias in favor of parameters with larger optimal rewards, providing a solution to the fundamental problem alluded to above. It thereby proposed an optimistic approach of favoring parameters with larger optimal rewards, now known as "optimism in the face of uncertainty". The RBMLE approach has been proved to be long-term average reward optimal in a variety of contexts. However, modern attention is focused on the much finer notion of "regret", or finite-time performance. Recent analysis of RBMLE for multi-armed stochastic bandits and linear contextual bandits has shown that it not only has state-of-the-art regret, but it also exhibits empirical performance comparable to or better than the best current contenders, and leads to strikingly simple index policies. Motivated by this, we examine the finite-time performance of RBMLE for reinforcement learning tasks that involve the general problem of optimal control of unknown Markov Decision Processes. We show that it has a regret of $\mathcal{O}( \log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms. Simulation studies show that RBMLE outperforms other algorithms such as UCRL2 and Thompson Sampling.
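
For reference, a sketch of the reward-biased criterion following the classical RBMLE literature (the specific symbols $\mathcal{L}_t$, $J^*$, and $\alpha(t)$ below are illustrative and not taken from this abstract): the estimate at time $t$ augments the log-likelihood with a bias toward parameters whose optimal reward is large, $\hat{\theta}_t \in \arg\max_{\theta} \left\{ \log \mathcal{L}_t(\theta) + \alpha(t)\, J^*(\theta) \right\}$, where $\mathcal{L}_t(\theta)$ is the likelihood of the observed trajectory under parameter $\theta$, $J^*(\theta)$ is the optimal long-term average reward of the MDP with parameter $\theta$, and $\alpha(t)$ is a bias weight chosen so that $\alpha(t) \to \infty$ while $\alpha(t)/t \to 0$. The controller then applies the policy that is optimal for $\hat{\theta}_t$; the added term $\alpha(t)\, J^*(\theta)$ is the "counteracting reverse bias" described in the abstract.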
