Paper Title
Expert-Supervised Reinforcement Learning for Offline Policy Learning and Evaluation
Authors
Abstract
Offline Reinforcement Learning (RL) is a promising approach for learning optimal policies in environments where direct exploration is expensive or infeasible. However, adopting such policies in practice is often challenging: they are hard to interpret within the application context, and they lack measures of uncertainty for the learned policy value and its decisions. To overcome these issues, we propose an Expert-Supervised RL (ESRL) framework that uses uncertainty quantification for offline policy learning. In particular, we make three contributions: 1) the method can learn safe and optimal policies through hypothesis testing; 2) ESRL allows for different levels of risk-averse implementation tailored to the application context; and 3) we propose a way to interpret ESRL's policy at every state through posterior distributions, and use this framework to compute off-policy value function posteriors. We provide theoretical guarantees for our estimators and regret bounds consistent with Posterior Sampling for RL (PSRL). The sample efficiency of ESRL is independent of the chosen risk-aversion threshold and of the quality of the behavior policy.
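To make the abstract's decision rule concrete, below is a minimal, hypothetical sketch of the kind of posterior-based, risk-aware action selection it describes: sample from a posterior over state-action values (as in PSRL), then deviate from the expert (behavior) action only when a one-sided posterior test clears a risk-aversion threshold. This is an illustration under assumed interfaces, not the paper's actual algorithm; the function `esrl_action`, its arguments, and the specific test are all assumptions.

```python
import numpy as np

def esrl_action(q_posterior_samples, expert_action, risk_alpha, rng=None):
    """Hypothetical ESRL-style decision rule (sketch only, not the paper's algorithm).

    q_posterior_samples: array of shape (n_samples, n_actions) drawn from a
        posterior over state-action values at the current state.
    expert_action: action the behavior (expert) policy would take.
    risk_alpha: risk-aversion threshold in (0, 1); smaller values are more conservative.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_samples, _ = q_posterior_samples.shape

    # Posterior-sampling candidate: the action that is optimal under one posterior draw.
    candidate = int(np.argmax(q_posterior_samples[rng.integers(n_samples)]))
    if candidate == expert_action:
        return candidate

    # One-sided test: posterior probability that the candidate outperforms the expert action.
    diff = q_posterior_samples[:, candidate] - q_posterior_samples[:, expert_action]
    prob_improvement = float(np.mean(diff > 0.0))

    # Deviate from the expert only when the posterior evidence clears the threshold.
    return candidate if prob_improvement >= 1.0 - risk_alpha else expert_action

# Example usage with a toy Gaussian posterior over three actions.
samples = np.random.default_rng(0).normal(loc=[0.1, 0.4, 0.2], scale=0.3, size=(1000, 3))
print(esrl_action(samples, expert_action=0, risk_alpha=0.05))
```

In this sketch, lowering `risk_alpha` makes the policy defer to the expert more often, which mirrors the abstract's point that the risk-aversion level can be tailored to the application context.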