Paper Title

Semi-Supervised Off Policy Reinforcement Learning

Authors

Aaron Sonabend-W, Nilanjana Laha, Ashwin N. Ananthakrishnan, Tianxi Cai, Rajarshi Mukherjee

Abstract

Reinforcement learning (RL) has shown great success in estimating sequential treatment strategies that take into account patient heterogeneity. However, health-outcome information, which is used as the reward for reinforcement learning methods, is often not well coded but rather embedded in clinical notes. Extracting precise outcome information is a resource-intensive task, so most of the available well-annotated cohorts are small. To address this issue, we propose a semi-supervised learning (SSL) approach that efficiently leverages a small labeled dataset with the true outcome observed and a large unlabeled dataset with outcome surrogates. In particular, we propose a semi-supervised, efficient approach to Q-learning and doubly robust off-policy value estimation. Generalizing SSL to sequential treatment regimes brings interesting challenges: 1) the feature distribution for Q-learning is unknown, as it includes previous outcomes; 2) the surrogate variables we leverage in the modified SSL framework are predictive of the outcome but not informative about the optimal policy or value function. We provide theoretical results for our Q-function and value function estimators to understand to what degree efficiency can be gained from SSL. Our method is at least as efficient as the supervised approach, and moreover safe, as it is robust to mis-specification of the imputation models.
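To make the idea concrete, below is a minimal, single-stage Python sketch of the general workflow the abstract describes: fit an imputation model for the outcome from covariates, treatment, and surrogates on the small labeled cohort, impute outcomes for the large unlabeled cohort, run Q-learning on the combined sample, and compute a doubly robust (AIPW-style) value estimate for the learned policy on the labeled data. The simulated data, variable names, and linear/logistic working models are illustrative assumptions; this is not the authors' implementation and it omits the multi-stage backward induction and the efficiency theory developed in the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# --- Simulated data (illustrative only; not from the paper) ---
n_lab, n_unlab, p = 200, 2000, 5
X_lab = rng.normal(size=(n_lab, p))                # covariates, labeled cohort
A_lab = rng.integers(0, 2, size=n_lab)             # binary treatment
Y_lab = X_lab[:, 0] + A_lab * X_lab[:, 1] + rng.normal(size=n_lab)   # true outcome
S_lab = Y_lab + rng.normal(scale=0.5, size=n_lab)                    # outcome surrogate

X_unlab = rng.normal(size=(n_unlab, p))            # unlabeled cohort: no Y observed
A_unlab = rng.integers(0, 2, size=n_unlab)
Y_unlab_true = X_unlab[:, 0] + A_unlab * X_unlab[:, 1] + rng.normal(size=n_unlab)
S_unlab = Y_unlab_true + rng.normal(scale=0.5, size=n_unlab)         # only surrogate observed

# Step 1: imputation model for Y given (X, A, S), fit on the labeled data.
def imp_features(X, A, S):
    return np.column_stack([X, A, S, A[:, None] * X])

imp_model = LinearRegression().fit(imp_features(X_lab, A_lab, S_lab), Y_lab)
Y_unlab_hat = imp_model.predict(imp_features(X_unlab, A_unlab, S_unlab))

# Step 2: Q-learning (single stage) on the combined labeled + imputed sample.
def q_features(X, A):
    return np.column_stack([X, A, A[:, None] * X])

X_all = np.vstack([X_lab, X_unlab])
A_all = np.concatenate([A_lab, A_unlab])
Y_all = np.concatenate([Y_lab, Y_unlab_hat])
q_model = LinearRegression().fit(q_features(X_all, A_all), Y_all)

# Step 3: greedy policy derived from the fitted Q-function.
def greedy_policy(X):
    q0 = q_model.predict(q_features(X, np.zeros(len(X), dtype=int)))
    q1 = q_model.predict(q_features(X, np.ones(len(X), dtype=int)))
    return (q1 > q0).astype(int)

pi_hat = greedy_policy(X_lab)

# Step 4: doubly robust (AIPW) value estimate of pi_hat on the labeled data,
# combining the Q-model with an estimated propensity score.
prop_model = LogisticRegression().fit(X_lab, A_lab)
p_obs = prop_model.predict_proba(X_lab)[np.arange(n_lab), A_lab]  # P(A = observed A | X)
q_obs = q_model.predict(q_features(X_lab, A_lab))
q_pi = q_model.predict(q_features(X_lab, pi_hat))
value_dr = np.mean(q_pi + (A_lab == pi_hat) / p_obs * (Y_lab - q_obs))
print(f"Doubly robust value estimate of the learned policy: {value_dr:.3f}")
```

The key design point the sketch illustrates is that the surrogates S enter only through the imputation model (Step 1), while the Q-function and policy in Steps 2 and 3 depend on covariates and treatment alone, matching the abstract's premise that surrogates are predictive of the outcome but not informative about the optimal policy.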
