Paper Title

Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation

Paper Authors

Chaochao Lu, Biwei Huang, Ke Wang, José Miguel Hernández-Lobato, Kun Zhang, Bernhard Schölkopf

Paper Abstract

Reinforcement learning (RL) algorithms usually require a substantial amount of interaction data and perform well only for specific tasks in a fixed environment. In some scenarios such as healthcare, however, usually only a few records are available for each patient, and patients may show different responses to the same treatment, impeding the application of current RL algorithms to learn optimal policies. To address the issues of mechanism heterogeneity and related data scarcity, we propose a data-efficient RL algorithm that exploits structural causal models (SCMs) to model the state dynamics, which are estimated by leveraging both commonalities and differences across subjects. The learned SCM enables us to counterfactually reason what would have happened had another treatment been taken. It helps avoid real (possibly risky) exploration and mitigates the issue that limited experiences lead to biased policies. We propose counterfactual RL algorithms to learn both population-level and individual-level policies. We show that counterfactual outcomes are identifiable under mild conditions and that Q-learning on the counterfactual-based augmented data set converges to the optimal value function. Experimental results on synthetic and real-world data demonstrate the efficacy of the proposed approach.
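The abstract describes fitting a structural causal model of the state dynamics and then reasoning counterfactually about untaken treatments to augment the data used for Q-learning. Below is a minimal, hypothetical sketch of that general idea, not the authors' implementation: it assumes a one-dimensional state, a linear additive-noise SCM s' = w_a*s + b_a + u, a known reward function, and tabular Q-learning on a discretized state space; the names fit_scm, counterfactual_augment, and q_learning are illustrative only.

```python
# Hypothetical sketch of counterfactual-based data augmentation for RL.
# Assumes a linear additive-noise SCM  s' = w[a]*s + b[a] + u  and tabular
# Q-learning; this is NOT the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_BINS, GAMMA = 2, 10, 0.9

def fit_scm(trans):
    """Estimate per-action linear dynamics f_a(s) = w[a]*s + b[a] by least squares."""
    params = {}
    for a in range(N_ACTIONS):
        sa = np.array([(s, s2) for s, act, s2 in trans if act == a])
        X = np.stack([sa[:, 0], np.ones(len(sa))], axis=1)
        w, b = np.linalg.lstsq(X, sa[:, 1], rcond=None)[0]
        params[a] = (w, b)
    return params

def counterfactual_augment(trans, params):
    """Abduction-action-prediction: recover the exogenous noise u from each
    observed transition, then predict the next state under the untaken action."""
    aug = list(trans)
    for s, a, s2 in trans:
        w, b = params[a]
        u = s2 - (w * s + b)                              # abduction: infer noise
        for a_cf in range(N_ACTIONS):
            if a_cf == a:
                continue
            w_cf, b_cf = params[a_cf]
            aug.append((s, a_cf, w_cf * s + b_cf + u))    # counterfactual prediction
    return aug

def q_learning(trans, reward, lr=0.1, epochs=50):
    """Tabular Q-learning on discretized states over the (augmented) batch."""
    Q = np.zeros((N_BINS, N_ACTIONS))
    to_bin = lambda s: int(np.clip(s * N_BINS, 0, N_BINS - 1))
    for _ in range(epochs):
        for s, a, s2 in trans:
            i, j = to_bin(s), to_bin(s2)
            target = reward(s, a) + GAMMA * Q[j].max()
            Q[i, a] += lr * (target - Q[i, a])
    return Q

# Toy data: per-action dynamics plus subject-specific noise u.
states = rng.uniform(0, 1, 200)
actions = rng.integers(0, N_ACTIONS, 200)
noise = rng.normal(0, 0.05, 200)
true_w, true_b = [0.9, 0.5], [0.05, 0.4]
next_states = np.clip(
    np.array(true_w)[actions] * states + np.array(true_b)[actions] + noise, 0, 1)
transitions = list(zip(states, actions, next_states))

scm = fit_scm(transitions)
augmented = counterfactual_augment(transitions, scm)
Q = q_learning(augmented, reward=lambda s, a: float(s > 0.5))
print("Greedy policy per state bin:", Q.argmax(axis=1))
```

The key step in the sketch is abduction: the noise term u recovered from an observed transition is reused when predicting the outcome of the untaken action, so the augmented transitions carry over individual-specific variation instead of only the population-average dynamics, without requiring any new (possibly risky) real exploration.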
