DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret

Paper title

DTR Bandit: Learning to Make Response-Adaptive Decisions With Low Regret

Authors

Hu, Yichun; Kallus, Nathan

Abstract

Dynamic treatment regimes (DTRs) are personalized, adaptive, multi-stage treatment plans that adapt treatment decisions both to an individual's initial features and to intermediate outcomes and features at each subsequent stage, which are affected by decisions in prior stages. Examples include personalized first- and second-line treatments of chronic conditions like diabetes, cancer, and depression, which adapt to patient response to first-line treatment, disease progression, and individual characteristics. While existing literature mostly focuses on estimating the optimal DTR from offline data such as from sequentially randomized trials, we study the problem of developing the optimal DTR in an online manner, where the interaction with each individual affects both our cumulative reward and our data collection for future learning. We term this the DTR bandit problem. We propose a novel algorithm that, by carefully balancing exploration and exploitation, is guaranteed to achieve rate-optimal regret when the transition and reward models are linear. We demonstrate our algorithm and its benefits both in synthetic experiments and in a case study of adaptive treatment of major depressive disorder using real-world data.
