Title
On State Variables, Bandit Problems and POMDPs
Authors
Abstract
State variables are easily the most subtle dimension of sequential decision problems. This is especially true in the context of active learning problems ("bandit problems"), where decisions affect what we observe and learn. We describe our canonical framework that models {\it any} sequential decision problem, and present our definition of state variables, which allows us to claim: any properly modeled sequential decision problem is Markovian. We then present a novel two-agent perspective of partially observable Markov decision problems (POMDPs) that allows us to claim: any model of a real decision problem is (possibly) non-Markovian. We illustrate these perspectives in the context of observing and treating flu in a population, and provide examples of all four classes of policies in this setting. We close with an indication of how to extend this thinking to multiagent problems.