Paper title
To update or not to update? Delayed Nonparametric Bandits with Randomized Allocation
Authors
Abstract
The delayed-reward problem in contextual bandits has drawn interest in a variety of practical settings. We study randomized allocation strategies and examine how the exploration-exploitation tradeoff is affected by delays in observing rewards. In randomized strategies, the extent of exploration versus exploitation is controlled by a user-determined exploration probability sequence. In the presence of delayed rewards, one may either use the original exploration sequence, which updates at every time point, or update the sequence only when a new reward is observed, leading to two competing strategies. In this work, we show that while both strategies may lead to strong consistency in allocation, the property holds in a wider range of situations for the latter. For finite-sample performance, however, we illustrate that each strategy has its own advantages and disadvantages, depending on the severity of the delay and the underlying reward-generating mechanism.
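The difference between the two competing strategies can be made concrete with a small sketch. The code below is illustrative only and is not the paper's algorithm: the exploration sequence `exploration_prob` (here 1/sqrt(n)), the function `exploration_schedule`, and the constant delay of two steps are all assumptions chosen for the example. It shows only which index of the exploration sequence each strategy would use at each time step, not the allocation rule or the nonparametric reward estimation.

```python
import math


def exploration_prob(n: int) -> float:
    """Hypothetical decreasing exploration sequence (an assumption, not from the paper)."""
    return min(1.0, 1.0 / math.sqrt(n))


def exploration_schedule(arrival_times, update_on_observation):
    """Exploration probability used at each decision time t = 1, ..., T.

    arrival_times[s - 1] is the time at which the reward from the decision
    made at time s becomes observable (assumed to be >= s). A reward counts
    as available at time t if it arrived strictly before t.
    """
    probs = []
    for t in range(1, len(arrival_times) + 1):
        if update_on_observation:
            # Advance the sequence only once per observed reward.
            observed = sum(1 for s in range(1, t) if arrival_times[s - 1] < t)
            index = observed + 1
        else:
            # Advance the sequence at every time point, regardless of delays.
            index = t
        probs.append(exploration_prob(index))
    return probs


# Illustrative scenario: every reward arrives two steps after its decision.
arrival_times = [t + 2 for t in range(1, 7)]
every_step = exploration_schedule(arrival_times, update_on_observation=False)
on_observation = exploration_schedule(arrival_times, update_on_observation=True)

for t, (a, b) in enumerate(zip(every_step, on_observation), start=1):
    print(f"t={t}: every-step={a:.3f}, on-observation={b:.3f}")
```

Under this assumed decreasing sequence, the strategy that updates only on observation never explores less than the every-step strategy at a given time, because its index lags behind whenever rewards are still outstanding. Whether that extra exploration helps or hurts in finite samples is the question the abstract says depends on the delay severity and the reward-generating mechanism.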