Paper Title


Dynamic Memory for Interpretable Sequential Optimisation

Authors

Srivas Chennu, Andrew Maher, Jamie Martin, Subash Prabanantham

Abstract


Real-world applications of reinforcement learning for recommendation and experimentation face a practical challenge: the relative reward of different bandit arms can evolve over the lifetime of the learning agent. To deal with these non-stationary cases, the agent must forget some historical knowledge, as it may no longer be relevant to minimising regret. We present a solution for handling non-stationarity that is suitable for deployment at scale, to provide business operators with automated adaptive optimisation. Our solution aims to provide interpretable learning that can be trusted by humans, whilst responding to non-stationarity to minimise regret. To this end, we develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory. It enables interpretability through statistical hypothesis testing, by targeting a set point of statistical power when comparing rewards and adjusting its memory dynamically to achieve this power. By design, the agent is agnostic to different kinds of non-stationarity. Using numerical simulations, we compare its performance against an existing proposal and show that, under multiple non-stationary scenarios, our agent correctly adapts to real changes in the true rewards. In all bandit solutions, there is an explicit trade-off between learning and achieving maximal performance. Our solution sits at a different point on this trade-off when compared to another similarly robust approach: we prioritise interpretability, which relies on more learning, at the cost of some regret. We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service where our agent achieves interpretability whilst adapting to changing circumstances.
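To make the core idea concrete, the following is a minimal, hypothetical sketch of a Bayesian bandit agent with bounded memory. It is not the paper's actual algorithm: it uses a simple fixed-size sliding window per arm as a stand-in for the dynamic memory described in the abstract (the paper's agent instead tunes its memory to hit a statistical-power set point). The class name, window size, and simulation parameters are all illustrative assumptions.

```python
import random
from collections import deque


class WindowedThompsonAgent:
    """Beta-Bernoulli Thompson sampling with a sliding memory window.

    Illustrative sketch only: bounded per-arm memory lets the agent
    forget stale observations, so it can re-adapt when the true arm
    rewards change (non-stationarity). The fixed `window` here stands
    in for the dynamically adjusted memory described in the abstract.
    """

    def __init__(self, n_arms: int, window: int = 200, seed: int = 0):
        self.rng = random.Random(seed)
        # One bounded deque of 0/1 rewards per arm; old samples fall out.
        self.memory = [deque(maxlen=window) for _ in range(n_arms)]

    def select_arm(self) -> int:
        # Draw a plausible reward rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior) and play the arm with the highest draw.
        draws = []
        for obs in self.memory:
            successes = sum(obs)
            failures = len(obs) - successes
            draws.append(self.rng.betavariate(successes + 1, failures + 1))
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm: int, reward: int) -> None:
        self.memory[arm].append(reward)


def simulate_switch(steps_per_phase: int = 1000, seed: int = 0):
    """Two-arm environment whose best arm switches halfway through.

    Returns the arms the agent chose during the final 200 steps, to
    check whether it adapted to the change in true rewards.
    """
    env_rng = random.Random(seed + 1)
    agent = WindowedThompsonAgent(n_arms=2, window=200, seed=seed)
    phases = [(0.8, 0.2), (0.2, 0.8)]  # arm 0 best, then arm 1 best
    choices = []
    for probs in phases:
        for _ in range(steps_per_phase):
            arm = agent.select_arm()
            reward = 1 if env_rng.random() < probs[arm] else 0
            agent.update(arm, reward)
            choices.append(arm)
    return choices[-200:]
```

Because each arm's memory is bounded, observations from before the switch age out of the window, the posterior for the formerly best arm collapses toward its new (lower) reward rate, and exploration rediscovers the new best arm; with unbounded memory the stale evidence would dominate for much longer.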
