缓存替换为mab，反馈和衰减成本延迟

论文标题

缓存替换为mab，反馈和衰减成本延迟

Cache Replacement as a MAB with Delayed Feedback and Decaying Costs

论文作者

Yusuf, Farzana Beente, Stebliankin, Vitalii, Vietri, Giuseppe, Narasimhan, Giri

论文摘要

受缓存替换问题的启发，我们提出并解决了著名的多武器强盗（MAB）的新变体，从而为改善现有的最新缓存管理方法提供了解决方案。每个ARM（或专家）代表一个独特的缓存替换策略，该策略建议页面上的页面在需要时驱逐缓存。驱逐的反馈是以“错过”的形式出现的，但是在采取行动后不确定的时间，驱逐的成本与响应时间呈颠倒。如果反馈是在延迟的阈值之后出现的，我们将其设置为页面驱逐历史记录的大小。因此，对于超出阈值的延迟，其成本被认为为零。因此，我们以延迟的反馈和腐烂的成本称之为这个问题。我们引入了一种自适应增强学习算法EXP4-DFDC，该算法为问题提供了解决方案。我们得出了EXP4-DFDC的最佳学习率，该学习率定义了勘探和剥削之间的平衡，从理论上讲，我们算法的预期后悔是随着时间的变化而消失的数量。作为一个应用程序，我们表明，Lecar是一种最近表现最佳的机器学习算法，可通过使用我们的配方进行自适应学习来增强Lecar。我们提出了一种改进的自适应版本的LECAR，称为Olecar，其学习率由此处提出的理论推导确定，以最大程度地减少EXP4-DFDC的遗憾。然后，从理论上讲，Lecar和Olecar随着时间的流逝而消失了。

Inspired by the cache replacement problem, we propose and solve a new variant of the well-known multi-armed bandit (MAB), thus providing a solution for improving existing state-of-the-art cache management methods. Each arm (or expert) represents a distinct cache replacement policy, which advises on the page to evict from the cache when needed. Feedback on the eviction comes in the form of a "miss", but at an indeterminate time after the action is taken, and the cost of the eviction is set to be inversely proportional to the response time. The feedback is ignored if it comes after a threshold value for the delay, which we set to be equal to the size of the page eviction history. Thus, for delays beyond the threshold, its cost is assumed to be zero. Consequently, we call this problem with delayed feedback and decaying costs. We introduce an adaptive reinforcement learning algorithm EXP4-DFDC that provides a solution to the problem. We derive an optimal learning rate for EXP4-DFDC that defines the balance between exploration and exploitation and proves theoretically that the expected regret of our algorithm is a vanishing quantity as a function of time. As an application, we show that LeCaR, a recent top-performing machine learning algorithm for cache replacement, can be enhanced with adaptive learning using our formulations. We present an improved adaptive version of LeCaR, called OLeCaR, with the learning rate set as determined by the theoretical derivation presented here to minimize regret for EXP4-DFDC. It then follows that LeCaR and OLeCaR are theoretically guaranteed to have vanishing regret over time.

下载PDF全文

下载文献需遵守相关版权规定

论文标题