Paper Title

SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

Authors

Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei Zhang, Ting Cao, Wei Shen

Abstract

Ad relevance modeling plays a critical role in online advertising systems including Microsoft Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing approaches perform ad-side computations offline. While efficient, these approaches are unable to serve cold start ads, resulting in poor relevance predictions for such ads. This work aims to design a new, low-latency BERT via structured pruning to empower real-time online inference for cold start ads relevance on a CPU platform. Our challenge is that previous methods typically prune all layers of the transformer to a high, uniform sparsity, thereby producing models which cannot achieve satisfactory inference speed with an acceptable accuracy. In this paper, we propose SwiftPruner - an efficient framework that leverages evolution-based search to automatically find the best-performing layer-wise sparse BERT model under the desired latency constraint. Different from existing evolution algorithms that conduct random mutations, we propose a reinforced mutator with a latency-aware multi-objective reward to conduct better mutations for efficiently searching the large space of layer-wise sparse models. Extensive experiments demonstrate that our method consistently achieves higher ROC AUC and lower latency than the uniform sparse baseline and state-of-the-art search methods. Remarkably, under our latency requirement of 1900us on CPU, SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform sparse baseline for BERT-Mini on a large-scale real-world dataset. Online A/B testing shows that our model also achieves a significant 11.7% cut in the ratio of defective cold start ads with satisfactory real-time serving latency.
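To make the searched-for setup more concrete, below is a minimal, illustrative sketch of an evolutionary search over per-layer sparsities with a latency-aware multi-objective reward, in the spirit of the abstract. Everything here is an assumption for illustration: the helper names (`evaluate_auc`, `measure_latency_us`), the sparsity choices, the toy AUC/latency proxies, and the reward shape are not from the paper, and SwiftPruner's reinforced mutator (which learns the mutation distribution from past rewards) is only indicated by comments.

```python
import random

# Illustrative search space: one sparsity level per transformer layer of a
# small BERT (e.g., BERT-Mini has 4 layers). All constants and helper names
# below are assumptions for this sketch, not SwiftPruner's actual code.
NUM_LAYERS = 4
SPARSITY_CHOICES = [0.0, 0.25, 0.5, 0.75, 0.9]
LATENCY_BUDGET_US = 1900.0  # the latency constraint quoted in the abstract


def evaluate_auc(sparsity_per_layer):
    # Toy proxy: pretend ROC AUC degrades with average sparsity. A real run
    # would fine-tune the pruned model and evaluate it on held-out data.
    return 0.90 - 0.05 * sum(sparsity_per_layer) / NUM_LAYERS


def measure_latency_us(sparsity_per_layer):
    # Toy proxy: pretend CPU latency shrinks with average sparsity. A real
    # run would benchmark inference of the pruned model on the target CPU.
    return 2600.0 * (1.0 - 0.5 * sum(sparsity_per_layer) / NUM_LAYERS)


def reward(auc, latency_us):
    # Latency-aware multi-objective reward: favor AUC, softly penalize
    # configurations that exceed the latency budget.
    penalty = max(0.0, latency_us / LATENCY_BUDGET_US - 1.0)
    return auc - penalty


def mutate(parent, mutation_weights):
    # SwiftPruner's reinforced mutator learns where and how to mutate from
    # past rewards; here we only sample from fixed per-layer weights.
    child = list(parent)
    layer = random.randrange(NUM_LAYERS)
    child[layer] = random.choices(SPARSITY_CHOICES, weights=mutation_weights[layer])[0]
    return child


def evolutionary_search(iterations=200, population_size=20):
    # Start with uniform mutation weights; a reinforced mutator would update
    # these (e.g., by policy gradient) toward sparsities that earned reward.
    mutation_weights = [[1.0] * len(SPARSITY_CHOICES) for _ in range(NUM_LAYERS)]
    population = [
        [random.choice(SPARSITY_CHOICES) for _ in range(NUM_LAYERS)]
        for _ in range(population_size)
    ]
    scored = [(reward(evaluate_auc(p), measure_latency_us(p)), p) for p in population]

    for _ in range(iterations):
        scored.sort(key=lambda x: x[0], reverse=True)
        parent = random.choice(scored[: max(1, population_size // 4)])[1]
        child = mutate(parent, mutation_weights)
        scored.append((reward(evaluate_auc(child), measure_latency_us(child)), child))
        scored = sorted(scored, key=lambda x: x[0], reverse=True)[:population_size]

    return scored[0]


if __name__ == "__main__":
    best_reward, best_config = evolutionary_search()
    print("best per-layer sparsity:", best_config, "reward:", round(best_reward, 4))
```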
