Paper Title

Toward Understanding Privileged Features Distillation in Learning-to-Rank

Paper Authors

Shuo Yang, Sujay Sanghavi, Holakou Rahmanian, Jan Bakus, S. V. N. Vishwanathan

Paper Abstract

In learning-to-rank problems, a privileged feature is one that is available during model training, but not available at test time. Such features naturally arise in merchandised recommendation systems; for instance, "user clicked this item" as a feature is predictive of "user purchased this item" in the offline data, but is clearly not available during online serving. Another source of privileged features is those that are too expensive to compute online but feasible to be added offline. Privileged features distillation (PFD) refers to a natural idea: train a "teacher" model using all features (including privileged ones) and then use it to train a "student" model that does not use the privileged features. In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon's logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models. Both investigations uncover an interesting non-monotone behavior: as the predictive power of a privileged feature increases, the performance of the resulting student model initially increases but then decreases. We show the reason for the later decreasing performance is that a very predictive privileged teacher produces predictions with high variance, which lead to high variance student estimates and inferior testing performance.
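To make the PFD recipe in the abstract concrete, here is a minimal, hypothetical sketch in Python. It is not the authors' implementation: the synthetic data, the choice of ridge regression, and the distillation weight `alpha` are all assumptions made purely for illustration. A teacher is fit on regular plus privileged features, and a student is then fit on the regular features only, regressing toward a mixture of the true labels and the teacher's predictions.

```python
# Minimal sketch of privileged features distillation (PFD) on synthetic data.
# Illustrative only; data, model choice (ridge regression), and the
# distillation weight `alpha` are assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic setup: x = regular features (available at train and test time),
# x_priv = privileged feature (available only at training time), y = label.
n, d = 5000, 10
x = rng.normal(size=(n, d))
x_priv = x @ rng.normal(size=d) + 0.5 * rng.normal(size=n)       # privileged feature
y = 2.0 * x_priv + x @ rng.normal(size=d) + rng.normal(size=n)   # label partly driven by x_priv

x_tr, x_te, xp_tr, _, y_tr, y_te = train_test_split(
    x, x_priv, y, test_size=0.2, random_state=0
)

# 1) Teacher: trained with regular + privileged features.
teacher = Ridge(alpha=1.0).fit(np.column_stack([x_tr, xp_tr]), y_tr)
teacher_preds = teacher.predict(np.column_stack([x_tr, xp_tr]))

# 2) Student: trained on regular features only, distilling the teacher's predictions.
#    `alpha` trades off the true labels against the teacher's soft targets (assumed value).
alpha = 0.5
student_targets = (1 - alpha) * y_tr + alpha * teacher_preds
student = Ridge(alpha=1.0).fit(x_tr, student_targets)

# Baseline: no distillation, student trained directly on the labels.
baseline = Ridge(alpha=1.0).fit(x_tr, y_tr)

# At test time only the regular features are available.
print("PFD student test MSE:    ", np.mean((student.predict(x_te) - y_te) ** 2))
print("No-distillation test MSE:", np.mean((baseline.predict(x_te) - y_te) ** 2))
```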
