Paper Title

Contrastive Distillation Is a Sample-Efficient Self-Supervised Loss Policy for Transfer Learning

Paper Authors

Chris Lengerich, Gabriel Synnaeve, Amy Zhang, Hugh Leather, Kurt Shuster, François Charton, Charysse Redwood

Paper Abstract

Traditional approaches to RL have focused on learning decision policies directly from episodic decisions, while slowly and implicitly learning the semantics of compositional representations needed for generalization. While some approaches have been adopted to refine representations via auxiliary self-supervised losses while simultaneously learning decision policies, learning compositional representations from hand-designed and context-independent self-supervised losses (multi-view) still adapts relatively slowly to the real world, which contains many non-IID subspaces requiring rapid distribution shift in both time and spatial attention patterns at varying levels of abstraction. In contrast, supervised language model cascades have shown the flexibility to adapt to many diverse manifolds, and hints of self-learning needed for autonomous task transfer. However, to date, transfer methods for language models like few-shot learning and fine-tuning still require human supervision, and transfer learning using self-learning methods has been underexplored. We propose a self-supervised loss policy called contrastive distillation which manifests latent variables with high mutual information with both source and target tasks from weights to tokens. We show how this outperforms common methods of transfer learning and suggests a useful design axis of trading off compute for generalizability for online transfer. Contrastive distillation is improved through sampling from memory and suggests a simple algorithm for more efficiently sampling negative examples for contrastive losses than random sampling.
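To make the last point concrete, here is a minimal PyTorch sketch of an InfoNCE-style contrastive loss whose negatives are drawn from a memory buffer of past embeddings rather than sampled uniformly at random. This is an illustration of the general idea only, not the paper's algorithm; the `MemoryBank` class, the `info_nce_with_memory` function, and all hyperparameters (buffer size, `k`, temperature `tau`) are hypothetical names chosen for this example.

```python
# Illustrative sketch: contrastive (InfoNCE) loss with memory-sampled negatives.
# All names and hyperparameters here are assumptions, not taken from the paper.
import torch
import torch.nn.functional as F


class MemoryBank:
    """Fixed-size FIFO buffer of past embeddings used as candidate negatives."""

    def __init__(self, dim: int, size: int = 4096):
        self.buffer = torch.randn(size, dim)
        self.ptr = 0

    def update(self, embeddings: torch.Tensor) -> None:
        # Overwrite the oldest slots with the newest embeddings.
        n = embeddings.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.buffer.shape[0]
        self.buffer[idx] = embeddings.detach()
        self.ptr = (self.ptr + n) % self.buffer.shape[0]

    def sample_hard_negatives(self, anchors: torch.Tensor, k: int) -> torch.Tensor:
        # Pick the k stored embeddings most similar to each anchor, a simple
        # "harder than uniform random" negative-sampling heuristic.
        sims = F.normalize(anchors, dim=-1) @ F.normalize(self.buffer, dim=-1).T
        top_idx = sims.topk(k, dim=-1).indices           # (batch, k)
        return self.buffer[top_idx]                      # (batch, k, dim)


def info_nce_with_memory(anchor, positive, memory: MemoryBank, k=64, tau=0.1):
    """InfoNCE loss where negatives come from the memory bank (illustrative)."""
    a = F.normalize(anchor, dim=-1)                      # (batch, dim)
    p = F.normalize(positive, dim=-1)                    # (batch, dim)
    negs = F.normalize(memory.sample_hard_negatives(anchor, k), dim=-1)

    pos_logit = (a * p).sum(-1, keepdim=True) / tau      # (batch, 1)
    neg_logits = torch.einsum("bd,bkd->bk", a, negs) / tau
    logits = torch.cat([pos_logit, neg_logits], dim=1)   # positive is class 0
    labels = torch.zeros(logits.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)
```

Under this reading, "sampling from memory" replaces uniform random negatives with negatives retrieved from stored representations, which tends to make each contrastive update more informative per sample.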
