Paper Title

Blessing of Class Diversity in Pre-training

Paper Authors

Yulai Zhao, Jianshu Chen, Simon S. Du

Paper Abstract

This paper presents a new statistical analysis aiming to explain the recent superior achievements of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.
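To make the class-diversity condition concrete, below is a minimal sketch (not the authors' code) of how one might measure the quantity $\tilde{\nu}$, the least singular value of a pre-trained model's last linear layer. The matrix shape, the `least_singular_value` helper, and the randomly generated head are illustrative assumptions; the paper may define or normalize this quantity differently.

```python
# Minimal sketch: compute the class-diversity quantity (least singular value)
# of a last linear layer's weight matrix. Shapes and names are hypothetical.
import numpy as np

def least_singular_value(last_layer_weights: np.ndarray) -> float:
    """Return the smallest singular value of the last linear layer's weights.

    last_layer_weights: array of shape (num_classes, feature_dim), e.g. the
    output-projection weights of a masked-language-model head.
    """
    singular_values = np.linalg.svd(last_layer_weights, compute_uv=False)
    return float(singular_values.min())

# Example: a head with many diverse classes tends to have a larger least
# singular value (denoted nu_tilde), which under the paper's bound
# O(1 / (nu_tilde * sqrt(n))) corresponds to a smaller transfer excess risk.
rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 768)) / np.sqrt(768)  # hypothetical MLM head
nu_tilde = least_singular_value(W)
print(f"least singular value (class-diversity measure): {nu_tilde:.3f}")
```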
