Paper Title

Blessing of Class Diversity in Pre-training

Paper Authors

Yulai Zhao, Jianshu Chen, Simon S. Du

Paper Abstract

This paper presents a new statistical analysis aiming to explain the recent superior achievements of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted as $\tilde{\nu}$) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show the transfer learning excess risk enjoys an $O\left(\frac{1}{\tilde{\nu} \sqrt{n}}\right)$ rate, in contrast to the $O\left(\frac{1}{\sqrt{m}}\right)$ rate in standard supervised learning. Here, $n$ is the number of pre-training data and $m$ is the number of data in the downstream task, and typically $n \gg m$. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques can be of independent interest.
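To make the class-diversity condition concrete, below is a minimal sketch (not the authors' code) of how one might measure the quantity $\tilde{\nu}$, the least singular value of a pre-trained model's last linear layer. The matrix shape, the `least_singular_value` helper, and the randomly generated head are illustrative assumptions; the paper may define or normalize this quantity differently.

```python
# Minimal sketch: compute the class-diversity quantity (least singular value)
# of a last linear layer's weight matrix. Shapes and names are hypothetical.
import numpy as np

def least_singular_value(last_layer_weights: np.ndarray) -> float:
    """Return the smallest singular value of the last linear layer's weights.

    last_layer_weights: array of shape (num_classes, feature_dim), e.g. the
    output-projection weights of a masked-language-model head.
    """
    singular_values = np.linalg.svd(last_layer_weights, compute_uv=False)
    return float(singular_values.min())

# Example: a head with many diverse classes tends to have a larger least
# singular value (denoted nu_tilde), which under the paper's bound
# O(1 / (nu_tilde * sqrt(n))) corresponds to a smaller transfer excess risk.
rng = np.random.default_rng(0)
W = rng.standard_normal((5000, 768)) / np.sqrt(768)  # hypothetical MLM head
nu_tilde = least_singular_value(W)
print(f"least singular value (class-diversity measure): {nu_tilde:.3f}")
```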
