Paper Title
CPM: A Large-scale Generative Chinese Pre-trained Language Model
Paper Authors
Paper Abstract
Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3, with 175 billion parameters and 570GB of training data, has drawn considerable attention due to its capacity for few-shot (and even zero-shot) learning. However, applying GPT-3 to Chinese NLP tasks remains challenging, as the training corpus of GPT-3 is primarily English and its parameters are not publicly available. In this technical report, we release the Chinese Pre-trained Language Model (CPM), built with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100GB of Chinese training data, is the largest Chinese pre-trained language model, and it could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation, cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many NLP tasks in few-shot (and even zero-shot) settings. The code and parameters are available at https://github.com/TsinghuaAI/CPM-Generate.
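The abstract highlights zero-shot use: because CPM is a generative language model, a downstream task can be cast as plain text continuation with no fine-tuning. Below is a minimal sketch of that idea, assuming the released checkpoint is mirrored on the Hugging Face Hub under the model id "TsinghuaAI/CPM-Generate" and loads through the standard causal-LM interface; the official repository at https://github.com/TsinghuaAI/CPM-Generate may expose a different loading path.

```python
# Minimal zero-shot generation sketch with CPM.
# Assumptions: the "TsinghuaAI/CPM-Generate" Hub checkpoint exists and its
# tokenizer needs the jieba and sentencepiece packages to be installed.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TsinghuaAI/CPM-Generate")
model = AutoModelForCausalLM.from_pretrained("TsinghuaAI/CPM-Generate")

# Zero-shot: condition the model on a Chinese prompt and sample a
# continuation, without any task-specific fine-tuning.
prompt = "清华大学"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,   # total length (prompt + continuation) in tokens
    do_sample=True,  # sample rather than decode greedily
    top_p=0.9,       # nucleus sampling threshold
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern covers the tasks named in the abstract: a cloze test becomes a prompt ending where the blank starts, and few-shot learning prepends a handful of in-context examples to the prompt before generation.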