Paper Title
Cramming: Training a Language Model on a Single GPU in One Day
Paper Authors
Paper Abstract
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question: How far can we get with a single GPU in just one day? We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
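The setup described in the abstract, pretraining a masked language model from scratch under a fixed wall-clock budget on one GPU, can be illustrated with a minimal sketch. The model sizes, batch size, masking rate, vocabulary, and the synthetic token stream below are illustrative assumptions, not the paper's actual recipe.

```python
# Minimal sketch: masked-language-model pretraining under a one-day wall-clock budget.
# All hyperparameters and the synthetic data are assumptions for illustration only.
import time
import torch
import torch.nn as nn

VOCAB_SIZE, SEQ_LEN, MASK_ID = 32768, 128, 4   # assumed vocabulary / sequence settings
BUDGET_SECONDS = 24 * 60 * 60                  # the one-day compute budget
device = "cuda" if torch.cuda.is_available() else "cpu"

class TinyMLM(nn.Module):
    """Small transformer encoder with an LM head (illustrative sizes)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        hidden = self.encoder(self.embed(tokens) + self.pos(positions))
        return self.lm_head(hidden)

model = TinyMLM().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

start = time.time()
while time.time() - start < BUDGET_SECONDS:    # train until the wall-clock budget is spent
    # Synthetic token batch stands in for a real pretraining corpus.
    tokens = torch.randint(5, VOCAB_SIZE, (32, SEQ_LEN), device=device)
    labels = tokens.clone()
    mask = torch.rand_like(tokens, dtype=torch.float) < 0.15  # mask ~15% of positions
    labels[~mask] = -100                       # only masked positions contribute to the loss
    inputs = tokens.masked_fill(mask, MASK_ID)

    logits = model(inputs)
    loss = loss_fn(logits.view(-1, VOCAB_SIZE), labels.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The defining constraint is the outer loop: training stops when the time budget is exhausted rather than after a fixed number of steps or epochs, which is the regime the paper's pipeline modifications are evaluated in.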