Paper Title

The case for 4-bit precision: k-bit Inference Scaling Laws

Paper Authors

Tim Dettmers, Luke Zettlemoyer

Paper Abstract

Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model size depends on both the number of parameters of the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit-precision and model size that maximizes zero-shot performance. We run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3 to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off, with the only improvements being the use of a small block size -- splitting the parameters into small independently quantized blocks -- and the quantization data type being used (e.g., Int vs Float). Overall, our findings show that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
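
To make the "small block size" idea from the abstract concrete, below is a minimal sketch of blockwise absmax k-bit integer quantization, assuming a NumPy implementation; the function names and defaults are hypothetical illustrations, not code from the paper. Each block of `block_size` parameters gets its own scaling constant, so smaller blocks isolate outliers more tightly, which is the mechanism the abstract credits for improved bit-level scaling.

```python
import numpy as np

def blockwise_absmax_quantize(weights, bits=4, block_size=64):
    """Quantize a 1-D float array to signed k-bit integers with one absmax scale per block.

    Hypothetical sketch: absmax (symmetric) quantization is one of several
    data types the paper studies; this is not the paper's reference code.
    """
    n = weights.size
    pad = (-n) % block_size  # pad so the weights divide evenly into blocks
    blocks = np.concatenate(
        [weights, np.zeros(pad, dtype=weights.dtype)]
    ).reshape(-1, block_size)

    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for signed 4-bit ints
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one scale per block
    scales[scales == 0] = 1.0                         # guard against all-zero blocks
    q = np.round(blocks / scales * qmax).astype(np.int8)
    return q, scales, n                               # n lets us strip padding later

def blockwise_dequantize(q, scales, n, bits=4):
    """Map the k-bit integers back to floats using the per-block scales."""
    qmax = 2 ** (bits - 1) - 1
    return (q.astype(np.float32) / qmax * scales).reshape(-1)[:n]
```

As a rough cost estimate under these assumptions, storing one 16-bit scale per 64-parameter block adds 16/64 = 0.25 bits per parameter, so a "4-bit" model effectively spends about 4.25 bits per parameter; the abstract's finding implies this overhead is outweighed by the accuracy gained from finer-grained scales.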
