Paper Title
Quantized Neural Network Inference with Precision Batching
Paper Authors
Paper Abstract
We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without retraining/recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
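To make the bitlayer decomposition concrete, the following minimal numpy sketch (our illustration, not the authors' GPU kernel; the function name and the symmetric fixed-point quantization scheme are assumptions) approximates a full-precision matrix-vector product by quantizing the weights to num_bits signed fixed point, peeling the quantized matrix into {0,1} bitlayers, and accumulating each bitlayer's product with the full-precision activation, weighted by its power of two:

    import numpy as np

    def precision_batched_matvec(W, x, num_bits=8):
        # Symmetric fixed-point quantization of the weights.
        scale = np.abs(W).max() / (2 ** (num_bits - 1) - 1)
        q = np.clip(np.round(W / scale),
                    -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
        # Shift into the unsigned range [0, 2^num_bits) so every
        # bitlayer is a plain {0,1} matrix (two's-complement view).
        u = (q + 2 ** (num_bits - 1)).astype(np.int64)

        acc = np.zeros(W.shape[0])
        for k in range(num_bits):
            bitlayer = (u >> k) & 1           # one 1-bit weight layer
            acc += (bitlayer @ x) * (2 ** k)  # 1-bit layer against full-precision x
        # Undo the two's-complement shift applied above.
        acc -= (2 ** (num_bits - 1)) * x.sum()
        return scale * acc

    # Usage: error shrinks as num_bits grows.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128))
    x = rng.standard_normal(128)
    print(np.abs(precision_batched_matvec(W, x) - W @ x).max())

Truncating the loop to only the most significant bitlayers is the runtime accuracy/speedup knob the abstract describes; on actual hardware, each per-bitlayer product would map onto fast 1-bit kernels rather than the dense integer matmul used in this sketch.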