Paper Title
Quantized Neural Network Inference with Precision Batching
Paper Authors
Paper Abstract
We present PrecisionBatching, a quantized inference algorithm for speeding up neural network execution on traditional hardware platforms at low bitwidths without the need for retraining or recalibration. PrecisionBatching decomposes a neural network into individual bitlayers and accumulates them using fast 1-bit operations while maintaining activations in full precision. PrecisionBatching not only facilitates quantized inference at low bitwidths (< 8 bits) without retraining/recalibration, but also 1) enables traditional hardware platforms to realize inference speedups at a finer granularity of quantization (e.g., 1-16 bit execution) and 2) allows accuracy and speedup tradeoffs at runtime by exposing the number of bitlayers to accumulate as a tunable parameter. Across a variety of applications (MNIST, language modeling, natural language inference) and neural network architectures (fully connected, RNN, LSTM), PrecisionBatching yields end-to-end speedups of over 8x on a GPU within a < 1% error margin of the full precision baseline, outperforming traditional 8-bit quantized inference by over 1.5x-2x at the same error tolerance.
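To make the bitlayer decomposition concrete, the following minimal numpy sketch (our illustration, not the authors' GPU kernel; the function name and the symmetric fixed-point quantization scheme are assumptions) approximates a full-precision matrix-vector product by quantizing the weights to num_bits signed fixed point, peeling the quantized matrix into {0,1} bitlayers, and accumulating each bitlayer's product with the full-precision activation, weighted by its power of two:

    import numpy as np

    def precision_batched_matvec(W, x, num_bits=8):
        # Symmetric fixed-point quantization of the weights.
        scale = np.abs(W).max() / (2 ** (num_bits - 1) - 1)
        q = np.clip(np.round(W / scale),
                    -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1)
        # Shift into the unsigned range [0, 2^num_bits) so every
        # bitlayer is a plain {0,1} matrix (two's-complement view).
        u = (q + 2 ** (num_bits - 1)).astype(np.int64)

        acc = np.zeros(W.shape[0])
        for k in range(num_bits):
            bitlayer = (u >> k) & 1           # one 1-bit weight layer
            acc += (bitlayer @ x) * (2 ** k)  # 1-bit layer against full-precision x
        # Undo the two's-complement shift applied above.
        acc -= (2 ** (num_bits - 1)) * x.sum()
        return scale * acc

    # Usage: error shrinks as num_bits grows.
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128))
    x = rng.standard_normal(128)
    print(np.abs(precision_batched_matvec(W, x) - W @ x).max())

Truncating the loop to only the most significant bitlayers is the runtime accuracy/speedup knob the abstract describes; on actual hardware, each per-bitlayer product would map onto fast 1-bit kernels rather than the dense integer matmul used in this sketch.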