Qualcomm Innovation Center (QuIC) is at the forefront of enabling low-power inference at the edge through its pioneering model-efficiency research. QuIC's mission is to help migrate the ecosystem toward fixed-point inference. Toward that goal, QuIC presents the AI Model Efficiency Toolkit (AIMET), a library that provides advanced quantization and compression techniques for trained neural network models. AIMET enables neural networks to run more efficiently on fixed-point AI hardware accelerators.
Quantized inference is significantly faster than floating-point inference. For example, models that we have run on the Qualcomm® Hexagon™ DSP rather than on the Qualcomm® Kryo™ CPU have shown a 5x to 15x speedup. In addition, an 8-bit model has a 4x smaller memory footprint than its 32-bit counterpart. However, quantizing a machine learning model (e.g., from 32-bit floating point to 8-bit fixed point) often sacrifices model accuracy. AIMET solves this problem through novel techniques such as data-free quantization (DFQ), which provides state-of-the-art INT8 results, as shown in the Data-Free Quantization paper (ICCV'19).
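To make that tradeoff concrete, the following is a minimal, library-agnostic sketch (plain PyTorch, not AIMET code; the function name and tensor sizes are illustrative) that round-trips a tensor through asymmetric 8-bit quantization: the 8-bit codes occupy a quarter of the memory of the FP32 values, and the rounding and clamping steps are where accuracy can be lost.

```python
import torch

def quantize_dequantize_uint8(x: torch.Tensor) -> torch.Tensor:
    """Round-trip a float tensor through asymmetric 8-bit (uint8) quantization."""
    qmin, qmax = 0, 255
    scale = (x.max() - x.min()) / (qmax - qmin)        # real-valued step per integer level
    zero_point = torch.round(qmin - x.min() / scale)   # integer code that represents 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)  # 8-bit codes: 4x smaller than FP32
    return (q - zero_point) * scale                    # dequantized values carry the rounding error

weights = torch.randn(10_000)                          # stand-in for a layer's FP32 weights
recovered = quantize_dequantize_uint8(weights)
print("mean absolute rounding error:", (weights - recovered).abs().mean().item())
```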
Manual optimization of a neural network for improved efficiency is costly, time-consuming, and not scalable with ever-increasing AI workloads. AIMET solves this by providing a library that plugs directly into the TensorFlow and PyTorch training frameworks for ease of use, allowing developers to call APIs directly from their existing pipelines.
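For instance, a quantization pass can look roughly like the sketch below, which uses AIMET's PyTorch quantization-simulation API. The module path and argument names follow recent AIMET releases and may differ in other versions, and the random calibration batches stand in for the user's own data loader.

```python
import os
import torch
from torchvision import models

# AIMET's PyTorch quantization-simulation API (module path may vary by AIMET release)
from aimet_torch.quantsim import QuantizationSimModel

model = models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Wrap the trained FP32 model with simulated 8-bit quantization ops
sim = QuantizationSimModel(model,
                           dummy_input=dummy_input,
                           default_param_bw=8,    # weight bit-width
                           default_output_bw=8)   # activation bit-width

# Calibrate quantization ranges by running a few representative batches
# (random tensors here stand in for real calibration samples)
def pass_calibration_data(sim_model, _):
    with torch.no_grad():
        for _ in range(4):
            sim_model(torch.randn(8, 3, 224, 224))

sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=None)

# sim.model can now be evaluated with the existing pipeline; export writes the
# quantized model and its encodings for the target runtime
os.makedirs('./output', exist_ok=True)
sim.export(path='./output', filename_prefix='resnet50_int8', dummy_input=dummy_input)
```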
Through a series of simple API calls, AIMET can quantize an existing 32-bit floating-point model to an 8-bit fixed-point model without sacrificing much accuracy and without model fine-tuning. For example, the DFQ method applied to several popular networks, such as MobileNet-v2 and ResNet-50, results in less than 0.9% loss in accuracy all the way down to 8-bit quantization, in an automated way and without any training data.
Model | FP32 model | INT8 model with DFQ |
---|---|---|
MobileNet-v2 (top-1 accuracy) | 71.72% | 71.08% |
ResNet-50 (top-1 accuracy) | 76.05% | 75.45% |
DeepLabv3 (mIoU) | 72.65% | 71.91% |

Data-free quantization enables INT8 inference with minimal loss in accuracy relative to the FP32 model.
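Under the hood, DFQ combines cross-layer (weight-range) equalization with bias correction, neither of which needs labeled training data. A minimal sketch of the equalization step is shown below; it assumes the aimet_torch module layout of recent releases (bias correction is exposed through a separate API), and the model and input shape are illustrative.

```python
import torch
from torchvision import models

# Cross-layer equalization entry point (module path may vary by AIMET release)
from aimet_torch.cross_layer_equalization import equalize_model

model = models.mobilenet_v2(pretrained=True).eval()

# Folds batch-norm layers, equalizes weight ranges across consecutive layers, and
# absorbs high biases, all without any training data, so that 8-bit quantization
# (e.g., via QuantizationSimModel above) loses far less accuracy.
equalize_model(model, input_shapes=(1, 3, 224, 224))
```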
Through a series of simple API calls, AIMET can also significantly compress models. For popular models, such as ResNet-50 and ResNet-18, compression with spatial SVD plus channel pruning achieves a 50% MAC (multiply-accumulate) reduction while retaining accuracy within approximately 1% of the original uncompressed model.
Model (FP32) | Uncompressed model | Compressed model (50% MAC reduction with spatial SVD + channel pruning) |
---|---|---|
ResNet-50 (top-1 accuracy) | 76.05% | 75.75% |
ResNet-18 (top-1 accuracy) | 69.76% | 68.56% |

AIMET compression techniques reduce MACs by 50% while retaining accuracy within approximately 1% of the original model.
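For intuition on the spatial SVD part of this scheme, the sketch below factorizes a single k×k convolution into a k×1 convolution followed by a 1×k convolution via a truncated SVD of its kernel, which is what reduces the MAC count. This is plain PyTorch illustrating the technique, not AIMET's compression API; the function name and chosen rank are illustrative.

```python
import torch
import torch.nn as nn

def spatial_svd_conv(conv: nn.Conv2d, rank: int) -> nn.Sequential:
    """Approximate a kh x kw conv with a (kh x 1) conv followed by a (1 x kw) conv."""
    out_c, in_c, kh, kw = conv.weight.shape
    # Rearrange the 4-D kernel into a 2-D matrix: rows index (in_channel, kernel_row),
    # columns index (out_channel, kernel_col), then take a rank-truncated SVD.
    w = conv.weight.detach().permute(1, 2, 0, 3).reshape(in_c * kh, out_c * kw)
    u, s, vt = torch.linalg.svd(w, full_matrices=False)
    u = u[:, :rank] * s[:rank].sqrt()                # (in_c*kh, rank)
    vt = s[:rank].sqrt().unsqueeze(1) * vt[:rank]    # (rank, out_c*kw)

    # Vertical (kh x 1) conv: in_c channels -> rank channels
    conv_v = nn.Conv2d(in_c, rank, (kh, 1), stride=(conv.stride[0], 1),
                       padding=(conv.padding[0], 0), bias=False)
    conv_v.weight.data = u.reshape(in_c, kh, rank).permute(2, 0, 1).unsqueeze(-1)

    # Horizontal (1 x kw) conv: rank channels -> out_c channels
    conv_h = nn.Conv2d(rank, out_c, (1, kw), stride=(1, conv.stride[1]),
                       padding=(0, conv.padding[1]), bias=conv.bias is not None)
    conv_h.weight.data = vt.reshape(rank, out_c, kw).permute(1, 0, 2).unsqueeze(2)
    if conv.bias is not None:
        conv_h.bias.data = conv.bias.detach().clone()
    return nn.Sequential(conv_v, conv_h)

# A 3x3 conv (64 -> 128 channels) costs 64*128*9 MACs per output position;
# at rank 32 the factorized pair costs 64*32*3 + 32*128*3, roughly a 4x reduction.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
factored = spatial_svd_conv(conv, rank=32)
x = torch.randn(1, 64, 56, 56)
print(conv(x).shape, factored(x).shape)  # same output shape, fewer MACs
```

In AIMET's compression workflow, the per-layer ranks (compression ratios) are selected automatically against a target such as the 50% MAC reduction above, rather than hand-picked as in this sketch.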
Data-Free Quantization Through Weight Equalization and Bias Correction.
Up or Down? Adaptive Rounding for Post-Training Quantization.
Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks.