Qualcomm Innovation Center (QuIC) is at the forefront of enabling low-power inference at the edge through its pioneering model-efficiency research. QuIC's mission is to help migrate the ecosystem toward fixed-point inference. Toward this goal, QuIC presents the AI Model Efficiency Toolkit (AIMET), a library that provides advanced quantization and compression techniques for trained neural network models. AIMET enables neural networks to run more efficiently on fixed-point AI hardware accelerators.
Quantized inference is significantly faster than floating-point inference. For example, models that we’ve run on the Qualcomm® Hexagon™ DSP rather than on the Qualcomm® Kryo™ CPU have seen a 5x to 15x speedup. In addition, an 8-bit model has a 4x smaller memory footprint than a 32-bit model. However, quantizing a machine learning model (e.g., from 32-bit floating point to 8-bit fixed point) often sacrifices model accuracy. AIMET solves this problem through novel techniques like data-free quantization (DFQ), which provides state-of-the-art INT8 results, as shown in the Data-Free Quantization paper (ICCV’19).
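To make the FP32-to-INT8 conversion concrete, here is a minimal sketch of uniform affine quantization in plain Python. The `quantize` helper and its scale/zero-point bookkeeping are illustrative only, not AIMET's implementation.

```python
# Illustrative sketch of uniform affine quantization (FP32 -> INT8).
# Not AIMET code: AIMET applies this idea per-tensor/per-channel inside
# its quantization simulation.

def quantize(values, num_bits=8):
    """Map floats to unsigned fixed-point integers, and back again."""
    qmax = 2 ** num_bits - 1            # 255 for 8 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0     # step size between quantized levels
    zero_point = round(-lo / scale)     # integer that represents 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    dq = [(qi - zero_point) * scale for qi in q]   # dequantized approximation
    return q, dq

q, dq = quantize([-1.0, 0.0, 0.5, 1.0])
```

The round trip shows the accuracy trade-off the text describes: each 32-bit float is stored in a single byte, at the cost of a small reconstruction error bounded by the step size.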
Manual optimization of a neural network for improved efficiency is costly, time-consuming, and does not scale with ever-increasing AI workloads. AIMET solves this by providing a library that plugs directly into the TensorFlow and PyTorch training frameworks for ease of use, allowing developers to call its APIs directly from their existing pipelines.
Through a series of simple API calls, AIMET can quantize an existing 32-bit floating-point model to an 8-bit fixed-point model without sacrificing much accuracy and without model fine-tuning. For example, the DFQ method applied to several popular networks, such as MobileNet-v2 and ResNet-50, results in less than 0.9% loss in top-1 accuracy all the way down to 8-bit quantization, in an automated way and without any training data.
| Model | FP32 model | INT8 model with DFQ |
|---|---|---|
| MobileNet-v2 (top-1 accuracy) | 71.72% | 71.08% |
| ResNet-50 (top-1 accuracy) | 76.05% | 75.45% |

*Data-free quantization enables INT8 inference with minimal loss in accuracy relative to the FP32 model.*
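The core idea behind DFQ's cross-layer weight equalization can be sketched in a few lines: rescale each output channel of one layer by 1/s and the matching input channel of the next layer by s. Because ReLU is positive-homogeneous, the network's function is unchanged, while per-channel weight ranges become balanced and therefore quantize well with a single shared scale. The nested-list weights and the `equalize_pair` helper below are an illustrative sketch, not AIMET's API.

```python
import math

# Sketch of cross-layer weight equalization from the DFQ paper.
# w1[i] is the weight row producing output channel i of layer 1;
# column i of w2 consumes that channel in layer 2.

def equalize_pair(w1, w2):
    num_channels = len(w1)
    s = []
    for i in range(num_channels):
        r1 = max(abs(v) for v in w1[i])       # range of channel i in layer 1
        r2 = max(abs(row[i]) for row in w2)   # range of channel i in layer 2
        s.append(math.sqrt(r1 * r2) / r2)     # per-channel scale factor
    # Scale layer 1 down by s and layer 2 up by s; ReLU(z / s) * s == ReLU(z)
    # for s > 0, so the composed function is preserved.
    w1_eq = [[v / s[i] for v in w1[i]] for i in range(num_channels)]
    w2_eq = [[row[i] * s[i] for i in range(num_channels)] for row in w2]
    return w1_eq, w2_eq

# Hypothetical weights with badly mismatched per-channel ranges (8.0 vs 0.5):
w1 = [[8.0, 4.0], [0.5, 0.25]]
w2 = [[0.5, 8.0], [0.25, 4.0]]
w1_eq, w2_eq = equalize_pair(w1, w2)
```

After equalization both channels span the same range (the geometric mean of the original ranges), so neither dominates the shared quantization grid.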
Through a series of simple API calls, AIMET can also significantly compress models. For popular models such as ResNet-50 and ResNet-18, compression with spatial SVD plus channel pruning (SSVD+CP) achieves a 50% MAC (multiply-accumulate) reduction while retaining accuracy within approximately 1% of the original, uncompressed model.
| Model (FP32) | Uncompressed model | Compressed model (50% MAC reduction with SSVD+CP) |
|---|---|---|
| ResNet-50 (top-1 accuracy) | 76.05% | 75.75% |
| ResNet-18 (top-1 accuracy) | 69.76% | 68.56% |

*AIMET compression techniques reduce MACs by 50% while retaining accuracy within approximately 1% of the original model.*
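A back-of-envelope calculation shows where a 50% MAC reduction can come from: spatial SVD splits a k x k convolution into a k x 1 convolution followed by a 1 x k convolution through a rank-r bottleneck, and the rank controls the trade-off between compression and accuracy. The helper functions and layer sizes below are illustrative arithmetic, not AIMET code.

```python
# Back-of-envelope MAC arithmetic for spatial SVD (illustrative only).

def conv_macs(h, w, cin, cout, kh, kw):
    """Multiply-accumulates for one conv layer producing an h x w output map."""
    return h * w * cin * cout * kh * kw

def spatial_svd_macs(h, w, cin, cout, k, rank):
    """MACs after splitting a k x k conv into a (k x 1) then (1 x k) pair."""
    return conv_macs(h, w, cin, rank, k, 1) + conv_macs(h, w, rank, cout, 1, k)

# Hypothetical layer: 3x3 conv, 256 in / 256 out channels, 14x14 output map.
orig = conv_macs(14, 14, 256, 256, 3, 3)
halved = spatial_svd_macs(14, 14, 256, 256, 3, rank=192)
ratio = halved / orig
```

Solving k·r·(cin + cout) = ½·k²·cin·cout for r gives rank 192 for this layer, which cuts its MACs exactly in half; channel pruning then removes whole channels on top of this factorization.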
- Data-Free Quantization Through Weight Equalization and Bias Correction
- Up or Down? Adaptive Rounding for Post-Training Quantization
- Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks