AI Model Efficiency Toolkit (AIMET)

pruning, quantization, network-compression, automl, deep-neural-networks, network-quantization, model-efficiency, open-source.

Open-sourcing our AI Model Efficiency Toolkit

Qualcomm Innovation Center (QuIC) is at the forefront of enabling low-power inference at the edge through its pioneering model-efficiency research. QuIC's mission includes helping the ecosystem migrate toward fixed-point inference. To that end, QuIC presents the AI Model Efficiency Toolkit (AIMET): a library that provides advanced quantization and compression techniques for trained neural network models. AIMET enables neural networks to run more efficiently on fixed-point AI hardware accelerators.

Why AI Model Efficiency Toolkit?

Performance:

Quantized inference is significantly faster than floating-point inference. For example, models that we’ve run on the Qualcomm® Hexagon™ DSP rather than on the Qualcomm® Kryo™ CPU have achieved a 5x to 15x speedup. In addition, an 8-bit model has a 4x smaller memory footprint than a 32-bit model. However, quantizing a machine learning model (e.g., from 32-bit floating point to 8-bit fixed point) often sacrifices model accuracy. AIMET addresses this problem through novel techniques such as data-free quantization, which provides state-of-the-art INT8 results, as shown in the Data-Free Quantization paper (ICCV’19).

Scalability:

Manual optimization of a neural network for improved efficiency is costly, time-consuming, and not scalable with ever-increasing AI workloads. AIMET solves this by providing a library that plugs directly into the TensorFlow and PyTorch training frameworks for ease of use, allowing developers to call APIs directly from their existing pipelines.
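
For instance, dropping AIMET's quantization simulation into an existing PyTorch pipeline can look roughly like the sketch below. It assumes the PyTorch variant of AIMET (aimet_torch) with a QuantizationSimModel class and compute_encodings method as released; exact argument names may differ between AIMET versions, and model, val_loader, and evaluate() stand in for the developer's existing model, data loader, and evaluation function.

```python
import torch
from aimet_torch.quantsim import QuantizationSimModel  # AIMET PyTorch quantization simulation

# `model`, `val_loader`, and `evaluate(model, loader)` come from the existing pipeline.
model = model.eval()

# Wrap the trained FP32 model with simulated 8-bit quantization ops.
dummy_input = torch.randn(1, 3, 224, 224)
sim = QuantizationSimModel(model, dummy_input=dummy_input,
                           default_param_bw=8,    # 8-bit weights
                           default_output_bw=8)   # 8-bit activations

# Calibrate quantization encodings by running a few batches through the model.
def forward_pass(sim_model, _):
    with torch.no_grad():
        for i, (images, _) in enumerate(val_loader):
            sim_model(images)
            if i >= 10:  # a handful of calibration batches is typically enough
                break

sim.compute_encodings(forward_pass, forward_pass_callback_args=None)

# The quantization-simulation model is a regular nn.Module, so the existing
# evaluation code can be reused unchanged.
int8_accuracy = evaluate(sim.model, val_loader)
```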

PROCESS

How does it work?

Features

  • Quantization
      • Cross-Layer Equalization: equalizes weight tensors to reduce amplitude variation across channels
      • Bias Correction: corrects the shift in layer outputs introduced by quantization
      • Quantization Simulation: simulates on-target quantized inference accuracy
      • Fine-tuning: uses the quantization simulation to train the model further and improve accuracy (see the sketch after this list)
  • Compression
      • Spatial SVD: tensor-decomposition technique that splits a large layer into two smaller ones
      • Channel Pruning: removes redundant input channels from a layer and reconstructs the layer weights
      • Automatic selection of per-layer compression ratios: automatically selects how much to compress each layer in the model
  • Visualization
      • Visualize weight ranges
      • Visualize per-layer sensitivity to compression
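
The quantization features compose into a short post-training and fine-tuning flow. The sketch below assumes aimet_torch's equalize_model (cross-layer equalization) and QuantizationSimModel as released, reusing a calibration callback like the one in the earlier sketch; train_loader and the training loop stand in for the developer's own pipeline, and module paths or argument names may vary across AIMET versions.

```python
import torch
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.quantsim import QuantizationSimModel

# 1. Cross-layer equalization: rescale weights in place so per-channel ranges
#    are more uniform and easier to quantize.
equalize_model(model, input_shapes=(1, 3, 224, 224))

# 2. Build a quantization simulation of the equalized model and calibrate it
#    (calibration_forward_pass is a callback like the one shown earlier).
sim = QuantizationSimModel(model, dummy_input=torch.randn(1, 3, 224, 224),
                           default_param_bw=8, default_output_bw=8)
sim.compute_encodings(calibration_forward_pass, forward_pass_callback_args=None)

# 3. Fine-tuning: train sim.model for a short while so the weights adapt to
#    the simulated quantization noise (standard PyTorch training loop).
optimizer = torch.optim.SGD(sim.model.parameters(), lr=1e-4, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
sim.model.train()
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(sim.model(images), labels)
    loss.backward()
    optimizer.step()
```
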
OUR DATA

What performance benefits can you expect?

Through a series of simple API calls, AIMET can quantize an existing 32-bit floating-point model to an 8-bit fixed-point model without sacrificing much accuracy and without model fine-tuning. As an example, the data-free quantization (DFQ) method applied to several popular networks, such as MobileNet-v2 and ResNet-50, results in less than a 0.9% loss in accuracy all the way down to 8-bit quantization, in an automated way and without any training data.

Model                           FP32 model   INT8 model with DFQ
MobileNet-v2 (top-1 accuracy)   71.72%       71.08%
ResNet-50 (top-1 accuracy)      76.05%       75.45%
DeepLabv3 (mIoU)                72.65%       71.91%

Data-free quantization enables INT8 inference with minimal loss in accuracy relative to the FP32 model.
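
A comparison like the one in the table above can be produced by evaluating the same validation set twice, once on the original FP32 model and once on the quantization-simulation model. The sketch below is generic PyTorch and assumes the sim object created in the earlier snippets, plus the pipeline's existing model and val_loader.

```python
import torch

def top1_accuracy(net, loader):
    """Standard top-1 accuracy over a validation loader."""
    net.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = net(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return 100.0 * correct / total

fp32_acc = top1_accuracy(model, val_loader)      # original 32-bit floating-point model
int8_acc = top1_accuracy(sim.model, val_loader)  # AIMET quantization-simulation model
print(f"FP32: {fp32_acc:.2f}%  INT8 (simulated): {int8_acc:.2f}%  "
      f"drop: {fp32_acc - int8_acc:.2f}%")
```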

Through a series of simple API calls, AIMET can also significantly compress models. For popular models such as ResNet-50 and ResNet-18, compression with spatial SVD plus channel pruning achieves a 50% MAC (multiply-accumulate) reduction while retaining accuracy within approximately 1% of the original uncompressed model.

Model                        Uncompressed FP32 model   Compressed model (50% MAC reduction with SSVD+CP)
ResNet-50 (top-1 accuracy)   76.05%                    75.75%
ResNet-18 (top-1 accuracy)   69.76%                    68.56%

AIMET compression techniques reduce MACs by 50% while retaining accuracy within approximately 1% of the original model.
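
A compression pass like the one behind these numbers is driven through AIMET's compression entry point. The sketch below is a hedged example of an automatic spatial-SVD pass targeting roughly 50% of the original MACs; it assumes the ModelCompressor.compress_model API and the spatial-SVD parameter classes as released (module paths and parameter names may differ across AIMET versions), with eval_fn standing in for the developer's accuracy callback. A channel-pruning pass is configured analogously with CompressionScheme.channel_pruning and its corresponding parameter class.

```python
from decimal import Decimal

from aimet_common.defs import CompressionScheme, CostMetric, GreedySelectionParameters
from aimet_torch.compress import ModelCompressor
from aimet_torch.defs import SpatialSvdParameters

# Target ~50% of the original MACs; let AIMET pick per-layer compression
# ratios automatically with its greedy selection algorithm.
greedy = GreedySelectionParameters(target_comp_ratio=Decimal('0.5'),
                                   num_comp_ratio_candidates=10)
auto = SpatialSvdParameters.AutoModeParams(greedy)
params = SpatialSvdParameters(mode=SpatialSvdParameters.Mode.auto, params=auto)

# eval_fn(model, iterations, use_cuda) -> accuracy, reused from the existing pipeline.
compressed_model, stats = ModelCompressor.compress_model(
    model,
    eval_callback=eval_fn,
    eval_iterations=10,
    input_shape=(1, 3, 224, 224),
    compress_scheme=CompressionScheme.spatial_svd,
    cost_metric=CostMetric.mac,   # compress against MAC count, as in the table above
    parameters=params)

print(stats)  # per-layer ratios and estimated compression statistics
```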