Quantization workflow¶
This document outlines a methodology for onboarding, quantizing, and deploying machine-learning models on Qualcomm® devices using the AI Model Efficiency Toolkit (AIMET).
Quantization features¶
The AIMET toolkit offers the following quantization features.
1. Quantization simulation¶
Quantization simulation (QuantSim) simulates quantized behavior using floating-point hardware. QuantSim efficiently enables various quantization options and helps you estimate the off-target quantized accuracy metric through a sequence of quantize and dequantize (QDQ) operations, without requiring actual quantized hardware.
A quantization simulation workflow is illustrated here:
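As a concrete starting point, the sketch below creates a QuantSim object for a PyTorch model, calibrates its encodings with representative data, and measures the simulated quantized accuracy. It assumes the aimet_torch QuantizationSimModel API; MyModel, calibration_loader, and evaluate are hypothetical placeholders for your own model, data, and evaluation function, and exact argument names may vary across AIMET versions.

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

model = MyModel().eval()                      # hypothetical: your trained FP32 model
dummy_input = torch.randn(1, 3, 224, 224)     # shape matching the model's input

# Insert quantize-dequantize (QDQ) ops around supported operations
sim = QuantizationSimModel(model,
                           dummy_input=dummy_input,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           default_param_bw=8,    # weight bit-width
                           default_output_bw=8)   # activation bit-width

# Calibrate quantization encodings by running representative data through the model
def pass_calibration_data(sim_model, _):
    with torch.no_grad():
        for images, _labels in calibration_loader:   # hypothetical calibration set
            sim_model(images)

sim.compute_encodings(pass_calibration_data, forward_pass_callback_args=None)

# Off-target quantized accuracy, using your existing evaluation pipeline
quantized_accuracy = evaluate(sim.model)             # hypothetical eval function
```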
2. Post-training quantization¶
Post-training quantization (PTQ) techniques make a model more quantization-friendly without requiring model retraining or fine-tuning. PTQ is recommended as a go-to tool in a quantization workflow because:
PTQ does not require a training pipeline
PTQ is efficient and easy to use
The PTQ workflow is illustrated here:
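As one example, cross-layer equalization (CLE) is a PTQ technique that needs no labels or training pipeline: it folds batch norms and rescales the weights of consecutive layers so they quantize with less error. The sketch below assumes the equalize_model entry point in aimet_torch.cross_layer_equalization and reuses the hypothetical helpers from the QuantSim sketch above; check your AIMET version's API reference for the exact signature.

```python
import torch
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.quantsim import QuantizationSimModel

model = MyModel().eval()                 # hypothetical: your trained FP32 model
input_shape = (1, 3, 224, 224)

# Apply batch-norm folding and cross-layer weight rescaling in place
equalize_model(model, input_shape)

# Re-create QuantSim on the equalized model and re-measure quantized accuracy
sim = QuantizationSimModel(model, dummy_input=torch.randn(*input_shape))
sim.compute_encodings(pass_calibration_data, forward_pass_callback_args=None)
equalized_accuracy = evaluate(sim.model)   # hypothetical eval function
```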
3. Quantization-aware training¶
Quantization-aware training (QAT) enables you to fine-tune a model with quantization operations (QDQ) inserted in the model graph. In effect, this makes the model parameters robust to quantization noise.
Compared to PTQ:
QAT requires a training pipeline and dataset,
QAT takes longer because it needs some fine-tuning,
QAT requires a hyperparameter search
but it can provide better accuracy, especially at lower bit-widths.
A typical QAT workflow is illustrated here:
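The sketch below shows only the fine-tuning step, assuming a QuantSim object (sim) has already been created and calibrated as in the QuantSim sketch above; train_loader, num_epochs, and the optimizer settings are placeholders you would tune per model.

```python
import torch

# Fine-tune the quantization-simulated model; gradients flow through the QDQ ops,
# nudging the weights toward values that are robust to quantization noise.
optimizer = torch.optim.SGD(sim.model.parameters(), lr=1e-5, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

sim.model.train()
for epoch in range(num_epochs):             # hypothetical: small number of epochs
    for images, labels in train_loader:     # hypothetical: your training data
        optimizer.zero_grad()
        loss = loss_fn(sim.model(images), labels)
        loss.backward()
        optimizer.step()

# Re-evaluate, then export the model and encodings once accuracy is acceptable
sim.export(path='./output', filename_prefix='model_qat', dummy_input=dummy_input)
```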
Supported precisions for on-target inference¶
Before applying quantization techniques, you need to identify the precisions supported for inference on your desired target runtimes. For weights and activations, supported precisions can be FP32, FP16, INT16, INT8, and INT4.
Some of the recent runtimes also support heterogeneous bit-width or mixed-precision, enabling sensitive operations to run at a higher precision within your model.
Supported precisions to run inference on target runtimes like Qualcomm® AI Engine Direct are:
| Precision format | Weights | Activations |
|---|---|---|
| Floating-point (No quantization) | FP16 | FP16 |
| Integer (quantized W8A16) | INT8 | INT16 |
| Integer (quantized W8A8) | INT8 | INT8 |
| Integer (quantized W4A8) | INT4 | INT8 |
Workflow¶
To decide which precision to use for inference on target runtimes, you can follow a top-down approach: begin with the highest precision (for example, FP16) and transition to lower precision if necessary, which may require additional engineering effort.
Provided that the off-target quantized accuracy using QuantSim is acceptable, the following on-target metrics should be considered, depending on your application:
Latency reduction and/or
Memory size reduction
If any of the above on-target metrics are not met for your use case, you should consider lowering the precision.
The figure below illustrates the recommended quantization workflow and the steps required to deploy the quantized model on the target device.
FP16 precision (No quantization)¶
Converting an FP32 model to FP16 precision without quantization is a recommended starting point. For more details on how to compile FP16 models for target runtimes, please refer to Qualcomm® AI Engine Direct documentation or Qualcomm® AI Hub documentation.
W16A16 verification¶
Before using a quantized integer format, it’s important to ensure that the FP32 model and the quantized model (QuantSim object) perform similarly during the forward pass, especially when custom quantizers are included in the model.
Set the bit-width to 16 bits for both weights and activations when creating the QuantSim. Then obtain the off-target quantized accuracy metric for the quantized model and verify that it aligns with the FP32 model. If it doesn’t, please report an issue to AIMET.
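A minimal sketch of this sanity check, again assuming the QuantizationSimModel interface and the hypothetical model, calibration callback, and evaluate helper used earlier; the tolerance is application-specific:

```python
from aimet_torch.quantsim import QuantizationSimModel

# QuantSim with 16-bit weights and activations, purely as a sanity check
sim_w16a16 = QuantizationSimModel(model,
                                  dummy_input=dummy_input,
                                  default_param_bw=16,    # weight bit-width
                                  default_output_bw=16)   # activation bit-width
sim_w16a16.compute_encodings(pass_calibration_data, forward_pass_callback_args=None)

fp32_accuracy = evaluate(model)                 # hypothetical eval function
w16a16_accuracy = evaluate(sim_w16a16.model)

# At 16 bits the two numbers should match closely; a large gap points to an issue
# in the model definition or custom quantizers rather than to quantization noise.
assert abs(fp32_accuracy - w16a16_accuracy) < 0.01   # hypothetical tolerance
```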
Apply PTQ or QAT at specified precision¶
If any of the metrics are not acceptable at higher precision, begin with weights at INT8 precision and activations at INT16 precision. In this step, before creating the QuantSim, ensure that the FP32 model adheres to model-specific guidelines. For instance, in PyTorch, QuantSim can only quantize math operations performed by torch.nn.Module objects, while torch.nn.functional calls are ignored and therefore left unquantized (see the sketch below). Please refer to the framework-specific pages to learn more about such model guidelines.
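For example, rewriting a functional activation as a module instance keeps it visible to QuantSim; the two model classes below are illustrative only:

```python
import torch

class BeforeGuideline(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3)

    def forward(self, x):
        # Functional call: QuantSim cannot attach a quantizer to this operation
        return torch.nn.functional.relu(self.conv(x))

class AfterGuideline(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3)
        self.relu = torch.nn.ReLU()   # module instance: its output can be quantized

    def forward(self, x):
        return self.relu(self.conv(x))
```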
If the off-target quantized accuracy metric does not meet expectations, you can use PTQ or QAT techniques to improve the quantized accuracy at the desired precision. The decision between PTQ and QAT should be based on the quantized accuracy and runtime needs.
Once the off-target quantized accuracy metric is satisfactory, proceed to evaluate the on-target metrics at this precision. If the on-target metrics still do not meet your requirements, consider further reducing the precision (for example, W8A8 or W4A8) and repeat the current step.
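Lowering the precision amounts to adjusting the bit-width arguments when creating the QuantSim, as sketched below with the same assumed interface and hypothetical model and dummy input; PTQ or QAT is then reapplied at each precision until the accuracy and on-target metrics are met.

```python
from aimet_torch.quantsim import QuantizationSimModel

# Weights at INT8, activations at INT16 (W8A16)
sim_w8a16 = QuantizationSimModel(model, dummy_input=dummy_input,
                                 default_param_bw=8, default_output_bw=16)

# If on-target metrics are still not met, drop to W8A8 and repeat the step
sim_w8a8 = QuantizationSimModel(model, dummy_input=dummy_input,
                                default_param_bw=8, default_output_bw=8)
```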
Deploy¶
Once the quantized accuracy and runtime requirements are achieved at the desired precision, the optimized model is ready for deployment on the target runtimes.