This notebook contains a working example of AIMET adaptive rounding (AdaRound).
AIMET quantization features typically use the “nearest rounding” technique, in which each weight value is quantized to the nearest integer value.
AdaRound instead optimizes a loss function using unlabeled training data to decide whether to quantize a specific weight to the closer or the farther integer value. With AdaRound, quantized accuracy is closer to that of the FP32 model than with nearest rounding.
Instantiate the example evaluation and training pipeline
Load the FP32 model and evaluate the model to find the baseline FP32 accuracy
Create a quantization simulation model (with fake quantization ops) and evaluate the quantized simulation model
Apply AdaRound and evaluate the simulation model to get a post-finetuned quantized accuracy score
Note
This notebook does not show state-of-the-art results. For example, it uses a relatively quantization-friendly model (ResNet-18). Also, optimization parameters such as the number of fine-tuning epochs are chosen to speed up execution of the notebook.
This example does image classification on the ImageNet dataset. If you already have a version of the dataset, use that. Otherwise, download the dataset, for example from https://image-net.org/challenges/LSVRC/2012/index .
Note
To speed up the execution of this notebook, you can use a reduced subset of the ImageNet dataset. For example, the entire ILSVRC2012 dataset has 1000 classes, 1000 training samples per class, and 50 validation samples per class. However, for the purpose of running this notebook, you can reduce the dataset to, say, two samples per class.
Edit the cell below to specify the directory where the downloaded ImageNet dataset is saved.
[ ]:
DATASET_DIR = '/path/to/dataset/' # Replace this path with a real directory
1. Instantiate the example training and validation pipeline¶
Use the following training and validation loop for the image classification task.
Things to note:
AIMET does not put limitations on how the evaluation pipeline is written. AIMET creates an onnxruntime.InferenceSession for the quantized model, which can be run like a regular InferenceSession; sim.session can be used in place of any other InferenceSession when doing inference or evaluation.
2. Convert an FP32 PyTorch model to ONNX, simplify & then evaluate baseline FP32 accuracy¶
2.1 Export a pretrained resnet18 model to onnx
You can load any pretrained PyTorch model instead.
[ ]:
import torch
import onnx
from torchvision.models import resnet18

input_shape = (1, 3, 224, 224)    # Shape for each ImageNet sample is (3 channels) x (224 height) x (224 width)
dummy_input = torch.randn(input_shape)
filename = "./resnet18.onnx"

# Load a pretrained ResNet-18 model in torch
pt_model = resnet18(pretrained=True)

# Export the torch model to onnx
torch.onnx.export(pt_model.eval(),
                  dummy_input,
                  filename,
                  input_names=['input'],
                  output_names=['output'],
                  dynamic_axes={
                      'input':  {0: 'batch_size'},
                      'output': {0: 'batch_size'},
                  })

model = onnx.load_model(filename)
2.2 (Optional) Simplify the onnx model
It is recommended to simplify the model before using AIMET as it can improve quantized accuracy and runtime performance.
[ ]:
from onnxsim import simplify
model, _ = simplify(model)
2.3 Decide whether to place the model on a CPU or CUDA device
This example uses CUDA if it is available. You can change this logic and force a device placement if needed.
[ ]:
import onnxruntime as ort

# Use CUDA if available; pin cudnn_conv_algo_search to DEFAULT so that
# accuracies/outputs do not change between inference runs
if 'CUDAExecutionProvider' in ort.get_available_providers():
    providers = [('CUDAExecutionProvider', {'cudnn_conv_algo_search': 'DEFAULT'}), 'CPUExecutionProvider']
else:
    providers = ['CPUExecutionProvider']
2.4 Create an InferenceSession and determine the model’s FP32 accuracy
3. Create a quantization simulation model and determine quantized accuracy¶
3.1 Fold BatchNormalization layers
Before calculating the simulated quantized accuracy using QuantizationSimModel, fold the BatchNormalization (BN) layers into adjacent Convolutional layers. The BN layers that cannot be folded are left as they are.
On quantized runtimes, BN layers are typically folded to improve inference performance, which can degrade accuracy. Folding the BN layers before simulation reproduces this on-target drop in accuracy.
Use the following code to call AIMET to fold the BN layers in-place on the given model:
[ ]:
from aimet_onnx.batch_norm_fold import fold_all_batch_norms_to_weight
fold_all_batch_norms_to_weight(model)
3.2 Create a QuantizationSimModel
In this step, AIMET inserts fake quantization ops in the model graph and configures them.
Key parameters:
Setting activation_type to int8 quantizes all activations in the model to 8-bit integer precision
Setting param_type to int8 quantizes all parameters in the model to 8-bit integer precision
[ ]:
import copy
import aimet_onnx
from aimet_common.defs import QuantScheme
from aimet_onnx.quantsim import QuantizationSimModel

sim = QuantizationSimModel(model=copy.deepcopy(model),
                           quant_scheme=QuantScheme.min_max,
                           param_type=aimet_onnx.int8,
                           activation_type=aimet_onnx.int8,
                           providers=providers)
AIMET has added quantizer nodes to the model graph, but before the sim model can be used for inference or training, scale and offset quantization parameters must be calculated for each quantizer node by passing unlabeled data samples through the model to collect range statistics. This process is sometimes referred to as calibration. AIMET refers to it as “computing encodings”.
3.3 Pass unlabeled data samples through the model
The following code is one way to get unlabeled samples for calibration. It uses the existing PyTorch training or validation data loader and converts samples to an onnxruntime-compatible format.
Only a very small percentage of the data samples is needed. For example, the training dataset for ImageNet has 1M samples; 500 or 1000 suffice to compute encodings.
The samples should be reasonably well distributed. While it is not necessary to cover all classes, avoid extreme scenarios, such as using only dark or only light samples; using only pictures captured at night, for example, could skew the results.
3.4 Evaluate the quantized model
You can pass sim.session to the eval function to evaluate the quantsim model.
[ ]:
# Evaluate the pre-adaround model
accuracy = evaluate(sim.session)
print(f"Pre-adaround sim accuracy {accuracy}")
4. Apply AdaRound¶
4.1 Apply AdaRound to the sim model
Key parameters:
inputs: a collection (e.g., List[Dict[str, np.ndarray]]) of InferenceSession inputs for the model. AdaRound uses these data samples to learn the rounding vectors.
iterations: the number of optimization iterations to run for each layer. The default value is 10000, and we strongly recommend using at least this number. This example uses 32 to speed up execution.
[ ]:
# Apply adaround to the model weights
aimet_onnx.apply_adaround(sim, onnx_data, iterations=32)
4.2 Recompute activation encodings
Because AdaRounded weights may change the distribution of activations in the model, it is recommended to recompute the activation encodings after applying AdaRound.
[ ]:
# Recompute activation encodings (weight encodings are frozen)
sim.compute_encodings(onnx_data)
4.3 Evaluate the optimized sim
[ ]:
# Evaluate the ada-rounded model
accuracy = evaluate(sim.session)
print(f"Post-adaround sim accuracy: {accuracy}")
There might be little gain in accuracy after this limited application of AdaRound. Experiment with the hyper-parameters to get better results.