Automatic quantization

Context

The AIMET toolkit offers a suite of neural network post-training quantization (PTQ) techniques. Applying these techniques in a specific sequence often results in better quantized accuracy and performance.

The Automatic quantization (AutoQuant) feature analyzes your pre-trained model, determines the best sequence of AIMET PTQ techniques, and applies those techniques. In the AutoQuant API, you specify the maximum accuracy drop you can tolerate; as soon as the quantized accuracy reaches this threshold, AutoQuant stops applying PTQ techniques.

Without the AutoQuant feature, you must manually try combinations of AIMET quantization techniques. This manual process is error-prone and time-consuming.

Workflow

The workflow looks like this:

(Figure: AutoQuant workflow flowchart, auto_quant_v2_flowchart.png)

Before entering the optimization workflow, AutoQuant prepares by:

  1. Checking the validity of the model and converting the model into an AIMET quantization-friendly format (Prepare Model).

  2. Selecting the best-performing quantization scheme for the given model (QuantScheme Selection).

After the preparation steps, AutoQuant proceeds to try the following techniques:

  1. BatchNorm folding

  2. Cross-layer equalization (CLE)

  3. Adaptive rounding (Adaround) (if enabled)

  4. Automatic Mixed Precision (AMP) (if enabled)

These techniques are applied in a best-effort manner until the model meets the allowed accuracy drop. If AutoQuant cannot satisfy the evaluation goal, it returns the model that produced the best results, as sketched below.
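The control flow can be summarized in pseudocode. The following is an illustrative sketch of the documented behavior, not AIMET's actual implementation; the technique names are placeholders for the steps listed above:

# Illustrative sketch only; technique names are placeholders, not AIMET APIs
def auto_quant_sketch(model, eval_callback, fp32_accuracy, allowed_accuracy_drop):
    target_accuracy = fp32_accuracy - allowed_accuracy_drop
    best_model, best_accuracy = model, float('-inf')
    for apply_technique in (batchnorm_folding, cross_layer_equalization,
                            adaround, automatic_mixed_precision):
        model = apply_technique(model)       # techniques are applied cumulatively
        accuracy = eval_callback(model)      # evaluate the quantized candidate
        if accuracy > best_accuracy:
            best_model, best_accuracy = model, accuracy
        if best_accuracy >= target_accuracy:
            break                            # accuracy goal met; stop early
    # Best-effort: if the goal was never met, return the best candidate seen
    return best_model, best_accuracy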

Code example

Step 1

Load the model for automatic quantization. In this PyTorch example, we use MobileNetV2.


import math

import torch
from datasets import load_dataset
from torchvision import transforms
from torchvision.models import MobileNet_V2_Weights, mobilenet_v2

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval().to(device)

Load the model for automatic quantization. In this TensorFlow example, we use Keras MobileNetV2.

import tensorflow as tf
from aimet_common.defs import QuantizationDataType, QuantScheme
from aimet_common.quantsim_config.utils import get_path_for_per_channel_config
from aimet_tensorflow.keras.adaround_weight import AdaroundParameters
from aimet_tensorflow.keras.auto_quant_v2 import AutoQuantWithAutoMixedPrecision
from tensorflow.keras import applications, losses, metrics, preprocessing
from tensorflow.keras.applications import mobilenet_v2

model = applications.MobileNetV2()

Load the model for automatic quantization. In this ONNX example, we convert PyTorch MobileNetV2 to ONNX and use it in the subsequent code.

import math
import os
from typing import Optional

import numpy as np
import onnx
import onnxruntime as ort
import onnxsim
import torch
from aimet_common.defs import QuantizationDataType
from aimet_common.quantsim_config.utils import get_path_for_per_channel_config
from aimet_onnx.adaround.adaround_weight import AdaroundParameters
from aimet_onnx.auto_quant_v2 import AutoQuantWithAutoMixedPrecision
from aimet_onnx.defs import DataLoader
from datasets import load_dataset
from torchvision import transforms
from torchvision.models import MobileNet_V2_Weights, mobilenet_v2
from tqdm import tqdm

pt_model = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT)
print(pt_model)

# Shape for each ImageNet sample is (3 channels) x (224 height) x (224 width)
input_shape = (1, 3, 224, 224)
dummy_input = torch.randn(input_shape)

# Modify file_path as needed; a temporary directory is used here
file_path = os.path.join('/tmp', 'mobilenet_v2.onnx')
torch.onnx.export(
    pt_model,
    (dummy_input,),
    file_path,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'},
    },
)
# Load exported ONNX model
model = onnx.load_model(file_path)

Step 2

Prepare dataset and dataloader (PyTorch)

dataset = load_dataset(
    'ILSVRC/imagenet-1k',
    split='validation',
)

preprocess = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

# Use a distinct name so the torchvision `transforms` module is not shadowed
def preprocess_examples(examples):
    examples['image'] = [
        preprocess(image.convert('RGB')) for image in examples['image']
    ]
    return examples


dataset.set_transform(preprocess_examples)

BATCH_SIZE = 1
EVAL_DATASET_SIZE = 64
CALIBRATION_DATASET_SIZE = 32

class CustomDataLoader:
    def __init__(self, data, batch_size: int, iterations: int, unlabeled: bool = True):
        self._data = data
        self.batch_size = batch_size
        self.iterations = iterations
        self._unlabeled = unlabeled
        self._batch_index = 0

    def __iter__(self):
        self._batch_index = 0
        return self

    def __len__(self):
        return self.iterations

    def __next__(self):
        if self._batch_index >= self.iterations:
            raise StopIteration
        start_idx = self._batch_index * self.batch_size
        end_idx = start_idx + self.batch_size
        self._batch_index += 1

        batch = self._data[start_idx:end_idx]
        images = torch.stack(batch['image']).to(device)
        if self._unlabeled:
            return images
        return images, torch.tensor(batch['label'])


unlabeled_data_loader = CustomDataLoader(
    dataset, BATCH_SIZE, math.ceil(CALIBRATION_DATASET_SIZE / BATCH_SIZE)
)
dummy_input = torch.randn(1, 3, 224, 224).to(device)

Prepare dataset (TensorFlow)

BATCH_SIZE = 32
imagenet_dataset = preprocessing.image_dataset_from_directory(
    directory='<your_imagenet_validation_data_path>',
    label_mode='categorical',
    image_size=(224, 224),
    batch_size=BATCH_SIZE,
    shuffle=True,
)

imagenet_dataset = imagenet_dataset.map(
    lambda x, y: (mobilenet_v2.preprocess_input(x), y)
)

NUM_CALIBRATION_SAMPLES = 2048
unlabeled_dataset = imagenet_dataset.take(NUM_CALIBRATION_SAMPLES // BATCH_SIZE).map(
    lambda x, _: x
)
eval_dataset = imagenet_dataset.skip(NUM_CALIBRATION_SAMPLES // BATCH_SIZE)

Prepare model and dataloader (ONNX)

try:
    model, _ = onnxsim.simplify(model)
except Exception:
    print('ONNX Simplifier failed. Proceeding with unsimplified model')


dataset = load_dataset(
    'ILSVRC/imagenet-1k',
    split='validation',
).shuffle()


class CustomDataLoader(DataLoader):
    def __init__(
        self,
        data: np.ndarray,
        batch_size: int,
        iterations: int,
        unlabeled: bool = True,
    ):
        super().__init__(data, batch_size, iterations)
        self._current_iteration = 0
        self._unlabeled = unlabeled

    def __iter__(self):
        self._current_iteration = 0
        return self

    def __next__(self):
        if self._current_iteration < self.iterations:
            start = self._current_iteration * self.batch_size
            end = start + self.batch_size
            self._current_iteration += 1

            batch_data = self._data[start:end]
            if self._unlabeled:
                return np.stack(batch_data['image'])
            else:
                return np.stack(batch_data['image']), np.stack(batch_data['label'])
        else:
            raise StopIteration


preprocess = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)


# Use a distinct name so the torchvision `transforms` module is not shadowed
def preprocess_examples(examples):
    examples['image'] = [
        preprocess(image.convert('RGB')) for image in examples['image']
    ]
    return examples


dataset.set_transform(preprocess_examples)

BATCH_SIZE = 32
NUM_CALIBRATION_SAMPLES = 2048
NUM_EVAL_SAMPLES = 50000
unlabeled_data_loader = CustomDataLoader(
    dataset, BATCH_SIZE, math.ceil(NUM_CALIBRATION_SAMPLES / BATCH_SIZE)
)
eval_data_loader = CustomDataLoader(
    dataset, BATCH_SIZE, math.ceil(NUM_EVAL_SAMPLES / BATCH_SIZE), unlabeled=False
)

Step 3

Prepare eval callback (PyTorch)

In actual use cases, implement this part to serve your own goals while maintaining the function signature.

from typing import Optional

def eval_callback(model: torch.nn.Module, num_of_samples: Optional[int] = None) -> float:
    data_loader = CustomDataLoader(
        dataset, BATCH_SIZE, math.ceil(EVAL_DATASET_SIZE / BATCH_SIZE), unlabeled=False
    )
    if num_of_samples:
        iterations = math.ceil(num_of_samples / data_loader.batch_size)
    else:
        iterations = len(data_loader)

    correct_predictions = 0
    total_samples = 0
    with torch.no_grad():
        for batch_index, (images, labels) in enumerate(data_loader):
            if batch_index >= iterations:
                break
            pred_probs = model(images)
            pred_labels = torch.argmax(pred_probs, dim=1)
            correct_predictions += int(torch.sum(pred_labels.cpu() == labels))
            total_samples += labels.numel()
    return correct_predictions / total_samples

Prepare eval callback (TensorFlow)

def eval_callback(model: tf.keras.Model, _) -> float:
    # Model should be compiled before evaluation
    model.compile(
        loss=losses.CategoricalCrossentropy(), metrics=[metrics.CategoricalAccuracy()]
    )
    _, acc = model.evaluate(eval_dataset)

    return acc

Prepare eval callback (ONNX)

def eval_callback(
    session: ort.InferenceSession, num_of_samples: Optional[int] = None
) -> float:
    correct_predictions = 0
    total_samples = 0
    input_name = session.get_inputs()[0].name
    for inputs, labels in tqdm(eval_data_loader):
        pred_probs, *_ = session.run(None, {input_name: inputs})
        pred_labels = np.argmax(pred_probs, axis=1)
        correct_predictions += np.sum(pred_labels == labels)
        total_samples += labels.shape[0]
        # Stop early once the requested number of samples has been evaluated
        if num_of_samples is not None and total_samples >= num_of_samples:
            break

    accuracy = correct_predictions / total_samples
    return accuracy

Step 4

Create AutoQuant object (PyTorch)

from aimet_torch.auto_quant import AutoQuantWithAutoMixedPrecision

auto_quant = AutoQuantWithAutoMixedPrecision(
    model, dummy_input, unlabeled_data_loader, eval_callback
)

Create AutoQuant object (TensorFlow)

auto_quant = AutoQuantWithAutoMixedPrecision(
    model,
    eval_callback,
    unlabeled_dataset,
    param_bw=4,
    output_bw=8,
    quant_scheme=QuantScheme.post_training_tf,
    config_file=get_path_for_per_channel_config(),
)

Create AutoQuant object (ONNX)

dummy_input = {'input': np.random.randn(*input_shape).astype(np.float32)}
auto_quant = AutoQuantWithAutoMixedPrecision(
    model,
    dummy_input,
    unlabeled_data_loader,
    eval_callback,
    param_bw=4,
    output_bw=8,
    config_file=get_path_for_per_channel_config(),
)

Step 5

Set AdaRound params (PyTorch)

from aimet_torch.adaround.adaround_weight import AdaroundParameters

ADAROUND_DATASET_SIZE = 128
adaround_data_loader = CustomDataLoader(
    dataset, BATCH_SIZE, math.ceil(ADAROUND_DATASET_SIZE / BATCH_SIZE)
)
adaround_params = AdaroundParameters(
    adaround_data_loader, num_batches=len(adaround_data_loader)
)
auto_quant.set_adaround_params(adaround_params)

Set AdaRound params (TensorFlow)

adaround_params = AdaroundParameters(
    unlabeled_dataset, num_batches=NUM_CALIBRATION_SAMPLES // BATCH_SIZE
)
auto_quant.set_adaround_params(adaround_params)

Set AdaRound params (ONNX)

adaround_params = AdaroundParameters(
    unlabeled_data_loader, num_batches=len(unlabeled_data_loader)
)
auto_quant.set_adaround_params(adaround_params)

Step 6

Set AMP params (PyTorch)

from aimet_common.defs import QuantizationDataType

W8A8 = (
    (8, QuantizationDataType.int),  # A: int8
    (8, QuantizationDataType.int),  # W: int8
)
W8A16 = (
    (16, QuantizationDataType.int),  # A: int16
    (8, QuantizationDataType.int),  # W: int8
)
auto_quant.set_mixed_precision_params(candidates=[W8A16, W8A8])

Set AMP params (TensorFlow)

W4A8 = (
    (8, QuantizationDataType.int),  # A: int8
    (4, QuantizationDataType.int),  # W: int4
)
W8A8 = (
    (8, QuantizationDataType.int),  # A: int8
    (8, QuantizationDataType.int),  # W: int8
)
auto_quant.set_mixed_precision_params(candidates=[W4A8, W8A8])

Set AMP params (ONNX)

W4A8 = (
    (8, QuantizationDataType.int),  # A: int8
    (4, QuantizationDataType.int),  # W: int4
)
W8A8 = (
    (8, QuantizationDataType.int),  # A: int8
    (8, QuantizationDataType.int),  # W: int8
)
auto_quant.set_mixed_precision_params(candidates=[W4A8, W8A8])

Step 7

Run AutoQuant (PyTorch)

sim, initial_accuracy = auto_quant.run_inference()
model, optimized_accuracy, encoding_path, pareto_front = auto_quant.optimize(allowed_accuracy_drop=0.01)

print(f"- Quantized Accuracy (before optimization): {initial_accuracy:.4f}")
print(f"- Quantized Accuracy (after optimization):  {optimized_accuracy:.4f}")

Run AutoQuant (TensorFlow)

sim, initial_accuracy = auto_quant.run_inference()
model, optimized_accuracy, encoding_path, pareto_front = auto_quant.optimize(
    allowed_accuracy_drop=0.01
)

print(f'- Quantized Accuracy (before optimization): {initial_accuracy:.4f}')
print(f'- Quantized Accuracy (after optimization):  {optimized_accuracy:.4f}')

Example output:

- Quantized Accuracy (before optimization): 0.0235
- Quantized Accuracy (after optimization):  0.7164

Run AutoQuant (ONNX)

sim, initial_accuracy = auto_quant.run_inference()
model, optimized_accuracy, encoding_path, pareto_front = auto_quant.optimize(
    allowed_accuracy_drop=0.01
)

print(f'- Quantized Accuracy (before optimization): {initial_accuracy:.4f}')
print(f'- Quantized Accuracy (after optimization):  {optimized_accuracy:.4f}')

Example output:

- Quantized Accuracy (before optimization): 0.0235
- Quantized Accuracy (after optimization):  0.7164

API

Top-level API (PyTorch)

class aimet_torch.auto_quant.AutoQuantWithAutoMixedPrecision(model, dummy_input, data_loader, eval_callback, param_bw=8, output_bw=8, quant_scheme=QuantScheme.post_training_tf_enhanced, rounding_mode='nearest', config_file=None, results_dir='/tmp', cache_id=None, strict_validation=True, model_prepare_required=True)[source]

Integrate and apply post-training quantization techniques.

AutoQuant includes 1) batchnorm folding, 2) cross-layer equalization, 3) Adaround, and 4) Automatic Mixed Precision (if enabled). These techniques will be applied in a best-effort manner until the model meets the evaluation goal given as allowed_accuracy_drop.

Parameters:
  • model (Module) – Model to be quantized. Assumes model is on the correct device

  • dummy_input (Union[Tensor, Tuple]) – Dummy input for the model. Assumes that dummy_input is on the correct device

  • data_loader (DataLoader) – A collection that iterates over an unlabeled dataset, used for computing encodings

  • eval_callback (Callable[[Module], float]) – Function that calculates the evaluation score

  • param_bw (int) – Parameter bitwidth

  • output_bw (int) – Output bitwidth

  • quant_scheme (QuantScheme) – Quantization scheme

  • rounding_mode (str) – Rounding mode

  • config_file (Optional[str]) – Path to configuration file for model quantizers

  • results_dir (str) – Directory to save the results of PTQ techniques

  • cache_id (Optional[str]) – ID associated with cache results

  • strict_validation (bool) – Flag set to True by default. When False, AutoQuant will proceed with execution and handle errors internally if possible. This may produce suboptimal or unintuitive results.

  • model_prepare_required (bool) – Flag set to True by default. If False, AutoQuant will skip the model preparer block in the pipeline.

run_inference()[source]

Creates a quantization model and performs inference

Return type:

Tuple[QuantizationSimModel, float]

Returns:

QuantizationSimModel, model accuracy as float

optimize(allowed_accuracy_drop=0.0)[source]

Integrate and apply post-training quantization techniques.

Parameters:

allowed_accuracy_drop (float) – Maximum allowed accuracy drop

Return type:

Tuple[Module, float, str, List[Tuple[int, float, QuantizerGroup, Tuple]]]

Returns:

Tuple of (best model, eval score, encoding path, pareto front). Pareto front is None if AMP is not enabled or AutoQuant exits without performing AMP.
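
For example, the returned tuple can be unpacked and the pareto front inspected as follows. This is a sketch; the 0.01 allowed accuracy drop is arbitrary, and each pareto-front entry follows the type signature above:

model, accuracy, encoding_path, pareto_front = auto_quant.optimize(
    allowed_accuracy_drop=0.01
)
if pareto_front is not None:  # None unless AMP actually ran
    for relative_bit_ops, eval_score, quantizer_group, candidate in pareto_front:
        print(relative_bit_ops, eval_score, quantizer_group, candidate)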

set_adaround_params(adaround_params)[source]

Set Adaround parameters. If this method is not called explicitly by the user, AutoQuant will use data_loader (passed to __init__) for Adaround.

Parameters:

adaround_params (AdaroundParameters) – Adaround parameters.

Return type:

None

set_export_params(onnx_export_args=-1, propagate_encodings=None)[source]

Set parameters for QuantizationSimModel.export.

Parameters:
  • onnx_export_args (OnnxExportApiArgs) – Optional export argument with ONNX-specific overrides. If not provided, the model is exported via the TorchScript graph.

  • propagate_encodings (Optional[bool]) – If True, encoding entries for intermediate ops (when one PyTorch op results in multiple ONNX nodes) are filled with the same bitwidth and data type as the output tensor for that series of ops.

Return type:

None
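
A minimal usage sketch, assuming OnnxExportApiArgs is importable from aimet_torch.onnx_utils (the exact import path may vary across AIMET versions):

from aimet_torch.onnx_utils import OnnxExportApiArgs

# Export through ONNX with a pinned opset instead of the TorchScript graph
auto_quant.set_export_params(
    onnx_export_args=OnnxExportApiArgs(opset_version=11),
    propagate_encodings=False,
)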

set_mixed_precision_params(candidates, num_samples_for_phase_1=128, forward_fn=<function _default_forward_fn>, num_samples_for_phase_2=None)[source]

Set mixed precision parameters. NOTE: Automatic mixed precision will NOT be enabled unless this method is explicitly called by the user.

Parameters:
  • candidates (List[Tuple[Tuple[int, QuantizationDataType], Tuple[int, QuantizationDataType]]]) – List of tuples of candidate bitwidths and datatypes.

  • num_samples_for_phase_1 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 1.

  • forward_fn (Callable) – Function that runs a forward pass and returns the output tensor, which will be used for SQNR computation in phase 1. This function is expected to take (1) a model and (2) a single batch yielded from the data loader, and to return a single torch.Tensor object representing the output of the model. The default forward function is roughly equivalent to lambda model, batch: model(batch). (See the sketch following this method's description.)

  • num_samples_for_phase_2 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 2.

Return type:

None
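
If the data loader yields something other than a bare input tensor, pass a custom forward_fn. A sketch, assuming the loader yields (image, label) pairs and reusing the W8A16/W8A8 candidates from Step 6:

def forward_fn(model, batch):
    images, _ = batch  # assumption: the data loader yields (image, label) pairs
    return model(images)

auto_quant.set_mixed_precision_params(
    candidates=[W8A16, W8A8], forward_fn=forward_fn
)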

set_model_preparer_params(modules_to_exclude=None, concrete_args=None)[source]

Set parameters for model preparer.

Parameters:
  • modules_to_exclude (Optional[List[Module]]) – List of modules to exclude when tracing.

  • concrete_args (Optional[Dict[str, Any]]) – Parameter for the model preparer. Allows you to partially specialize your function, for example to remove control flow or data structures. If the model has control flow, torch.fx won't be able to trace it. See the torch.fx.symbolic_trace API for details, and the sketch below.
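
A usage sketch; the module and argument names below are hypothetical and depend on your model's forward() signature:

auto_quant.set_model_preparer_params(
    # Hypothetical: exclude a submodule that torch.fx cannot trace
    modules_to_exclude=[model.custom_block],
    # Hypothetical: pin a control-flow argument of forward() to a constant
    concrete_args={'use_aux_head': False},
)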

get_quant_scheme_candidates()[source]

Return the candidates for quant scheme search. During optimize(), the candidate with the highest accuracy will be selected among them.

Return type:

Tuple[_QuantSchemePair, ...]

Returns:

Candidates for quant scheme search

set_quant_scheme_candidates(candidates)[source]

Set candidates for quant scheme search. During optimize(), the candidate with the highest accuracy will be selected among them.

Parameters:

candidates (Tuple[_QuantSchemePair, ...]) – Candidates for quant scheme search
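
Because _QuantSchemePair is internal, the practical pattern is to filter the default candidates rather than construct new ones. A sketch:

candidates = auto_quant.get_quant_scheme_candidates()
# For example, keep only the first candidate to shorten the search
auto_quant.set_quant_scheme_candidates(candidates[:1])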

Top-level API (TensorFlow)

class aimet_tensorflow.keras.auto_quant_v2.AutoQuantWithAutoMixedPrecision(model, eval_callback, dataset, param_bw=8, output_bw=8, quant_scheme=QuantScheme.post_training_tf_enhanced, rounding_mode='nearest', config_file=None, results_dir='/tmp', cache_id=None, strict_validation=True)[source]

Integrate and apply post-training quantization techniques.

AutoQuant includes 1) batchnorm folding, 2) cross-layer equalization, 3) Adaround, and 4) Automatic Mixed Precision (if enabled). These techniques will be applied in a best-effort manner until the model meets the evaluation goal given as allowed_accuracy_drop.

Parameters:
  • model (Model) – Model to be quantized. Assumes model is on the correct device

  • eval_callback (Callable[[Model], float]) – A function that maps a model and the number of samples to an evaluation score. This callback is expected to return a scalar value representing the model performance evaluated against exactly N samples, where N is the number of samples passed as the second argument of this callback. NOTE: If N is None, the model is expected to be evaluated against the whole evaluation dataset. (See the sketch after this parameter list.)

  • dataset (DatasetV2) – An unlabeled dataset for encoding computation. By default, this dataset will also be used for Adaround unless otherwise specified via self.set_adaround_params.

  • param_bw (int) – Parameter bitwidth

  • output_bw (int) – Output bitwidth

  • quant_scheme (QuantScheme) – Quantization scheme

  • rounding_mode (str) – Rounding mode

  • config_file (Optional[str]) – Path to configuration file for model quantizers

  • results_dir (str) – Directory to save the results of PTQ techniques

  • cache_id (Optional[str]) – ID associated with cache results

  • strict_validation (bool) – Flag set to True by default. When False, AutoQuant will proceed with execution and handle errors internally if possible. This may produce suboptimal or unintuitive results.
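
The Step 3 TensorFlow example above ignores N. An eval_callback that honors the N-samples contract might look like this sketch, reusing eval_dataset and BATCH_SIZE from Step 2:

import math
from typing import Optional

def eval_callback(model: tf.keras.Model, num_samples: Optional[int] = None) -> float:
    data = eval_dataset
    if num_samples is not None:
        # Evaluate against roughly num_samples samples (batch granularity)
        data = eval_dataset.take(math.ceil(num_samples / BATCH_SIZE))
    model.compile(
        loss=losses.CategoricalCrossentropy(),
        metrics=[metrics.CategoricalAccuracy()],
    )
    _, acc = model.evaluate(data)
    return acc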

run_inference()[source]

Creates a quantization model and performs inference

Return type:

Tuple[QuantizationSimModel, float]

Returns:

QuantizationSimModel, model accuracy as float

optimize(allowed_accuracy_drop=0.0)[source]

Integrate and apply post-training quantization techniques.

Parameters:

allowed_accuracy_drop (float) – Maximum allowed accuracy drop

Return type:

Tuple[Model, float, str, List[Tuple[int, float, QuantizerGroup, Tuple]]]

Returns:

Tuple of (best model, eval score, encoding path, pareto front). Pareto front is None if AMP is not enabled or AutoQuant exits without performing AMP.

set_adaround_params(adaround_params)[source]

Set Adaround parameters. If this method is not called explicitly by the user, AutoQuant will use dataset (passed to __init__) for Adaround.

Parameters:

adaround_params (AdaroundParameters) – Adaround parameters.

set_mixed_precision_params(candidates, num_samples_for_phase_1=128, forward_fn=<function _default_forward_fn>, num_samples_for_phase_2=None)[source]

Set mixed precision parameters. NOTE: Automatic mixed precision will NOT be enabled unless this method is explicitly called by the user.

Parameters:
  • candidates (List[Tuple[Tuple[int, QuantizationDataType], Tuple[int, QuantizationDataType]]]) – List of tuples of candidate bitwidths and datatypes.

  • num_samples_for_phase_1 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 1.

  • forward_fn (Callable) – Function that runs a forward pass and returns the output tensor, which will be used for SQNR computation in phase 1. This function is expected to take (1) a model and (2) a single batch yielded from the dataset, and to return a single tf.Tensor object representing the output of the model.

  • num_samples_for_phase_2 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 2.

get_quant_scheme_candidates()[source]

Return the candidates for quant scheme search. During optimize(), the candidate with the highest accuracy will be selected among them.

Return type:

Tuple[_QuantSchemePair, ...]

Returns:

Candidates for quant scheme search

set_quant_scheme_candidates(candidates)[source]

Set candidates for quant scheme search. During optimize(), the candidate with the highest accuracy will be selected among them.

Parameters:

candidates (Tuple[_QuantSchemePair, ...]) – Candidates for quant scheme search

Top-level API (ONNX)

class aimet_onnx.auto_quant_v2.AutoQuantWithAutoMixedPrecision(model, dummy_input, data_loader, eval_callback, param_bw=8, output_bw=8, quant_scheme=QuantScheme.post_training_tf_enhanced, rounding_mode='nearest', use_cuda=True, device=0, config_file=None, results_dir='/tmp', cache_id=None, strict_validation=True)[source]

Integrate and apply post-training quantization techniques.

AutoQuant includes 1) batchnorm folding, 2) cross-layer equalization, 3) Adaround, and 4) Automatic Mixed Precision (if enabled). These techniques will be applied in a best-effort manner until the model meets the evaluation goal given as allowed_accuracy_drop.

Parameters:
  • model (ONNXModel) – Model to be quantized.

  • dummy_input (Dict[str, ndarray]) – Dummy input dict for the model.

  • data_loader (DataLoader) – A collection that iterates over an unlabeled dataset, used for computing encodings

  • eval_callback (Callable[[InferenceSession, int], float]) – Function that calculates the evaluation score given the model session

  • param_bw (int) – Parameter bitwidth

  • output_bw (int) – Output bitwidth

  • quant_scheme (QuantScheme) – Quantization scheme

  • rounding_mode (str) – Rounding mode

  • use_cuda (bool) – True if using CUDA to run quantization op. False otherwise.

  • config_file (Optional[str]) – Path to configuration file for model quantizers

  • results_dir (str) – Directory to save the results of PTQ techniques

  • cache_id (Optional[str]) – ID associated with cache results

  • strict_validation (bool) – Flag set to True by default. When False, AutoQuant will proceed with execution and handle errors internally if possible. This may produce suboptimal or unintuitive results.

run_inference()[source]

Creates a quantization model and performs inference

Return type:

Tuple[QuantizationSimModel, float]

Returns:

QuantizationSimModel, model accuracy as float

optimize(allowed_accuracy_drop=0.0)[source]

Integrate and apply post-training quantization techniques.

Parameters:

allowed_accuracy_drop (float) – Maximum allowed accuracy drop

Return type:

Tuple[ONNXModel, float, str, List[Tuple[int, float, QuantizerGroup, Tuple]]]

Returns:

Tuple of (best model, eval score, encoding path, pareto front). Pareto front is None if AMP is not enabled or AutoQuant exits without performing AMP.

set_adaround_params(adaround_params)[source]

Set Adaround parameters. If this method is not called explicitly by the user, AutoQuant will use data_loader (passed to __init__) for Adaround.

Parameters:

adaround_params (AdaroundParameters) – Adaround parameters.

Return type:

None

set_mixed_precision_params(candidates, num_samples_for_phase_1=128, forward_fn=<function _default_forward_fn>, num_samples_for_phase_2=None)[source]

Set mixed precision parameters. NOTE: Automatic mixed precision will NOT be enabled unless this method is explicitly called by the user.

Parameters:
  • candidates (List[Tuple[Tuple[int, QuantizationDataType], Tuple[int, QuantizationDataType]]]) – List of tuples of candidate bitwidths and datatypes.

  • num_samples_for_phase_1 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 1.

  • forward_fn (Callable) – Function that runs a forward pass and returns the output tensor, which will be used for SQNR computation in phase 1. This function is expected to take (1) a model and (2) a single batch yielded from the data loader, and to return a single np.ndarray object representing the output of the model.

  • num_samples_for_phase_2 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 2.

Return type:

None

get_quant_scheme_candidates()[source]

Return the candidates for quant scheme search. During optimize(), the candidate with the highest accuracy will be selected among them.

Return type:

Tuple[_QuantSchemePair, ...]

Returns:

Candidates for quant scheme search

set_quant_scheme_candidates(candidates)[source]

Set candidates for quant scheme search. During optimize(), the candidate with the highest accuracy will be selected among them.

Parameters:

candidates (Tuple[_QuantSchemePair, ...]) – Candidates for quant scheme search