AIMET ONNX Quant Analyzer API

AIMET ONNX Quant Analyzer analyzes an ONNX model and identifies the layers that are most sensitive to quantization. It checks model sensitivity to weight and activation quantization, and performs per-layer sensitivity and MSE analysis. It also exports the per-layer encoding min and max ranges and a statistics histogram for every layer.

Top-level API

class aimet_onnx.quant_analyzer.QuantAnalyzer(model, dummy_input, forward_pass_callback, eval_callback)[source]

QuantAnalyzer provides the following utilities:

  1. model sensitivity to weight and activation quantization

  2. per layer sensitivity analysis

  3. per layer encoding (min - max range)

  4. per layer quantizer histogram analysis and

  5. per layer MSE analysis

Parameters:
  • model (Union[ModelProto, ONNXModel]) – FP32 model to analyze for quantization.

  • dummy_input (Dict[str, ndarray]) – Dummy input to model.

  • forward_pass_callback (CallbackFunc) – A callback function for model calibration that simply runs forward passes on the model to compute encodings (delta/offset). This callback function should use representative data, which should be a subset of the entire train/validation dataset (~1000 images/samples).

  • eval_callback (CallbackFunc) – A callback function for model evaluation that determines model performance. This callback function is expected to return a scalar value representing the model performance evaluated against the entire test/evaluation dataset.


QuantAnalyzer.enable_per_layer_mse_loss(unlabeled_dataset_iterable, num_batches)[source]

Enables per layer MSE loss analysis.

Parameters:
  • unlabeled_dataset_iterable (Iterable) – A collection (i.e. iterable with __len__) that iterates over an unlabeled dataset. The values yielded by this iterable are expected to be able to be passed directly to the model.

  • num_batches (int) – Number of batches. Approximately 256 samples/images are recommended, so with a data loader batch size of 64, 4 batches yield 256 samples/images.
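The batch arithmetic above can be sketched as a small helper. The function name batches_for_samples and its default of 256 samples are illustrative only and are not part of AIMET.

```python
import math


def batches_for_samples(batch_size: int, target_samples: int = 256) -> int:
    """Return the smallest number of batches that covers target_samples."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(target_samples / batch_size)


print(batches_for_samples(64))   # 4
print(batches_for_samples(100))  # 3
```

The result can be passed as num_batches; rounding up means the sample count may slightly exceed the target when the batch size does not divide it evenly.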


QuantAnalyzer.analyze(quant_scheme=QuantScheme.post_training_tf_enhanced, default_param_bw=8, default_activation_bw=8, config_file=None, results_dir='./tmp/')[source]
Analyzes the model for quantization and points out sensitive parts/hotspots of the model by performing:
  1. model sensitivity analysis to quantization,

  2. per layer sensitivity analysis by enabling and disabling quantizers,

  3. export of per layer encoding min - max ranges,

  4. export of per layer quantizer stats histograms,

  5. per layer MSE analysis.

Parameters:
  • quant_scheme (QuantScheme) – Quantization scheme. Supported values are QuantScheme.post_training_tf or QuantScheme.post_training_tf_enhanced.

  • default_param_bw (int) – Default bitwidth (4-31) to use for quantizing layer parameters.

  • default_activation_bw (int) – Default bitwidth (4-31) to use for quantizing layer inputs and outputs.

  • config_file (Optional[str]) – Path to configuration file for model quantizers.

  • results_dir (str) – Directory to save the results.

Run specific utility

It is not necessary to run every utility that Quant Analyzer offers; individual utilities can be run on their own. To do so, first obtain the quantsim object from create_quantsim_and_encodings(), then call the desired Quant Analyzer utility and pass the quantsim object to it.

QuantAnalyzer.create_quantsim_and_encodings(quant_scheme, default_param_bw, default_activation_bw, config_file)[source]

Creates quantsim object and computes encodings.

Parameters:
  • quant_scheme (QuantScheme) – Quantization scheme.

  • default_param_bw (int) – Default bitwidth (4-31) to use for quantizing layer parameters.

  • default_activation_bw (int) – Default bitwidth (4-31) to use for quantizing layer inputs and outputs.

  • config_file (str) – Path to configuration file for model quantizers.

Return type:

QuantizationSimModel

Returns:

Quantsim object.
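A minimal sketch of this flow, using only method names documented in this section. The wrapper function run_selected_utilities is hypothetical, and it assumes a QuantAnalyzer instance has already been constructed and that quant_scheme is a QuantScheme value such as QuantScheme.post_training_tf_enhanced.

```python
def run_selected_utilities(quant_analyzer, quant_scheme, results_dir="./tmp/"):
    """Run only the sensitivity check and the min/max range export.

    quant_analyzer: an already-constructed QuantAnalyzer (assumed).
    quant_scheme:   a QuantScheme value chosen by the caller (assumed).
    """
    # Create the quantsim object once; it is shared by every utility below.
    sim = quant_analyzer.create_quantsim_and_encodings(
        quant_scheme=quant_scheme,
        default_param_bw=8,
        default_activation_bw=8,
        config_file=None,
    )

    # Utility 1: model-level sensitivity to weight and activation quantization.
    scores = quant_analyzer.check_model_sensitivity_to_quantization(sim)

    # Utility 2: export per layer encoding min/max ranges to results_dir.
    quant_analyzer.export_per_layer_encoding_min_max_range(sim, results_dir=results_dir)

    return scores
```

Any of the other utilities that take a sim argument could be swapped in the same way.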


QuantAnalyzer.check_model_sensitivity_to_quantization(sim)[source]

Performs model sensitivity analysis to weight and activation quantization individually.

Parameters:
  • sim (QuantizationSimModel) – Quantsim model.

Return type:

Tuple[float, float, float]

Returns:

FP32 eval score, weight-quantized eval score, act-quantized eval score.
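One way to read the returned tuple is to compare each quantized score against the FP32 score. The sketch below assumes that a higher eval score is better; the helper name quantization_drops and the scores are made up for illustration.

```python
from typing import Tuple


def quantization_drops(scores: Tuple[float, float, float]) -> Tuple[float, float]:
    """Return (weight_drop, activation_drop) relative to the FP32 score."""
    fp32_score, weight_quantized_score, act_quantized_score = scores
    return fp32_score - weight_quantized_score, fp32_score - act_quantized_score


# Made-up scores in the order returned by the API.
weight_drop, act_drop = quantization_drops((0.76, 0.75, 0.60))
more_sensitive = "activations" if act_drop > weight_drop else "weights"
print(more_sensitive)  # activations
```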


QuantAnalyzer.perform_per_layer_analysis_by_enabling_quantizers(sim, results_dir)[source]

Performs layer-wise quantization sensitivity analysis by enabling each layer's quantizers in turn:

  1. All parameter and activation quantizers are disabled.

  2. For every layer, in order of occurrence:
    a. The layer's parameter and activation quantizers are enabled as per the JSON config file and set to the specified bit-width.

    b. The eval score is measured and recorded on a subset of the dataset.

    c. The quantizers enabled in step a are disabled again.

  3. A dictionary containing each layer name and its corresponding eval score is returned.

Parameters:
  • sim (QuantizationSimModel) – Quantsim model.

  • results_dir (str) – Directory to save the results.

Return type:

Dict

Returns:

Layer-wise eval score dictionary: dict[layer_name] = eval_score.
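Because each layer is quantized in isolation here, the layers with the lowest scores are the likely hotspots. A minimal sketch of ranking the returned dictionary, assuming a higher eval score is better; the helper name and the scores are invented for illustration.

```python
from typing import Dict, List, Tuple


def most_sensitive_when_enabled(layer_scores: Dict[str, float],
                                top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k layers with the lowest eval score, worst first."""
    return sorted(layer_scores.items(), key=lambda item: item[1])[:top_k]


# Made-up result in the documented dict[layer_name] = eval_score shape.
example_scores = {"conv1": 0.74, "conv2": 0.52, "fc": 0.70}
print(most_sensitive_when_enabled(example_scores, top_k=2))
# [('conv2', 0.52), ('fc', 0.7)]
```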


QuantAnalyzer.perform_per_layer_analysis_by_disabling_quantizers(sim, results_dir)[source]

Performs layer-wise quantization sensitivity analysis by disabling each layer's quantizers in turn:

  1. All parameter and activation quantizers are enabled as per the JSON config file and set to the specified bit-width.

  2. For every layer, in order of occurrence:
    a. The layer's parameter and activation quantizers are disabled.

    b. The eval score is measured and recorded on a subset of the dataset.

    c. The quantizers disabled in step a are enabled again.

  3. A dictionary containing each layer name and its corresponding eval score is returned.

Parameters:
  • sim (QuantizationSimModel) – Quantsim model.

  • results_dir (str) – Directory to save the results.

Return type:

Dict

Returns:

Layer-wise eval score dictionary: dict[layer_name] = eval_score.
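The enabling and disabling analyses can be cross-checked against each other. The sketch below is one possible heuristic, not an AIMET feature: it keeps layers that score among the lowest when quantized alone and among the highest when left unquantized alone, assuming a higher eval score is better. The function name and all scores are hypothetical.

```python
from typing import Dict, List


def flagged_by_both(enabled_scores: Dict[str, float],
                    disabled_scores: Dict[str, float],
                    top_k: int = 3) -> List[str]:
    """Return layers flagged as sensitive by both per layer analyses."""
    # Enabling analysis: a low score means quantizing this layer alone hurts.
    worst_enabled = sorted(enabled_scores, key=enabled_scores.get)[:top_k]
    # Disabling analysis: a high score means un-quantizing this layer alone helps.
    best_disabled = set(
        sorted(disabled_scores, key=disabled_scores.get, reverse=True)[:top_k]
    )
    return [name for name in worst_enabled if name in best_disabled]


# Made-up dictionaries in the documented dict[layer_name] = eval_score shape.
enabled = {"conv1": 0.74, "conv2": 0.52, "fc": 0.70}
disabled = {"conv1": 0.50, "conv2": 0.71, "fc": 0.49}
print(flagged_by_both(enabled, disabled, top_k=2))  # ['conv2']
```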


QuantAnalyzer.export_per_layer_encoding_min_max_range(sim, results_dir)[source]

Exports the encoding min and max ranges for all weights and activations. results_dir contains HTML files in the following layout:

-results_dir
    -activations.html
    -weights.html

If per-channel quantization (PCQ) is enabled:

-results_dir
    -activations.html
    -{layer_name}_{param_name}.html

Parameters:
  • sim (QuantizationSimModel) – Quantsim model.

  • results_dir (str) – Directory to save the results.

Return type:

Tuple[Dict, Dict]

Returns:

Layer-wise min-max ranges for weights and activations.
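To see why the min-max range matters, recall that a uniform quantizer divides the range into equal steps, so a wider range at the same bitwidth gives a coarser grid. The sketch below uses the standard step-size formula for uniform quantization; it is not taken from AIMET's code, and the layer names and ranges are invented. The exact structure of the returned dictionaries is not assumed here.

```python
def quantization_step(enc_min: float, enc_max: float, bitwidth: int = 8) -> float:
    """Step size (delta) of a uniform grid spanning [enc_min, enc_max]."""
    if enc_max <= enc_min:
        raise ValueError("enc_max must be greater than enc_min")
    return (enc_max - enc_min) / (2 ** bitwidth - 1)


# Made-up (min, max) pairs for two hypothetical tensors.
ranges = {"conv1.weight": (-0.5, 0.5), "conv2.weight": (-6.0, 6.0)}
steps = {name: quantization_step(lo, hi) for name, (lo, hi) in ranges.items()}

# The tensor with the widest range has the coarsest 8-bit grid.
coarsest = max(steps, key=steps.get)
print(coarsest)  # conv2.weight
```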


QuantAnalyzer.export_per_layer_stats_histogram(sim, results_dir)[source]

NOTE: Invoke this API only when the quantization scheme is TF-Enhanced.

Exports histograms that represent the PDF of the statistics collected by each quantizer. After invoking this API, results_dir should contain HTML files in the following layout for every quantizer in the model:

-results_dir
    -activations_pdf
        name_{input/output}_{index}.html
    -weights_pdf
        -name
            param_name_{channel_index}.html

Parameters:
  • sim (QuantizationSimModel) – Quantsim model.

  • results_dir (str) – Directory to save the results.
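As an illustration of what a histogram that represents a PDF means, the NumPy sketch below bins some synthetic values with density normalization, so that the bar areas sum to one. This is only an analogy for reading the exported plots; it does not reproduce how AIMET collects or bins its statistics.

```python
import numpy as np

# Synthetic stand-in for tensor values observed by a quantizer.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=10_000)

# density=True scales the counts so the histogram integrates to 1 (a PDF).
density, bin_edges = np.histogram(values, bins=64, density=True)

# Area under the histogram: sum of bar height times bar width.
area = float(np.sum(density * np.diff(bin_edges)))
print(round(area, 6))  # 1.0
```

Long, thin tails in such a plot indicate outliers that stretch the encoding range.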


QuantAnalyzer.export_per_layer_mse_loss(sim, results_dir)[source]

Exports the MSE loss between the FP32 and quantized output activations for each layer.

Parameters:
  • sim (QuantizationSimModel) – Quantsim model.

  • results_dir (str) – Directory to save the results.

Return type:

Dict

Returns:

Layer-wise MSE loss: dict[layer_name] = MSE loss.
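For reference, the mean squared error between two activation tensors can be computed as below. This is the textbook definition written with NumPy, offered as a sketch of the quantity being reported rather than AIMET's own implementation; the arrays are made up.

```python
import numpy as np


def mse_loss(fp32_output: np.ndarray, quantized_output: np.ndarray) -> float:
    """Mean squared error between FP32 and quantized activations."""
    diff = fp32_output.astype(np.float64) - quantized_output.astype(np.float64)
    return float(np.mean(diff ** 2))


fp32_out = np.array([0.0, 1.0, 2.0, 3.0])
quant_out = np.array([0.0, 1.0, 2.0, 5.0])
print(mse_loss(fp32_out, quant_out))  # 1.0
```

A layer with a large MSE is one whose quantized output deviates most from its FP32 output.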


Code Examples

Required imports

from typing import Any
import numpy as np
from onnxruntime import InferenceSession

from aimet_common.defs import QuantScheme
from aimet_common.utils import CallbackFunc

from aimet_onnx.quant_analyzer import QuantAnalyzer

Prepare forward pass callback

# NOTE: In the actual use cases, the users should implement this part to serve
#       their own goals if necessary.
def forward_pass_callback(session: InferenceSession, _: Any = None) -> None:
    """
    NOTE: This is intended to be the user-defined model calibration function.
    AIMET requires the above signature. So if the user's calibration function does not
    match this signature, please create a simple wrapper around this callback function.

    A callback function for model calibration that simply runs forward passes on the model to
    compute encoding (delta/offset). This callback function should use representative data and should
    be subset of entire train/validation dataset (~1000 images/samples).

    :param session: OnnxRuntime Inference Session.
    :param _: Argument(s) of this callback function. Up to the user to determine the type of this parameter.
    E.g. could be simply an integer representing the number of data samples to use. Or could be a tuple of
    parameters or an object representing something more complex.
    """
    # User action required
    # User should create data loader/iterable using representative dataset and simply run
    # forward passes on the model.
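A minimal sketch of what a filled-in calibration callback might look like. It assumes a single-input model and that the callback argument is an iterable of NumPy batches; the name calibration_forward_pass is hypothetical. The calls session.get_inputs() and session.run() are part of the onnxruntime InferenceSession API.

```python
from typing import Iterable

import numpy as np


def calibration_forward_pass(session, batches: Iterable[np.ndarray]) -> None:
    """Run forward passes over representative batches; outputs are discarded.

    session: an onnxruntime InferenceSession (single-input model assumed).
    batches: iterable of input arrays supplied as the callback argument.
    """
    input_name = session.get_inputs()[0].name
    for batch in batches:
        # Passing None as the first argument requests all model outputs.
        session.run(None, {input_name: batch})
```

The batches would be supplied through CallbackFunc(calibration_forward_pass, func_callback_args=batches).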

Prepare eval callback

# NOTE: In the actual use cases, the users should implement this part to serve
#       their own goals if necessary.
def eval_callback(session: InferenceSession, _: Any = None) -> float:
    """
    NOTE: This is intended to be the user-defined model evaluation function.
    AIMET requires the above signature. So if the user's evaluation function does not
    match this signature, please create a simple wrapper around this callback function.

    A callback function for model evaluation that determines model performance. This callback function is
    expected to return scalar value representing the model performance evaluated against entire
    test/evaluation dataset.

    :param session: OnnxRuntime Inference Session.
    :param _: Argument(s) of this callback function. Up to the user to determine the type of this parameter.
    E.g. could be simply an integer representing the number of data samples to use. Or could be a tuple of
    parameters or an object representing something more complex.
    :return: Scalar value representing the model performance.
    """
    # User action required
    # User should create data loader/iterable using entire test/evaluation dataset, perform forward passes on
    # the model and return single scalar value representing the model performance.
    return 0.8  # Placeholder score; replace with the measured performance.
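A minimal sketch of a filled-in evaluation callback that returns top-1 accuracy. It assumes a single-input classification model whose first output holds per-class scores, and that the callback argument is an iterable of (images, labels) pairs; the name top1_accuracy_callback is hypothetical.

```python
from typing import Iterable, Tuple

import numpy as np


def top1_accuracy_callback(session,
                           labeled_batches: Iterable[Tuple[np.ndarray, np.ndarray]]) -> float:
    """Return top-1 accuracy over the given (images, labels) batches."""
    input_name = session.get_inputs()[0].name
    correct = 0
    total = 0
    for images, labels in labeled_batches:
        # First model output is assumed to have shape (batch, num_classes).
        logits = session.run(None, {input_name: images})[0]
        predictions = np.argmax(logits, axis=1)
        correct += int(np.sum(predictions == labels))
        total += len(labels)
    return correct / total if total else 0.0
```

The labeled batches would be supplied through CallbackFunc(top1_accuracy_callback, func_callback_args=labeled_batches).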

Prepare model, callback functions and dataloader

    # Model() is a placeholder: build or load your own ONNX model here.
    onnx_model = Model()

    input_shape = (1, 3, 224, 224)
    dummy_data = np.random.randn(*input_shape).astype(np.float32)
    dummy_input = {'input': dummy_data}

    # User action required
    # User should pass actual argument(s) of the callback functions.
    forward_pass_callback_fn = CallbackFunc(forward_pass_callback, func_callback_args=None)
    eval_callback_fn = CallbackFunc(eval_callback, func_callback_args=None)

    # User action required
    # User should use an unlabeled dataloader, so if the dataloader also yields labels, the user should discard them.
    # _get_unlabled_data_loader() stands for a user-defined helper.
    unlabeled_data_loader = _get_unlabled_data_loader()
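A minimal sketch of an unlabeled iterable with __len__, as required by enable_per_layer_mse_loss. The class name UnlabeledBatches is hypothetical, and yielding bare NumPy arrays is an assumption: whether the installed AIMET version expects bare arrays or dictionaries keyed by input name should be checked before use.

```python
import numpy as np


class UnlabeledBatches:
    """Iterable with __len__ that yields input batches only (no labels)."""

    def __init__(self, images: np.ndarray, batch_size: int):
        self._images = images
        self._batch_size = batch_size

    def __len__(self) -> int:
        # Number of batches, rounding up for a final partial batch.
        return -(-len(self._images) // self._batch_size)

    def __iter__(self):
        for start in range(0, len(self._images), self._batch_size):
            yield self._images[start:start + self._batch_size]


# Random data as a stand-in for a real representative dataset.
images = np.random.randn(10, 3, 8, 8).astype(np.float32)
loader = UnlabeledBatches(images, batch_size=4)
print(len(loader))  # 3
```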

Create QuantAnalyzer object

    quant_analyzer = QuantAnalyzer(model=onnx_model,
                                   dummy_input=dummy_input,
                                   forward_pass_callback=forward_pass_callback_fn,
                                   eval_callback=eval_callback_fn)
    # Approximately 256 images/samples are recommended for MSE loss analysis. So, if the dataloader
    # has a batch_size of 64, then 4 batches yield 256 images/samples.
    quant_analyzer.enable_per_layer_mse_loss(unlabeled_dataset_iterable=unlabeled_data_loader, num_batches=4)

Run QuantAnalyzer

    quant_analyzer.analyze(quant_scheme=QuantScheme.post_training_tf_enhanced,
                           default_param_bw=8,
                           default_activation_bw=8,
                           config_file=None,
                           results_dir="./quant_analyzer_results/")