aimet_onnx.auto_quant_v2
Top-level API
- class aimet_onnx.auto_quant_v2.AutoQuantWithAutoMixedPrecision(model, dummy_input, data_loader, eval_callback, param_bw=8, output_bw=8, quant_scheme=QuantScheme.post_training_tf_enhanced, rounding_mode='nearest', use_cuda=True, device=0, config_file=None, results_dir='/tmp', cache_id=None, strict_validation=True)[source]
Integrate and apply post-training quantization techniques.
AutoQuant includes 1) batchnorm folding, 2) cross-layer equalization, 3) Adaround, and 4) Automatic Mixed Precision (if enabled). These techniques will be applied in a best-effort manner until the model meets the evaluation goal given as allowed_accuracy_drop.
- Parameters:
  - model (ONNXModel) – Model to be quantized.
  - dummy_input (Dict[str, ndarray]) – Dummy input dict for the model.
  - data_loader (DataLoader) – A collection that iterates over an unlabeled dataset, used for computing encodings.
  - eval_callback (Callable[[InferenceSession, int], float]) – Function that calculates the evaluation score given the model session.
  - param_bw (int) – Parameter bitwidth.
  - output_bw (int) – Output bitwidth.
  - quant_scheme (QuantScheme) – Quantization scheme.
  - rounding_mode (str) – Rounding mode.
  - use_cuda (bool) – True if using CUDA to run quantization ops; False otherwise.
  - config_file (Optional[str]) – Path to the configuration file for model quantizers.
  - results_dir (str) – Directory to save the results of PTQ techniques.
  - cache_id (Optional[str]) – ID associated with cached results.
  - strict_validation (bool) – True by default. When False, AutoQuant proceeds with execution and handles errors internally if possible, which may produce suboptimal or unintuitive results.
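As a sketch of the eval_callback contract above (Callable[[InferenceSession, int], float]), the following shows one way to score a session. The dataset layout, input name, and sample labels are illustrative assumptions, not part of the AIMET API:

```python
import numpy as np

# Hypothetical labeled evaluation set: (input_dict, label) pairs.
# Real code would load actual data; shapes and names here are placeholders.
dataset = [({"input": np.zeros((1, 4), dtype=np.float32)}, i % 3)
           for i in range(8)]

def eval_callback(session, num_samples):
    """Matches Callable[[InferenceSession, int], float]: run the session
    on up to `num_samples` samples and return a scalar accuracy."""
    correct = 0
    for inputs, label in dataset[:num_samples]:
        logits = session.run(None, inputs)[0]  # standard InferenceSession.run
        correct += int(np.argmax(logits) == label)
    return correct / num_samples
```

AutoQuant invokes this callback on each candidate session and compares the returned scores against the allowed accuracy drop.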
- run_inference()[source]
  Creates a quantization sim model and performs inference.
  - Return type:
    Tuple[QuantizationSimModel, float]
  - Returns:
    The QuantizationSimModel and the model accuracy as a float.
- optimize(allowed_accuracy_drop=0.0)[source]
  Integrate and apply post-training quantization techniques.
  - Parameters:
    allowed_accuracy_drop (float) – Maximum allowed accuracy drop.
  - Return type:
    Tuple[ONNXModel, float, str, List[Tuple[int, float, QuantizerGroup, Tuple]]]
  - Returns:
    Tuple of (best model, eval score, encoding path, pareto front). The pareto front is None if AMP is not enabled or if AutoQuant exits without performing AMP.
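To illustrate the documented pareto-front shape (List[Tuple[int, float, QuantizerGroup, Tuple]]), here is a hedged sketch of consuming such a list; the entries and the helper function are hypothetical, not produced by or part of AIMET:

```python
# Hypothetical pareto front in the documented shape:
# (bitwidth, eval score, quantizer group, candidate). String group names
# stand in for QuantizerGroup objects for illustration only.
pareto_front = [
    (16, 0.752, "group_a", ((16, "int"), (16, "int"))),
    (8,  0.741, "group_b", ((8, "int"), (8, "int"))),
]

def best_under_drop(pareto, baseline_score, allowed_drop):
    """Illustrative helper (not AIMET API): pick the lowest-bitwidth
    entry whose eval score stays within `allowed_drop` of the baseline."""
    eligible = [e for e in pareto if baseline_score - e[1] <= allowed_drop]
    return min(eligible, key=lambda e: e[0]) if eligible else None
```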
- set_adaround_params(adaround_params)[source]
  Set Adaround parameters. If this method is not called explicitly by the user, AutoQuant will use data_loader (passed to __init__) for Adaround.
  - Parameters:
    adaround_params (AdaroundParameters) – Adaround parameters.
  - Return type:
    None
- set_mixed_precision_params(candidates, num_samples_for_phase_1=128, forward_fn=<function _default_forward_fn>, num_samples_for_phase_2=None)[source]
  Set mixed precision parameters. NOTE: Automatic mixed precision will NOT be enabled unless this method is explicitly called by the user.
  - Parameters:
    - candidates (List[Tuple[Tuple[int, QuantizationDataType], Tuple[int, QuantizationDataType]]]) – List of tuples of candidate bitwidths and data types.
    - num_samples_for_phase_1 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 1.
    - forward_fn (Callable) – Function that runs a forward pass and returns the output tensor, used for SQNR computation in phase 1. It is expected to take 1) a model and 2) a single batch yielded from the data loader, and to return a single np.ndarray representing the output of the model.
    - num_samples_for_phase_2 (Optional[int]) – Number of samples to be used for performance evaluation in AMP phase 2.
  - Return type:
    None
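The forward_fn contract described above (take a model and one batch, return a single np.ndarray) can be sketched as follows. The (input_dict, label) batch layout and the `model.session` attribute are assumptions about the caller's own wrapper, not documented AIMET attributes:

```python
import numpy as np

def forward_fn(model, batch):
    """Runs one forward pass and returns a single np.ndarray, as the
    phase-1 SQNR computation expects. How the session is reached from
    `model` (here, `model.session`) depends on your wrapper."""
    inputs, _ = batch              # drop labels; only outputs are needed
    outputs = model.session.run(None, inputs)
    return outputs[0]              # one output tensor, per the contract
```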