Post-Training Quantization Techniques
Adaptive rounding
Uses training data to learn whether to round each weight up or down, improving accuracy over naïve round-to-nearest.
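The sketch below illustrates the core idea under simplifying assumptions (a single linear layer, per-tensor symmetric quantization, and a hypothetical `calib_data` list of input batches). It is a conceptual illustration, not the AIMET AdaRound API.

```python
import torch

def adaround_linear(weight, calib_data, bitwidth=8, n_iters=1000, lr=1e-2):
    """Learn a per-weight rounding decision (floor vs. ceil) minimizing output MSE."""
    weight = weight.detach()
    scale = weight.abs().max() / (2 ** (bitwidth - 1) - 1)    # per-tensor scale
    w_floor = torch.floor(weight / scale)
    # Continuous variable; sigmoid(alpha) in (0, 1) is a soft round-up/round-down choice.
    alpha = torch.zeros_like(weight, requires_grad=True)
    opt = torch.optim.Adam([alpha], lr=lr)
    for _ in range(n_iters):
        h = torch.sigmoid(alpha)
        w_q = (w_floor + h).clamp(-2 ** (bitwidth - 1), 2 ** (bitwidth - 1) - 1) * scale
        loss = sum(torch.nn.functional.mse_loss(x @ w_q.T, x @ weight.T) for x in calib_data)
        # Regularizer nudging the soft choice toward a hard 0/1 decision.
        loss = loss + 0.01 * (1 - (2 * h - 1).abs().pow(3)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Freeze the learned rounding direction.
    return (w_floor + (torch.sigmoid(alpha) > 0.5).float()) * scale
```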
Sequential MSE
Sequential MSE (SeqMSE) searches for optimal quantization encodings per operation (i.e., per layer) such that the mean squared error between the original output activations and the corresponding quantization-aware output activations is minimized.
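A rough per-layer version of the search could look like the sketch below; in practice the search runs sequentially through the network and is often done per output channel. The `calib_x` activations and the simple linear candidate grid are assumptions for illustration, not the AIMET SeqMSE API.

```python
import torch

def quantize_dequantize(w, max_val, bitwidth=8):
    """Symmetric fake quantization of a weight tensor for a candidate encoding max."""
    scale = max_val / (2 ** (bitwidth - 1) - 1)
    w_int = torch.clamp(torch.round(w / scale), -2 ** (bitwidth - 1), 2 ** (bitwidth - 1) - 1)
    return w_int * scale

def seq_mse_linear(weight, calib_x, num_candidates=20):
    """Pick the candidate encoding range with the lowest output MSE for this layer."""
    weight = weight.detach()
    fp_out = calib_x @ weight.T                               # reference FP output
    best_mse, best_max = float("inf"), None
    for i in range(1, num_candidates + 1):
        cand_max = weight.abs().max() * i / num_candidates    # shrink the range step by step
        q_out = calib_x @ quantize_dequantize(weight, cand_max).T
        mse = torch.mean((fp_out - q_out) ** 2).item()
        if mse < best_mse:
            best_mse, best_max = mse, cand_max
    return best_max, best_mse
```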
Batch norm folding
Folds batch norm (BN) layers into adjacent convolution or linear layers.
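For a single Conv2d/BatchNorm2d pair, the folding arithmetic looks roughly like this sketch (not the AIMET batch-norm-folding API):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a Conv2d whose weight and bias absorb the BN affine transform."""
    w = conv.weight.clone()
    b = conv.bias.clone() if conv.bias is not None else torch.zeros(conv.out_channels)
    factor = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-output-channel scale
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.data = w * factor.reshape(-1, 1, 1, 1)
    fused.bias.data = (b - bn.running_mean) * factor + bn.bias
    return fused
```

In eval mode, `fused(x)` should match `bn(conv(x))` up to floating-point error.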
Cross-layer equalization
Scales parameter ranges across different channels, increasing the range for layers with a low range and reducing it for layers with a high range, so that the same quantization parameters can be used across all channels.
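A minimal sketch for two consecutive Linear layers separated by ReLU, using the common scale choice s = sqrt(r1 * r2) / r2; this is illustrative, not the AIMET CLE API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def equalize_pair(layer1: nn.Linear, layer2: nn.Linear) -> None:
    """Rescale channels so both layers see similar per-channel weight ranges."""
    r1 = layer1.weight.abs().amax(dim=1)          # range of each output channel of layer 1
    r2 = layer2.weight.abs().amax(dim=0)          # range of each input channel of layer 2
    s = (torch.sqrt(r1 * r2) / r2.clamp(min=1e-8)).clamp(min=1e-8)
    # W1 <- S^-1 W1, b1 <- S^-1 b1, W2 <- W2 S. ReLU commutes with positive per-channel
    # scaling, so the end-to-end floating-point function is unchanged.
    layer1.weight.div_(s.unsqueeze(1))
    if layer1.bias is not None:
        layer1.bias.div_(s)
    layer2.weight.mul_(s.unsqueeze(0))
```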
AdaScale
AdaScale is a PTQ technique that improves the accuracy of the quantized model by introducing learnable parameters into the weight quantizers and performing blockwise knowledge distillation (BKD) against the corresponding floating-point (FP) outputs.
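The sketch below shows the general shape of such a scheme, with a single learnable per-channel multiplier on the weight-quantizer scale and a blockwise distillation loop; the actual set of learnable parameters in AdaScale is richer than this, and `calib_inputs`, the 4-bit default, and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def fake_quant(w, scale, qmin, qmax):
    """Symmetric fake quantization with a straight-through estimator for rounding."""
    w_div = w / scale
    w_int = w_div + (torch.round(w_div) - w_div).detach()     # STE keeps scale differentiable
    return torch.clamp(w_int, qmin, qmax) * scale

class ScaledQuantLinear(nn.Module):
    """Linear layer whose weight-quantizer scale gets a learnable per-channel multiplier."""
    def __init__(self, linear: nn.Linear, bitwidth: int = 4):
        super().__init__()
        self.linear = linear
        qmax = 2 ** (bitwidth - 1) - 1
        self.qmin, self.qmax = -qmax - 1, qmax
        base = linear.weight.detach().abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        self.register_buffer("base_scale", base)
        self.log_mult = nn.Parameter(torch.zeros_like(base))  # exp(0) = 1, i.e. start unchanged

    def forward(self, x):
        w_q = fake_quant(self.linear.weight, self.base_scale * self.log_mult.exp(),
                         self.qmin, self.qmax)
        return nn.functional.linear(x, w_q, self.linear.bias)

def tune_block(fp_block, q_block, calib_inputs, n_iters=500, lr=1e-3):
    """Blockwise distillation: only the scale multipliers are trained against FP outputs."""
    params = [m.log_mult for m in q_block.modules() if isinstance(m, ScaledQuantLinear)]
    opt = torch.optim.Adam(params, lr=lr)
    for i in range(n_iters):
        x = calib_inputs[i % len(calib_inputs)]
        with torch.no_grad():
            target = fp_block(x)                               # FP "teacher" block output
        loss = nn.functional.mse_loss(q_block(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Here `q_block` is assumed to be a copy of `fp_block` whose Linear layers have been wrapped in `ScaledQuantLinear`.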
Batch norm re-estimation
Re-estimates BN statistics on a small amount of data; the re-estimated statistics are then used to adjust the quantization scale parameters of the preceding convolution or linear layers, effectively folding the BN layers.
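A minimal sketch of the re-estimation step (not the AIMET API); `model` and a `calib_loader` yielding `(inputs, labels)` batches are assumptions:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reestimate_bn_stats(model: nn.Module, calib_loader, num_batches: int = 20):
    """Recompute BN running mean/var from data while keeping every other layer frozen."""
    model.eval()
    bn_layers = [m for m in model.modules()
                 if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))]
    for bn in bn_layers:
        bn.reset_running_stats()    # discard the old statistics
        bn.momentum = None          # None => cumulative moving average over the batches seen
        bn.train()                  # only BN layers collect statistics during the passes below
    for i, (inputs, _) in enumerate(calib_loader):
        if i >= num_batches:
            break
        model(inputs)
    for bn in bn_layers:
        bn.eval()
```

The refreshed running statistics can then be folded into the preceding layers as in the batch norm folding sketch above.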
Quantized LoRA
Workflows to perform LoRA (Low-Rank Adaptation) on quantized large models.
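The core idea, sketched under simplifying assumptions (per-channel symmetric fake quantization of the frozen base weight, hypothetical rank/alpha defaults), is a frozen quantized linear layer plus trainable low-rank factors; this is not an AIMET workflow.

```python
import torch
import torch.nn as nn

class QuantLoRALinear(nn.Module):
    """Frozen (fake-)quantized base linear layer plus a trainable low-rank update."""
    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0, bitwidth: int = 4):
        super().__init__()
        qmax = 2 ** (bitwidth - 1) - 1
        scale = linear.weight.detach().abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(linear.weight.detach() / scale), -qmax - 1, qmax) * scale
        self.register_buffer("weight_q", w_q)                  # frozen quantized base weight
        bias = linear.bias.detach().clone() if linear.bias is not None else None
        self.register_buffer("bias", bias)
        self.lora_a = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(linear.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = nn.functional.linear(x, self.weight_q, self.bias)
        return base + self.scaling * (x @ self.lora_a.T) @ self.lora_b.T
```

Only `lora_a` and `lora_b` receive gradients during fine-tuning; the quantized base weight and bias stay fixed.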
OmniQuant
OmniQuant is a PTQ technique that improves the accuracy of the quantized model by introducing a learnable scale parameter into the weight quantizers and performing blockwise knowledge distillation (BKD) against the corresponding floating-point (FP) outputs.
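One distinctive ingredient described in the OmniQuant paper is learnable weight clipping; the sketch below shows that idea for a single linear layer, with illustrative names and defaults rather than the AIMET OmniQuant API. The blockwise distillation loop has the same shape as in the AdaScale sketch above, with `gamma_logit` as the trainable parameter.

```python
import torch
import torch.nn as nn

class LearnableClipQuantLinear(nn.Module):
    """Linear layer with a learnable per-channel clipping strength on the weight range."""
    def __init__(self, linear: nn.Linear, bitwidth: int = 4):
        super().__init__()
        self.linear = linear
        self.qmax = 2 ** (bitwidth - 1) - 1
        # sigmoid(gamma_logit) in (0, 1) shrinks each output channel's clipping range;
        # initialized near 1 so training starts from the plain min-max range.
        self.gamma_logit = nn.Parameter(torch.full((linear.out_features, 1), 4.0))

    def forward(self, x):
        w = self.linear.weight
        clip = torch.sigmoid(self.gamma_logit) * w.detach().abs().amax(dim=1, keepdim=True)
        scale = (clip / self.qmax).clamp(min=1e-8)
        w_div = w / scale
        w_int = w_div + (torch.round(w_div) - w_div).detach()  # straight-through rounding
        w_q = torch.clamp(w_int, -self.qmax - 1, self.qmax) * scale
        return nn.functional.linear(x, w_q, self.linear.bias)
```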
Automatic quantization
Analyzes the model, determines the best sequence of AIMET post-training quantization (PTQ) techniques, and applies these techniques.
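Conceptually, the selection loop resembles the sketch below: try candidate PTQ steps in order and keep each one only if it improves a quick evaluation metric. This is not the AIMET AutoQuant implementation; `eval_fn` and the candidate functions are placeholders (for example, the sketches earlier on this page).

```python
import copy

def auto_ptq(model, candidates, eval_fn, target_accuracy=None):
    """candidates: list of (name, fn) pairs where fn(model) returns a transformed copy."""
    best_model, best_acc = model, eval_fn(model)
    for name, apply_fn in candidates:
        trial = apply_fn(copy.deepcopy(best_model))
        acc = eval_fn(trial)
        print(f"{name}: {acc:.4f} (best so far: {best_acc:.4f})")
        if acc > best_acc:                        # keep a step only if it helps
            best_model, best_acc = trial, acc
        if target_accuracy is not None and best_acc >= target_accuracy:
            break                                 # stop once the accuracy target is met
    return best_model, best_acc
```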