Warning
This feature is under heavy development and API changes may occur without notice in future versions.
GPTVQ
Top Level API
- aimet_torch.gptvq.gptvq_weight.GPTVQ.apply_gptvq(model, dummy_input, gptvq_params, param_encoding_path, module_names_to_exclude=None, block_level_module_names=None, file_name_prefix='gptvq', config_file_path=None)
Returns a model with optimized weight rounding of GPTVQ-supported modules and saves the corresponding parameter quantization encodings to a separate JSON file that can be imported by QuantizationSimModel for inference or QAT
- Parameters:
  - model (Module) – PyTorch model to apply GPTVQ to
  - dummy_input (Union[Tensor, Tuple]) – Dummy input to the model, used to parse the model graph. If the model has more than one input, pass a tuple. The user is expected to place the tensors on the appropriate device
  - gptvq_params (GPTVQParameters) – Dataclass holding GPTVQ parameters
  - param_encoding_path (str) – Path where the parameter encodings are stored
  - module_names_to_exclude (Optional[List[str]]) – Names of modules to exclude from GPTVQ optimization
  - block_level_module_names (Optional[List[List[str]]]) – List of module-name lists on which to perform block-level GPTVQ optimization instead of leaf-module-level optimization
  - file_name_prefix (str) – Prefix to use for the filename of the encodings file
  - config_file_path (Optional[str]) – Configuration file path for model quantizers
- Return type:
Module
- Returns:
Model with GPTVQ-applied weights; the corresponding parameter encodings are saved to a JSON file at the provided path
GPTVQ Parameters
- class aimet_torch.gptvq.defs.GPTVQParameters(data_loader, forward_fn, row_axis=0, col_axis=1, rows_per_block=32, cols_per_block=256, vector_dim=2, vector_bw=8, vector_stride=1, index_bw=6, num_of_kmeans_iterations=100, assignment_chunk_size=None)
Data carrier containing GPTVQ parameters
Users should set data_loader and forward_fn, which are used for layer-wise optimization, in GPTVQParameters. All other parameters are optional; their default values are used unless explicitly set
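For illustration, the sketch below constructs GPTVQParameters with several optional fields set explicitly to their default values from the signature above. calibration_data_loader is a placeholder name for a user-provided data loader, and the inline comments paraphrase the parameter names rather than quoting official definitions.
from aimet_torch.gptvq.defs import GPTVQParameters
# calibration_data_loader is a user-provided loader of calibration batches (placeholder name)
params = GPTVQParameters(
    data_loader=calibration_data_loader,
    forward_fn=lambda model, inputs: model(inputs[0]),  # runs one batch through the model
    rows_per_block=32,             # weight rows grouped into one GPTVQ block (default)
    cols_per_block=256,            # weight columns grouped into one GPTVQ block (default)
    vector_dim=2,                  # dimensionality of each codebook vector (default)
    vector_bw=8,                   # bitwidth of codebook entries (default)
    index_bw=6,                    # bitwidth of codebook indices (default)
    num_of_kmeans_iterations=100,  # k-means iterations used to build the codebook (default)
)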
Code Example
This example shows how to use AIMET to perform GPTVQ
Load the model
For this example, we are going to load a pretrained OPT-125m model from the transformers package. Similarly, you can load any other pretrained PyTorch model instead.
from transformers import OPTForCausalLM
model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
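The GPTVQ step below also needs a calibration data loader (referred to as dataloader). As a minimal sketch, assuming a small set of tokenized text samples (the calibration text, sequence length, and batch size are placeholders, not part of the original example):
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer
# Placeholder calibration text; in practice use a representative corpus
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
texts = ["An example calibration sentence."] * 8
tokens = tokenizer(texts, return_tensors="pt", padding="max_length", truncation=True, max_length=2048)
# Each batch is a list whose first element is the input_ids tensor,
# matching the forward_fn defined below (model(inputs[0]))
dataloader = DataLoader(TensorDataset(tokens["input_ids"]), batch_size=1)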
Apply GPTVQ
We can now apply GPTVQ to this model.
import torch
from aimet_torch.gptvq.defs import GPTVQParameters
from aimet_torch.gptvq.gptvq_weight import GPTVQ
# forward_fn runs a single batch from the calibration data loader through the model
def forward_fn(model, inputs):
    return model(inputs[0])
args = GPTVQParameters(
    dataloader,
    forward_fn=forward_fn,
    num_of_kmeans_iterations=100,
)
gptvq_applied_model = GPTVQ.apply_gptvq(
    model=model,
    dummy_input=torch.zeros(1, 2048, dtype=torch.long),
    gptvq_params=args,
    param_encoding_path="./data",
    module_names_to_exclude=["lm_head"],
    file_name_prefix="gptvq_opt",
)
Note that we set the encoding path to ./data and file_name_prefix to gptvq_opt; together these produce ./data/gptvq_opt.encodings, which will be used later when setting up the QuantizationSimModel
Create the Quantization Simulation Model from GPTVQ applied model
After GPTVQ optimization, we have the gptvq_applied_model object and the corresponding encodings file from the step above. To instantiate QuantizationSimModel with this information, construct it from the GPTVQ-applied model and then load the encodings, as shown below
from aimet_common.defs import QuantScheme
from aimet_torch.v2.quantsim import QuantizationSimModel
# Reuse the same dummy input that was passed to apply_gptvq
dummy_input = torch.zeros(1, 2048, dtype=torch.long)
sim = QuantizationSimModel(
    gptvq_applied_model,
    dummy_input=dummy_input,
    quant_scheme=QuantScheme.post_training_tf,
    default_param_bw=args.vector_bw,
    default_output_bw=16,
)
sim.load_encodings("./data/gptvq_opt.encodings", allow_overwrite=False)
Compute the Quantization Encodings
To compute quantization encodings for the activations and for the parameters that were not optimized by GPTVQ, we pass calibration data through the model and then compute the quantization encodings. Encodings here refer to scale/offset quantization parameters.
def calibrate(model, data_loader):
    # compute_encodings passes args.data_loader here; run every batch so the quantizers observe activations
    with torch.no_grad():
        for batch in data_loader:
            forward_fn(model, batch)
sim.compute_encodings(calibrate, args.data_loader)
Export the model
Compared to general affine quantization, GPTVQ requires additional information such as the vector dimension and index bitwidth. As a result, a new method of exporting encodings to JSON has been developed, which reduces both the exported encodings file size and the time needed to write the exported encodings to the JSON file.
The following code snippet shows how to export encodings in the new 1.0.0 format:
from aimet_common import quantsim
# Assume 'sim' is a QuantizationSimModel object imported from aimet_torch.v2.quantsim
# Set encoding_version to 1.0.0
quantsim.encoding_version = '1.0.0'
sim.export('./data', 'exported_model', dummy_input)
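As a quick sanity check (an optional step, not part of the original flow), the exported encodings file is plain JSON and can be inspected directly. The path below assumes sim.export writes a file named after the exported_model prefix, mirroring how gptvq_opt.encodings was named earlier; adjust if your output differs.
import json
# Assumed output location: <export_path>/<filename_prefix>.encodings
with open("./data/exported_model.encodings") as f:
    encodings = json.load(f)
print(encodings.keys())  # inspect the top-level structure of the 1.0.0 format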
The 1.0.0 encoding format is supported by the Qualcomm runtime and can be used to export Per-Tensor, Per-Channel, Blockwise, LPBQ, and Vector quantizer encodings. If Vector quantizers are present in the model, the 1.0.0 format must be used when exporting encodings for the Qualcomm runtime.