Warning
This feature is under heavy development and API changes may occur without notice in future versions.
GPTVQ
Top Level API
- aimet_torch.gptvq.gptvq_weight.GPTVQ.apply_gptvq(model, dummy_input, gptvq_params, param_encoding_path, module_names_to_exclude=None, block_level_module_names=None, file_name_prefix='gptvq', config_file_path=None)
Returns a model with optimized weight rounding of GPTVQ-supported modules and saves the corresponding parameter quantization encodings to a separate JSON file that can be imported by QuantizationSimModel for inference or QAT
- Parameters:
  - model (Module) – PyTorch model to apply GPTVQ to
  - dummy_input (Union[Tensor, Tuple]) – Dummy input to the model, used to parse the model graph. If the model has more than one input, pass a tuple. The user is expected to place the tensors on the appropriate device
  - gptvq_params (GPTVQParameters) – Dataclass holding GPTVQ parameters
  - param_encoding_path (str) – Path where the parameter encodings are stored
  - module_names_to_exclude (Optional[List[str]]) – Names of modules to exclude from GPTVQ optimization
  - block_level_module_names (Optional[List[List[str]]]) – List of module-name lists on which to perform block-level GPTVQ optimization instead of leaf-module-level optimization
  - file_name_prefix (str) – Prefix to use for the filename of the encodings file
  - config_file_path (Optional[str]) – Configuration file path for model quantizers
- Return type:
Module
- Returns:
Model with GPTVQ-applied weights; the corresponding parameter encodings are saved to a JSON file at the provided path
GPTVQ Parameters
- class aimet_torch.gptvq.defs.GPTVQParameters(data_loader, forward_fn, row_axis=0, col_axis=1, rows_per_block=32, cols_per_block=256, vector_dim=2, vector_bw=8, vector_stride=1, index_bw=6, num_of_kmeans_iterations=100, assignment_chunk_size=None)
Data carrier containing GPTVQ parameters
Users should set data_loader and forward_fn, which are used for layer-wise optimization, in GPTVQParameters. All other parameters are optional; their default values are used unless explicitly set
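For illustration, the sketch below constructs GPTVQParameters with several optional fields set explicitly to their default values from the signature above. calibration_data_loader is a placeholder name for a user-provided data loader, and the inline comments paraphrase the parameter names rather than quoting official definitions.
from aimet_torch.gptvq.defs import GPTVQParameters
# calibration_data_loader is a user-provided loader of calibration batches (placeholder name)
params = GPTVQParameters(
    data_loader=calibration_data_loader,
    forward_fn=lambda model, inputs: model(inputs[0]),  # runs one batch through the model
    rows_per_block=32,             # weight rows grouped into one GPTVQ block (default)
    cols_per_block=256,            # weight columns grouped into one GPTVQ block (default)
    vector_dim=2,                  # dimensionality of each codebook vector (default)
    vector_bw=8,                   # bitwidth of codebook entries (default)
    index_bw=6,                    # bitwidth of codebook indices (default)
    num_of_kmeans_iterations=100,  # k-means iterations used to build the codebook (default)
)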
Code Example
This example shows how to use AIMET to perform GPTVQ
Load the model
For this example, we are going to load a pretrained OPT-125m model from the transformers package. Similarly, you can load any other pretrained PyTorch model instead.
from transformers import OPTForCausalLM
model = OPTForCausalLM.from_pretrained("facebook/opt-125m")
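The GPTVQ step below also needs a calibration data loader (referred to as dataloader). As a minimal sketch, assuming a small set of tokenized text samples (the calibration text, sequence length, and batch size are placeholders, not part of the original example):
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer
# Placeholder calibration text; in practice use a representative corpus
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
texts = ["An example calibration sentence."] * 8
tokens = tokenizer(texts, return_tensors="pt", padding="max_length", truncation=True, max_length=2048)
# Each batch is a list whose first element is the input_ids tensor,
# matching the forward_fn defined below (model(inputs[0]))
dataloader = DataLoader(TensorDataset(tokens["input_ids"]), batch_size=1)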
Apply GPTVQ
We can now apply GPTVQ to this model.
import torch
from aimet_torch.gptvq.defs import GPTVQParameters
from aimet_torch.gptvq.gptvq_weight import GPTVQ
# forward_fn runs a single batch from the calibration data loader through the model
def forward_fn(model, inputs):
    return model(inputs[0])
args = GPTVQParameters(
    dataloader,
    forward_fn=forward_fn,
    num_of_kmeans_iterations=100,
)
gptvq_applied_model = GPTVQ.apply_gptvq(
    model=model,
    dummy_input=torch.zeros(1, 2048, dtype=torch.long),
    gptvq_params=args,
    param_encoding_path="./data",
    module_names_to_exclude=["lm_head"],
    file_name_prefix="gptvq_opt",
)
Note that we set the encoding path to ./data and file_name_prefix to gptvq_opt; together these produce ./data/gptvq_opt.encodings, which will be used later when setting up the QuantizationSimModel
Create the Quantization Simulation Model from GPTVQ applied model
After GPTVQ optimization, we have the gptvq_applied_model object and the corresponding encodings file from the step above. To instantiate QuantizationSimModel with this information, construct it from the GPTVQ-applied model and then load the encodings, as shown below
from aimet_common.defs import QuantScheme
from aimet_torch.v2.quantsim import QuantizationSimModel
# Reuse the same dummy input that was passed to apply_gptvq
dummy_input = torch.zeros(1, 2048, dtype=torch.long)
sim = QuantizationSimModel(
    gptvq_applied_model,
    dummy_input=dummy_input,
    quant_scheme=QuantScheme.post_training_tf,
    default_param_bw=args.vector_bw,
    default_output_bw=16,
)
sim.load_encodings("./data/gptvq_opt.encodings", allow_overwrite=False)
Compute the Quantization Encodings
To compute quantization encodings for the activations and for the parameters that were not optimized by GPTVQ, we pass calibration data through the model and then compute the quantization encodings. Encodings here refer to scale/offset quantization parameters.
def calibrate(model, data_loader):
    # compute_encodings passes args.data_loader here; run every batch so the quantizers observe activations
    with torch.no_grad():
        for batch in data_loader:
            forward_fn(model, batch)
sim.compute_encodings(calibrate, args.data_loader)
Export the model
Compared to general affine quantization, GPTVQ requires additional information such as the vector dimension and index bitwidth. As a result, a new method of exporting encodings to JSON has been developed, which reduces both the exported encodings file size and the time needed to write the exported encodings to the JSON file.
The following code snippet shows how to export encodings in the new 1.0.0 format:
from aimet_common import quantsim
# Assume 'sim' is a QuantizationSimModel object imported from aimet_torch.v2.quantsim
# Set encoding_version to 1.0.0
quantsim.encoding_version = '1.0.0'
sim.export('./data', 'exported_model', dummy_input)
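As a quick sanity check (an optional step, not part of the original flow), the exported encodings file is plain JSON and can be inspected directly. The path below assumes sim.export writes a file named after the exported_model prefix, mirroring how gptvq_opt.encodings was named earlier; adjust if your output differs.
import json
# Assumed output location: <export_path>/<filename_prefix>.encodings
with open("./data/exported_model.encodings") as f:
    encodings = json.load(f)
print(encodings.keys())  # inspect the top-level structure of the 1.0.0 format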
The 1.0.0 encoding format is supported by the Qualcomm runtime and can be used to export Per-Tensor, Per-Channel, Blockwise, LPBQ, and Vector quantizer encodings. If Vector quantizers are present in the model, the 1.0.0 format must be used when exporting encodings for the Qualcomm runtime.