aimet_torch.peft
This document describes how to integrate LoRA adapters with the AIMET quantization flow. LoRA adapters make fine-tuning large models more efficient by reducing memory usage. We use the PEFT library from HuggingFace to instantiate the model and add adapters to it.
Integrating adapters with AIMET quantization provides functionality similar to that of PEFT, such as changing adapter weights and enabling or disabling adapters. In addition, the quantization parameters of the adapters can be tuned on their own to improve quantized accuracy.
User flow
The user can use the following flow to quantize a model with LoRA adapters.
Step 1: Create a PEFT model with one adapter, using the PEFT APIs from HuggingFace.
>>> from peft import LoraConfig, get_peft_model
>>> lora_config = LoraConfig(
...     lora_alpha=16,
...     lora_dropout=0.1,
...     r=4,
...     bias="none",
...     target_modules=["linear"])
>>> model = get_peft_model(model, lora_config)
Step 2: Replace the PEFT lora layers with AIMET lora layers. This allows AIMET to quantize the lora layers.
>>> from aimet_torch.peft import replace_lora_layers_with_quantizable_layers
>>> replace_lora_layers_with_quantizable_layers(model)
Step 3: Track meta data for the lora layers, such as the adapter name, the lora layer names, and the alpha parameter.
>>> from aimet_torch.peft import track_lora_meta_data
>>> meta_data = track_lora_meta_data(model, tmpdir, 'meta_data')
>>> # If the linear lora layers were replaced with ConvInplaceLinear, pass the replacement type instead:
>>> meta_data = track_lora_meta_data(model, tmpdir, 'meta_data', ConvInplaceLinear)
Step 4: Create Quantization utilities
>>> from aimet_torch.peft import PeftQuantUtils
>>> peft_utils = PeftQuantUtils(meta_data)
>>> # If using a prepared model, load the name-to-module dict that gets saved as a JSON file and pass it as well:
>>> peft_utils = PeftQuantUtils(meta_data, name_to_module_dict)
The next step is to prepare the model and create a QuantSim object (these steps are not shown here; refer to the model preparer and QuantSim documentation). Once the sim is created, peft_utils can be used to modify the quantization attributes of the lora layers in the sim.
Step 5: Disable the lora adapters. The base model encodings must be computed without the effect of the adapters, so the adapters are disabled first.
>>> peft_utils.disable_lora_adapters(sim)
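To see why zeroing the adapter weights removes their effect, recall that a lora layer adds a low-rank correction, scaled by alpha/rank, to the base layer output. The following is a minimal pure-Python sketch, not AIMET's implementation; the helper names (`matvec`, `lora_forward`) and the tiny matrices are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_w, lora_a, lora_b, alpha, rank, x):
    """Return base_w @ x + (alpha / rank) * lora_b @ (lora_a @ x)."""
    scale = alpha / rank
    base_out = matvec(base_w, x)
    adapter_out = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * a for b, a in zip(base_out, adapter_out)]


# Hypothetical 2x2 base layer with a rank-1 adapter (values are illustrative).
base_w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, 0.5]]        # shape: rank x in_features
lora_b = [[1.0], [2.0]]      # shape: out_features x rank
x = [2.0, 4.0]

enabled = lora_forward(base_w, lora_a, lora_b, alpha=2, rank=1, x=x)

# "Disabling" in this sketch: zero out both adapter matrices.
zero_a = [[0.0, 0.0]]
zero_b = [[0.0], [0.0]]
disabled = lora_forward(base_w, zero_a, zero_b, alpha=2, rank=1, x=x)
```

With the adapter matrices zeroed, `disabled` equals the plain base layer output, which is the condition under which the base model encodings are meant to be computed.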
Step 6: Compute encodings for the sim (not shown here; refer to the QuantSim documentation) and freeze the base model encodings for params. Because the base model weights are shared across adapters, their encodings do not need to be recomputed for each adapter. Freezing the base model param quantizers therefore speeds up computation.
>>> peft_utils.freeze_base_model_param_quantizers(sim)
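The idea behind freezing can be illustrated with a toy quantizer whose encoding is locked after the first calibration pass. This is only a conceptual sketch: `ToyQuantizer` and its methods are hypothetical and do not mirror AIMET's quantizer classes.

```python
class ToyQuantizer:
    """Toy min/max quantizer whose encoding can be frozen (illustrative only)."""

    def __init__(self):
        self.encoding = None   # (min, max) once calibrated
        self.frozen = False

    def compute_encoding(self, values):
        """Calibrate from observed values unless the encoding is frozen."""
        if self.frozen:
            return             # keep the existing encoding untouched
        self.encoding = (min(values), max(values))

    def freeze(self):
        """Lock the current encoding against later calibration passes."""
        if self.encoding is None:
            raise RuntimeError("compute an encoding before freezing")
        self.frozen = True


base_weight_q = ToyQuantizer()
adapter_weight_q = ToyQuantizer()

# First calibration pass (adapters disabled in the real flow).
base_weight_q.compute_encoding([-1.0, 0.25, 0.75])
base_weight_q.freeze()

# Later pass for a new adapter: only the adapter quantizer is updated.
base_weight_q.compute_encoding([-5.0, 5.0])
adapter_weight_q.compute_encoding([-0.02, 0.01])
```

After the second pass the base quantizer still holds the encoding from the first pass, while the adapter quantizer has been calibrated.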
Step 7: Export base model and encodings
>>> sim.export(tmpdir, 'model', dummy_input=dummy_inputs, export_model=True, filename_prefix_encodings='base_encodings')
Step 8: Load adapter weights for adapter 1
>>> peft_utils.enable_adapter_and_load_weights(sim, 'tmpdir/lora_weights_after_adaptation_for_adapter1.safetensor', use_safetensor=True)
Step 9: Configure lora adapter quantizers
>>> import aimet_torch.v2 as aimet  # assumed import for the `aimet` alias used below
>>> for name, lora_module in peft_utils.get_quantized_lora_layer(sim):
...     # Change bitwidth
...     lora_module.param_quantizers['weight'].bitwidth = 16
...     # Change per tensor to per channel
...     lora_module.param_quantizers['weight'] = aimet.quantization.affine.QuantizeDequantize(shape=(1, 1, 1, 1), bitwidth=16, symmetric=True).to(lora_module.weight.device)
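The difference between per-tensor and per-channel quantization comes down to how many scales are computed for a weight. The sketch below is pure Python and assumes one common symmetric convention (the largest magnitude maps to the largest signed integer level); AIMET's exact convention may differ, and `symmetric_scale` is a hypothetical helper.

```python
def symmetric_scale(values, bitwidth):
    """Scale mapping max |value| to the largest signed integer level."""
    max_level = 2 ** (bitwidth - 1) - 1
    return max(abs(v) for v in values) / max_level


# Hypothetical weight with two output channels (rows).
weight = [[0.5, -1.27], [0.02, -0.0127]]

# Per-tensor: one scale for the whole weight.
per_tensor = symmetric_scale([v for row in weight for v in row], bitwidth=8)

# Per-channel: one scale per output channel.
per_channel = [symmetric_scale(row, bitwidth=8) for row in weight]
```

Here the second channel has much smaller values, so its per-channel scale is far finer than the single per-tensor scale, which is the usual motivation for per-channel quantization.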
Step 10: Compute encodings for the model and export.
The steps for computing encodings are not shown here; refer to the Quantization simulation documentation. Note: the export directory must be the same for the base model export and all subsequent exports.
>>> sim.export(tmpdir, 'model', dummy_input=dummy_inputs, export_model=False, filename_prefix_encodings='adapter1')
>>> peft_utils.export_adapter_weights(sim, tmpdir, 'adapter1_weights')
Step 11: For another adapter with the same configuration (rank and target modules), repeat steps 8-10.
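Repeating steps 8-10 for several adapters can be wrapped in a small loop. The sketch below only uses the calls shown in the steps above; `export_adapters` and the `calibrate` callable are hypothetical helpers introduced here, and mock objects stand in for the sim and the utilities so that the snippet runs without AIMET installed.

```python
from unittest.mock import MagicMock


def export_adapters(sim, peft_utils, adapters, export_dir, dummy_inputs, calibrate):
    """Repeat steps 8-10 for each (name, weights_path) pair in `adapters`.

    `calibrate` is a user-supplied callable that computes encodings on `sim`;
    it is an assumption of this sketch, not an AIMET API.
    """
    for name, weights_path in adapters:
        # Step 8: enable the adapter and load its weights.
        peft_utils.enable_adapter_and_load_weights(sim, weights_path, use_safetensor=True)
        # Step 9 (adapter quantizer configuration) would go here if needed.
        # Step 10: compute encodings, then export encodings and adapter weights
        # into the same directory used for the base model export.
        calibrate(sim)
        sim.export(export_dir, 'model', dummy_input=dummy_inputs,
                   export_model=False, filename_prefix_encodings=name)
        peft_utils.export_adapter_weights(sim, export_dir, name + '_weights')


# Stand-in objects; in practice pass the real QuantizationSimModel,
# PeftQuantUtils instance, and calibration function.
sim, peft_utils, calibrate = MagicMock(), MagicMock(), MagicMock()
adapters = [('adapter1', 'adapter1.safetensor'), ('adapter2', 'adapter2.safetensor')]
export_adapters(sim, peft_utils, adapters, 'export_dir', None, calibrate)
```

Because the base model param quantizers were frozen in step 6, each pass through the loop is expected to recalibrate only what the new adapter affects.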
API
- class aimet_torch.peft.AdapterMetaData
Tracks meta data for lora layers: the names of the lora A and lora B layers as well as the alpha values.
Attributes: lora_A, lora_B, alpha
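As a rough picture of what the tracked meta data might look like, the sketch below builds a stand-in with the three attributes named above. The field types, the layer names, and the adapter name "default" are assumptions made for illustration; `AdapterMetaDataSketch` is not the AIMET class.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AdapterMetaDataSketch:
    """Illustrative stand-in for AdapterMetaData (field types are assumed)."""
    lora_A: List[str] = field(default_factory=list)   # names of lora A layers
    lora_B: List[str] = field(default_factory=list)   # names of lora B layers
    alpha: Optional[float] = None                     # lora alpha for the adapter


# Hypothetical result for a model with one adapter named "default".
meta_data: Dict[str, AdapterMetaDataSketch] = {
    "default": AdapterMetaDataSketch(
        lora_A=["linear.lora_A"], lora_B=["linear.lora_B"], alpha=16),
}
```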
The following API can be used to replace PEFT lora layer definitions with AIMET lora layer definitions.
- peft.replace_lora_layers_with_quantizable_layers(model)
Utility to replace lora layers with quantizable lora layers
- Parameters:
model (Module) – PEFT model
The following API can be used to track lora meta data, which is then passed to the PEFT utilities.
- peft.track_lora_meta_data(model, path, filename_prefix, replaced_module_type=None)
Utility to track and save meta data for adapters. The meta data contains the adapter names and the corresponding lora layers and alphas.
- Parameters:
model (Module) – PEFT model
path (str) – path where to store model pth and encodings
filename_prefix (str) – Prefix to use for filenames
replaced_module_type (Optional[Type[Module]]) – If the lora linear layer is replaced by another torch module, the type with which the linear layer was replaced. Otherwise pass None
- Return type:
Dict[str, AdapterMetaData]
- class aimet_torch.peft.PeftQuantUtils(adapater_name_to_meta_data, name_to_module_dict=None)
Utilities for quantizing a PEFT model
Init for Peft utilities for quantization
- Parameters:
adapater_name_to_meta_data (Dict[str, AdapterMetaData]) – Dict mapping adapter name to meta data. Output of track_lora_meta_data
name_to_module_dict – Mapping from PT module name to prepared model module name
- disable_lora_adapters(sim)
Disables the adapters' effect on the base model by zeroing out the weights for lora A and B in the model
- Parameters:
sim (QuantizationSimModel) – QuantSim model
- enable_adapter_and_load_weights(sim, adapter_weights_path, use_safetensor=True)
Enables adapter effect on base model by loading weights to model
- Parameters:
sim (QuantizationSimModel) – QuantSim model
adapter_weights_path – Path to adapter weights (either a bin file or a safetensor file)
use_safetensor (bool) – True if adapter_weights_path points to a safetensor file, False if it points to a bin file
- export_adapter_weights(sim, path, filename_prefix)
Exports adapter weights to safetensor format
- Parameters:
sim (QuantizationSimModel) – QuantSim model
path (str) – path where to store model pth and encodings
filename_prefix (str) – Prefix to use for filenames of the model pth and encodings files
- freeze_base_model(sim)
Freeze entire base model
- Parameters:
sim (QuantizationSimModel) – QuantSim model
- freeze_base_model_activation_quantizers(sim)
Freeze activation quantizers of base model
- Parameters:
sim (QuantizationSimModel) – QuantSim model
- freeze_base_model_param_quantizers(sim)
Freeze parameter quantizers of base model
- Parameters:
sim (QuantizationSimModel) – QuantSim model
- get_fp_lora_layer(model)
This function can be used to get the lora layers of a model
- Parameters:
model – FP32 model
- get_quantized_lora_layer(sim)
This function can be used to generate the quantized lora layers. Use cases:
1) New quantizers can be created and assigned to a quantized lora layer. New quantizers may be required when changing dtype, or when switching from per channel to per tensor and vice versa.
2) New values can be assigned to symmetric and bitwidth.
- Parameters:
sim (QuantizationSimModel) – QuantSim model
- quantize_lora_scale_with_fixed_range(sim, bitwidth, scale_min=0, scale_max=1e-05)
Adds an input quantizer for the lora scale (alpha/rank) and sets its min and max values
- Parameters:
sim – QuantSim model
bitwidth – Bitwidth for input quantizer to Mul/ bitwidth for scale
scale_min – min value of lora alpha to be used
scale_max – max value of lora alpha to be used
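Quantizing against a fixed range means the quantization grid is set by the given min and max rather than by calibration data. The pure-Python sketch below shows a generic affine quantize-dequantize over a fixed range; `quantize_dequantize` is a hypothetical helper and the range values are illustrative, so this is not necessarily how AIMET implements the scale quantizer.

```python
def quantize_dequantize(value, range_min, range_max, bitwidth):
    """Affine quantize-dequantize of `value` onto a fixed [range_min, range_max] grid."""
    num_steps = 2 ** bitwidth - 1
    step = (range_max - range_min) / num_steps
    clamped = min(max(value, range_min), range_max)
    level = round((clamped - range_min) / step)
    return range_min + level * step


# A value inside the range is preserved to within one step.
in_range = quantize_dequantize(0.5, range_min=0.0, range_max=1.0, bitwidth=16)

# A value outside the range is clipped to the range boundary.
clipped = quantize_dequantize(4.0, range_min=0.0, range_max=1.0, bitwidth=16)
```

In this sketch a scale value outside the fixed range is clipped, which suggests choosing scale_min and scale_max so that they cover the scale values the adapters will actually use.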
- set_bitwidth_for_lora_adapters(sim, output_bw, param_bw)
Sets output and param bitwidth for all lora adapters added to the model
- Parameters:
sim (QuantizationSimModel) – QuantSim model
output_bw (int) – Output bitwidth
param_bw (int) – Parameter bitwidth