Encoding Format Specification¶
AIMET quantization simulation determines scale/offset values for activation and parameter tensors in the model. This scale/offset information is also referred to as a 'quantization encoding'. When a model is exported using the QuantizationSimModel.export() API, an encoding file is also exported that contains the quantization encodings for the model. This encoding file can then be used by a target runtime such as Qualcomm® AI Engine Direct when running the model on-target.
The encoding file uses JSON syntax. The file format is usable with model exports from aimet_torch, aimet_tensorflow, and aimet_onnx.
1. Versioning¶
The encoding format follows an XX.YY.ZZ versioning scheme, described below:

XX = Major revision
YY = Minor revision
ZZ = Patching version

A change in the major revision indicates a substantial change to the format. An update to the minor version indicates that additional information elements have been added to the encoding format, which might require updates to fully consume the encodings. The patching version is updated to indicate minor updates to quantization simulation, e.g. bug fixes.
2. Version 0.6.1¶
2.1. Encoding specification¶
"version": "string",
"activation_encodings":
{
    <tensor_name>: [Encoding, …]
},
"param_encodings":
{
    <tensor_name>: [Encoding, …]
},
"quantizer_args":
{
    "activation_bitwidth": integer,
    "dtype": string,
    "is_symmetric": string,
    "param_bitwidth": integer,
    "per_channel_quantization": string,
    "quant_scheme": string
}
Where,

"version" is set to "0.6.1".
<tensor_name> is a string representing the tensor in the onnx or tensorflow graph.
The Encoding structure shall include an encoding field "dtype" to specify the data type used for simulating the tensor.
Encoding: {
    dtype: string
    bitwidth: integer
    is_symmetric: string
    max: float
    min: float
    offset: integer
    scale: float
}
Where,

dtype: allowed choices int, float
bitwidth: constraints >=4 and <=32
is_symmetric: allowed choices True, False

When dtype is set to 'float', Encoding shall have the following fields:
Encoding: {
    dtype: string
    bitwidth: integer
}
bitwidth defines the precision of the tensor generated by the producer and consumed by the downstream consumer(s).
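As an illustration of how these fields relate for an int encoding, the sketch below derives scale and offset from min/max using the usual asymmetric affine scheme. The formulas scale = (max - min) / (2**bitwidth - 1) and offset = round(min / scale) are the standard convention; the exact rounding behavior AIMET applies is an assumption here, not part of this specification.

```python
# Construct one 'Encoding' entry for a hypothetical 8-bit asymmetric tensor
# with observed range [-0.5, 1.0].
bw = 8
t_min, t_max = -0.5, 1.0

# Standard asymmetric affine mapping (assumed, not spec-mandated):
scale = (t_max - t_min) / (2 ** bw - 1)   # 1.5 / 255
offset = round(t_min / scale)             # -85 for this min/max

encoding = {
    "dtype": "int",
    "bitwidth": bw,
    "is_symmetric": "False",
    "max": t_max,
    "min": t_min,
    "offset": offset,
    "scale": scale,
}
```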
The quantizer_args structure describes the settings used to configure the quantization simulation model and contains useful information about how the encodings were computed. The field is auto-populated and should not require manual editing by users. It can be broken down as follows:
activation_bitwidth: Indicates the bit-width set for all activation encodings.
dtype: Indicates whether computation occurred in floating-point or integer precision.
is_symmetric: If set to True, indicates that parameter encodings were computed symmetrically.
param_bitwidth: Indicates the bit-width set for all parameter encodings.
per_channel_quantization: If set to True, quantization encodings were computed for each channel axis of the tensor.
quant_scheme: Indicates the quantization algorithm used, which may be one of post_training_tf or post_training_tf_enhanced.
The intended usage of quantizer_args is to provide debugging information for customers who may need to perform post-quantization tasks that could benefit from knowledge of how the encoding information was obtained.
3. Version 1.0.0¶
Note

Encoding version 1.0.0 is only supported for aimet_torch and aimet_onnx.

Note

The default encoding version is 0.6.1. To export with 1.0.0, add from aimet_common import quantsim and quantsim.encoding_version = '1.0.0' prior to sim.export().
Changes from 0.6.1:

Activation and parameter encodings are no longer dictionaries mapping tensor names to encoding dictionaries; they are now lists of encoding dictionaries in which the tensor name is another entry in the encoding dictionary.
Fields present in the encoding dictionary have been reworked or removed for conciseness. Refer to the table below for details on which fields are present for each encoding type. Notably, per-channel encodings are now contained in a single encoding dictionary instead of a list of encodings with length num_channels; the scale and offset fields are instead 1-D arrays of length num_channels.
Encodings for per-block quantization and Low Power Blockwise Quantization are now supported.
3.1. Encoding specification¶
Key | Value type | Description
---|---|---
version | string | Encoding file version
activation_encodings | list of Encoding dictionaries | Encodings for each activation tensor
param_encodings | list of Encoding dictionaries | Encodings for each param tensor
quantizer_args | dict | Arguments used to instantiate QuantizationSimModel (refer to the Quantizer Args structure for details)
excluded_layers | list | List of excluded layers
The table below describes the Encoding dictionary for the different quantization types: Per Tensor, Per Channel, Per Block, and Low Power Blockwise Quantization (LPBQ). Certain keys are only present for certain quantization types, as indicated in the table.
Key | Value type | Description | Per Tensor | Per Channel | Per Block | LPBQ
---|---|---|---|---|---|---
name | string | Tensor name | X | X | X | X
enc_type | string | Encoding type (refer to EncodingType for valid strings) | X | X | X | X
dtype | string | Data type (refer to DataType for valid strings) | X | X | X | X
block_size | uint32 | Block size | | | X (INT only) | X
bw | uint8 | Encoding bw (>=4 and <=32) | X | X | X | X
is_sym | bool | True if encoding is symmetric, False otherwise | X | X | X | X
scale | fp32[] | Flattened array of scales | X (INT only) | X (INT only) | X (INT only) | X
offset | int32[] | Flattened array of offsets | X (INT only) | X (INT only) | X (INT only) | X
compressed_bw | uint8 | Compressed bw | | | | X
per_block_int_scale | uint16[] | Flattened array of scales per block | | | | X
EncodingType:

Enum | Description
---|---
PER_TENSOR | Denotes Per Tensor quantization
PER_CHANNEL | Denotes Per Channel quantization
PER_BLOCK | Denotes Per Block quantization
LPBQ | Denotes LPBQ quantization
DataType:

Enum | Description
---|---
INT | Denotes integer quantization
FLOAT | Denotes floating point quantization
Quantizer Args:

Key | Value type | Description
---|---|---
activation_bitwidth | uint8 | Indicates the bit-width set for all activation encodings
dtype | string | Indicates if computation occurred in floating point or integer precision
is_symmetric | bool | If set to true, it indicates that parameter encodings were computed symmetrically
param_bitwidth | uint8 | Indicates the bit-width set for all parameter encodings
per_channel_quantization | bool | If set to True, then quantization encodings were computed for each channel axis of the tensor
quant_scheme | string | Indicates the quantization algorithm used, which may be one of post_training_tf or post_training_tf_enhanced
For Per Channel quantization, the channel axis is defined to be the output channel dimension. For Per Block quantization, the channel axis is the output channel dimension while the block axis is the input channel dimension.
For Per Tensor quantization, scales and offsets will be a 1-D array of length 1. For Per Channel quantization, the length will be the number of output channels. For Per Block quantization, the length will be number of output channels × (number of input channels / block size).
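These length rules can be summarized in a small helper for a weight tensor of shape (out_channels, in_channels). The function name and signature are illustrative only, not part of any AIMET API:

```python
def scale_length(enc_type, out_channels, in_channels=None, block_size=None):
    """Expected length of the flattened scale/offset arrays per quantization type."""
    if enc_type == "PER_TENSOR":
        return 1                      # single scale for the whole tensor
    if enc_type == "PER_CHANNEL":
        return out_channels           # one scale per output channel
    if enc_type == "PER_BLOCK":
        # block axis is the input channel dimension, so it must divide evenly
        assert in_channels % block_size == 0
        return out_channels * (in_channels // block_size)
    raise ValueError(f"unknown encoding type: {enc_type}")

print(scale_length("PER_BLOCK", out_channels=64, in_channels=32, block_size=8))  # 256
```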