Warning

This feature is under heavy development, and API changes may occur without notice in future versions.

QuantizationMixin

class aimet_torch.v2.nn.QuantizationMixin(*args, **kwargs)

Mixin that adds quantization functionality on top of regular PyTorch modules.

QuantizationMixin provides all the same behavior as FakeQuantizationMixin, and by default, a quantized module behaves exactly the same as a fake-quantized version of the same torch.nn.Module. On top of this functionality, QuantizationMixin provides the ability to set custom quantized kernels, which are called in place of the floating-point PyTorch operation in the forward pass.

input_quantizers

ModuleList containing QuantizerBase objects to be applied to the layer’s input tensors

Type

nn.ModuleList

output_quantizers

ModuleList containing QuantizerBase objects to be applied to the layer’s output tensors

Type

nn.ModuleList

param_quantizers

ModuleDict mapping parameter names to associated QuantizerBase objects

Type

nn.ModuleDict

Examples

>>> qlinear = QuantizedLinear(in_features=10, out_features=10, bias=False)
>>> print(qlinear)
QuantizedLinear(
  in_features=10, out_features=10, bias=False
  (param_quantizers): ModuleDict(
    (weight): None
  )
  (input_quantizers): ModuleList(
    (0): None
  )
  (output_quantizers): ModuleList(
    (0): None
  )
)
>>> linear = torch.nn.Linear(in_features=10, out_features=20, bias=True)
>>> qlinear = QuantizationMixin.from_module(linear)
>>> print(qlinear)
QuantizedLinear(
  in_features=10, out_features=20, bias=True
  (param_quantizers): ModuleDict(
    (weight): None
    (bias): None
  )
  (input_quantizers): ModuleList(
    (0): None
  )
  (output_quantizers): ModuleList(
    (0): None
  )
)
>>> qlinear.weight is linear.weight
True

abstract forward(*args, **kwargs)

Computes a quantized version of the parent module’s forward method.

If no custom kernel has been set for the layer or the layer is called within its compute_encodings context, this will fall back to the fake-quantized forward pass used in the equivalent FakeQuantizationMixin module.

If a custom kernel implementation is available for the layer (i.e., get_kernel() does not return None), this method will perform the following logic:

  1. Apply existing input quantizers to input tensors

  2. Apply existing parameter quantizers to the layer’s parameters

  3. Call into the kernel retrieved by get_kernel(), passing the quantized inputs and parameters as well as the output encodings from output_quantizers

  4. Dequantize the output of the kernel call
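The dispatch between the fake-quantized path and the kernel path can be sketched in plain Python. This is a minimal illustration, not the AIMET implementation: ToyQTensor, ToyQuantizer, ToyQuantizedMultiply, toy_kernel, and the in_compute_encodings flag are all hypothetical stand-ins, scalars replace tensors, and the encodings are scale-only.

```python
from dataclasses import dataclass


@dataclass
class ToyQTensor:
    """Hypothetical quantized scalar: an integer plus the scale that decodes it."""
    q: int
    scale: float

    def dequantize(self):
        return self.q * self.scale


class ToyQuantizer:
    """Hypothetical scale-only quantizer; doubles as the output encoding."""

    def __init__(self, scale):
        self.scale = scale

    def quantize(self, x):
        return ToyQTensor(round(x / self.scale), self.scale)


def toy_kernel(a, b, output_encodings=None):
    """Hypothetical integer multiply kernel that returns a quantized result."""
    real = (a.q * b.q) * (a.scale * b.scale)
    return output_encodings.quantize(real)


class ToyQuantizedMultiply:
    """Hypothetical layer computing x * weight; not an AIMET class."""

    def __init__(self, weight):
        self.weight = weight
        self.input_quantizer = ToyQuantizer(scale=0.5)
        self.param_quantizer = ToyQuantizer(scale=0.25)
        self.output_quantizer = ToyQuantizer(scale=0.125)
        self._kernel = None
        self.in_compute_encodings = False  # assumed flag for this sketch

    def get_kernel(self):
        return self._kernel

    def fake_quantized_forward(self, x):
        # Quantize-dequantize every tensor, then run the floating-point op.
        x_qdq = self.input_quantizer.quantize(x).dequantize()
        w_qdq = self.param_quantizer.quantize(self.weight).dequantize()
        return self.output_quantizer.quantize(x_qdq * w_qdq).dequantize()

    def forward(self, x):
        kernel = self.get_kernel()
        if kernel is None or self.in_compute_encodings:
            return self.fake_quantized_forward(x)  # fallback path
        q_x = self.input_quantizer.quantize(x)             # step 1
        q_w = self.param_quantizer.quantize(self.weight)   # step 2
        q_out = kernel(q_x, q_w, output_encodings=self.output_quantizer)  # step 3
        return q_out.dequantize()                          # step 4
```

With power-of-two scales, as chosen here, both paths produce the same value, which makes the fallback easy to check by hand.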

__quant_init__()

Initializer for quantized module. This method will be invoked right after __init__().

This method initializes the input_quantizers, output_quantizers, and param_quantizers structures to the appropriate sizes based on the number of input tensors, output tensors, and parameters of the base nn.Module class. All quantizers are initialized to None.

For custom quantized classes, this method should be overridden to set the appropriate lengths of input_quantizers and output_quantizers for the given base class.
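The override pattern can be illustrated with a small stand-in that does not depend on AIMET or PyTorch. ToyQuantMixin, ToyAdd, ToyQuantizedAdd, and param_names() are hypothetical names; plain lists and dicts stand in for ModuleList and ModuleDict, and the one-input, one-output default is an assumption of this sketch.

```python
class ToyQuantMixin:
    """Hypothetical stand-in for the mixin; not the AIMET implementation."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__quant_init__()  # invoked right after the base __init__()

    def __quant_init__(self):
        # Assumed default for this sketch: one input and one output tensor.
        self.input_quantizers = [None]
        self.output_quantizers = [None]
        self.param_quantizers = {name: None for name in self.param_names()}


class ToyAdd:
    """Hypothetical base module with two input tensors and no parameters."""

    def param_names(self):
        return []

    def forward(self, a, b):
        return a + b


class ToyQuantizedAdd(ToyQuantMixin, ToyAdd):
    def __quant_init__(self):
        super().__quant_init__()
        # The base op takes two inputs, so it needs two input quantizer slots.
        self.input_quantizers = [None, None]
        self.output_quantizers = [None]


qadd = ToyQuantizedAdd()
```

After construction, qadd has two empty input quantizer slots and one empty output quantizer slot, mirroring how a custom quantized class would size its containers.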

set_kernel(kernel)

Set kernel for this instance of quantized module.

The function signature of this kernel must match the signature used in the forward() method. In general, this signature will follow the signature of the equivalent torch.nn.functional function, but should return a QuantizedTensor object and take in the additional keyword argument output_encodings.

Once set, the layer will call into kernel in the forward pass unless within the compute_encodings() context.
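To make the kernel contract concrete, the sketch below works one multiplication through by hand using stand-in types. ToyAffineEncoding, ToyQuantizedTensor, and toy_int_multiply are hypothetical; the assumed convention is that a real value equals (q + offset) * scale, and scalars replace tensors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAffineEncoding:
    """Hypothetical affine encoding: real = (q + offset) * scale."""
    scale: float
    offset: int
    mapping: str = "affine"

    def quantize(self, x):
        return round(x / self.scale) - self.offset

    def dequantize(self, q):
        return (q + self.offset) * self.scale


@dataclass(frozen=True)
class ToyQuantizedTensor:
    """Hypothetical quantized scalar carrying its own encoding."""
    q: int
    encoding: ToyAffineEncoding

    def quantized_repr(self):
        return self.q

    def dequantize(self):
        return self.encoding.dequantize(self.q)


def toy_int_multiply(a, b, output_encodings=None):
    """Same positional arguments as a functional multiply, plus output_encodings."""
    if output_encodings is None:
        raise ValueError("this sketch requires output_encodings")
    encodings = [a.encoding, b.encoding, output_encodings]
    if not all(enc.mapping == "affine" for enc in encodings):
        raise NotImplementedError
    # Multiply the offset-corrected integers, then rescale once.
    int_product = (a.quantized_repr() + a.encoding.offset) * (
        b.quantized_repr() + b.encoding.offset
    )
    real_product = int_product * (a.encoding.scale * b.encoding.scale)
    # Return a quantized object expressed in the output encoding.
    return ToyQuantizedTensor(output_encodings.quantize(real_product), output_encodings)


a = ToyQuantizedTensor(6, ToyAffineEncoding(scale=0.5, offset=-2))   # 2.0
b = ToyQuantizedTensor(12, ToyAffineEncoding(scale=0.25, offset=0))  # 3.0
out = toy_int_multiply(a, b, output_encodings=ToyAffineEncoding(scale=0.5, offset=3))
```

Here the integer product is (6 - 2) * 12 = 48, the combined scale is 0.125, and the real-valued result 6.0 is stored as the integer 9 under the output encoding.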

Parameters

kernel – Callable object to be used as the underlying kernel.

Example

>>> from aimet_torch.v2 import quantization as Q
>>> def int_multiply(a, b, output_encodings=None):
...     encodings = [a.encoding, b.encoding, output_encodings]
...     if not all(enc.mapping == "affine" for enc in encodings):
...         raise NotImplementedError
...     q_output = (a.quantized_repr() + a.encoding.offset) * (b.quantized_repr() + b.encoding.offset)
...     dq_output = q_output * (a.encoding.scale * b.encoding.scale)
...     return Q.QuantizedTensor(output_encodings.quantize(dq_output), encoding=output_encodings)
...
>>> qmult = QuantizedMultiply()
>>> qmult.set_kernel(int_multiply)

classmethod set_default_kernel(kernel)

Set default kernel for the class.

The function signature of this kernel must match the signature used in the forward() method. In general, this signature will follow the signature of the equivalent torch.nn.functional function, but should return a QuantizedTensor object and take in the additional keyword argument output_encodings.

Once set, all instances of cls will call into kernel in the forward pass unless:

  1. The instance is within the compute_encodings() context, or

  2. The kernel has been overridden by a set_kernel() call
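The precedence rules above can be sketched with a small stand-in class. ToyKernelMixin, ToyMultiply, uses_kernel(), and the _in_compute_encodings flag are hypothetical and only model the lookup order; they are not how AIMET stores kernels internally.

```python
class ToyKernelMixin:
    """Hypothetical model of kernel lookup; not the AIMET implementation."""

    _default_kernel = None  # class-wide default, shared by all instances

    def __init__(self):
        self._kernel = None                 # per-instance override
        self._in_compute_encodings = False  # assumed flag for this sketch

    @classmethod
    def set_default_kernel(cls, kernel):
        cls._default_kernel = kernel

    @classmethod
    def get_default_kernel(cls):
        return cls._default_kernel

    def set_kernel(self, kernel):
        self._kernel = kernel

    def get_kernel(self):
        # An instance-level kernel wins; otherwise fall back to the class default.
        if self._kernel is not None:
            return self._kernel
        return self.get_default_kernel()

    def uses_kernel(self):
        # No kernel is called while encodings are being computed.
        return self.get_kernel() is not None and not self._in_compute_encodings


class ToyMultiply(ToyKernelMixin):
    pass


def default_kernel(a, b, output_encodings=None):
    return "default"


def custom_kernel(a, b, output_encodings=None):
    return "custom"


first, second = ToyMultiply(), ToyMultiply()
ToyMultiply.set_default_kernel(default_kernel)
second.set_kernel(custom_kernel)  # overrides the default for this instance only
```

In this sketch, first resolves to default_kernel and second resolves to custom_kernel, and setting the calibration flag on either instance disables the kernel path for it.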

Parameters

kernel – Callable object to be used as the default kernel by all the instances of this class.

Example

>>> from aimet_torch.v2 import quantization as Q
>>> def int_multiply(a, b, output_encodings=None):
...     encodings = [a.encoding, b.encoding, output_encodings]
...     if not all(enc.mapping == "affine" for enc in encodings):
...         raise NotImplementedError
...     q_output = (a.quantized_repr() + a.encoding.offset) * (b.quantized_repr() + b.encoding.offset)
...     dq_output = q_output * (a.encoding.scale * b.encoding.scale)
...     return Q.QuantizedTensor(output_encodings.quantize(dq_output), encoding=output_encodings)
...
>>> QuantizedMultiply.set_default_kernel(int_multiply)
>>> qmult = QuantizedMultiply()
>>> qmult.get_kernel()
<function int_multiply at ...>

compute_encodings()

Enters the compute_encodings() context for all QuantizerBase objects in the layer.

Inside this context, each quantizer will observe all inputs passed to the quantizer and will compute quantization encodings upon exiting the context.
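The observe-then-compute behavior can be sketched with a context manager. ToyMinMaxQuantizer and compute_encodings_for are hypothetical names; the min/max calibration rule and the (q + offset) * scale convention are assumptions of this sketch, and lists of floats stand in for tensors.

```python
from contextlib import ExitStack, contextmanager


class ToyMinMaxQuantizer:
    """Hypothetical min/max observer; a stand-in, not an AIMET quantizer."""

    def __init__(self, bitwidth=8):
        self.bitwidth = bitwidth
        self.scale = None
        self.offset = None
        self._observed = None  # a list while inside compute_encodings()

    def is_initialized(self):
        return self.scale is not None

    @contextmanager
    def compute_encodings(self):
        self._observed = []
        try:
            yield
            # Encodings are computed only when the context exits normally.
            if self._observed:
                self._set_range(min(self._observed), max(self._observed))
        finally:
            self._observed = None

    def _set_range(self, lo, hi):
        lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep zero representable
        if hi == lo:
            return
        self.scale = (hi - lo) / (2 ** self.bitwidth - 1)
        self.offset = round(lo / self.scale)

    def __call__(self, values):
        if self._observed is not None:
            self._observed.extend(values)  # observe and pass through unchanged
            return list(values)
        if not self.is_initialized():
            return list(values)
        q_max = 2 ** self.bitwidth - 1
        out = []
        for v in values:
            q = min(max(round(v / self.scale) - self.offset, 0), q_max)
            out.append((q + self.offset) * self.scale)
        return out


@contextmanager
def compute_encodings_for(quantizers):
    """Enter compute_encodings() for every non-None quantizer of a layer."""
    with ExitStack() as stack:
        for quantizer in quantizers:
            if quantizer is not None:
                stack.enter_context(quantizer.compute_encodings())
        yield


output_quantizer = ToyMinMaxQuantizer(bitwidth=8)
with compute_encodings_for([None, output_quantizer]):
    output_quantizer([-1.0, 0.5, 3.0])
```

ExitStack is used so that a layer with any number of quantizers, including empty None slots, can enter and leave all contexts together.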

Example

>>> qlinear = QuantizedLinear(10, 10)
>>> qlinear.output_quantizers[0] = Quantize((1, ), 8, symmetric=False)
>>> with qlinear.compute_encodings():
...     _ = qlinear(torch.randn(16, 10))
>>> print(qlinear.output_quantizers[0].is_initialized())
True

classmethod from_module(module)

Create an instance of quantized module from a regular module instance.

The resulting quantized module contains the same attributes and parameters as the original module, but may be assigned input, output and parameter quantizers.

Parameters

module (Module) – Floating point module to quantize

Returns

Quantized version of the original module

Example

>>> linear = torch.nn.Linear(10, 10)
>>> quantized_linear = QuantizationMixin.from_module(linear)
>>> print(quantized_linear.weight is linear.weight)
True
>>> print(quantized_linear.param_quantizers)
ModuleDict(
    (weight): None
    (bias): None
)

classmethod get_default_kernel()

Return the default kernel of the class.

Return type

Optional[Callable]

Returns

Default kernel of the class. None if the default kernel is not set.

get_kernel()

Return the kernel to be used by this instance of quantized module.

If the current instance does not have any kernel set, it will retrieve the default kernel of the class.

Return type

Optional[Callable]

Returns

The kernel to be used by this instance.

classmethod implements(module_cls)

Decorator for registering a quantized implementation of the given base class.
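One way such a registration decorator could work is sketched below with stand-in classes. ToyMixin, ToyLinear, ToyQuantizedLinear, and the _registry dict are hypothetical; the sketch assumes the decorator records a mapping from the base class to its quantized counterpart, which a from_module-style constructor can then look up.

```python
class ToyMixin:
    """Hypothetical registry-based mixin; not the AIMET implementation."""

    _registry = {}  # maps base class -> registered quantized class

    @classmethod
    def implements(cls, module_cls):
        def wrapper(quantized_cls):
            cls._registry[module_cls] = quantized_cls
            return quantized_cls  # the decorated class is returned unchanged
        return wrapper

    @classmethod
    def from_module(cls, module):
        quantized_cls = cls._registry[type(module)]
        # Build the quantized instance without re-running __init__ and
        # share the original attributes (including parameters) by reference.
        quantized = quantized_cls.__new__(quantized_cls)
        quantized.__dict__.update(module.__dict__)
        return quantized


class ToyLinear:
    """Hypothetical floating-point module."""

    def __init__(self, weight):
        self.weight = weight


@ToyMixin.implements(ToyLinear)
class ToyQuantizedLinear(ToyMixin, ToyLinear):
    pass


linear = ToyLinear(weight=[1.0, 2.0])
quantized_linear = ToyMixin.from_module(linear)
```

In this sketch quantized_linear is a ToyQuantizedLinear whose weight is the same object as linear.weight, matching the parameter sharing shown in the from_module example.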