Quantization-aware training

This notebook contains a working example of AIMET Quantization-aware training (QAT). QAT is an AIMET feature that adds quantization simulation operations (also called fake quantization ops) to a trained ML model. A standard training pipeline is then used to train or fine-tune the model. The resulting model should show improved accuracy on quantized ML accelerators.

The quantization parameters (like encoding min/max, scale, and offset) for activations are computed once. During fine-tuning, the model weights are updated to minimize the effects of quantization in the forward pass, keeping the quantization parameters constant.
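
As a rough illustration of what a fake quantization op computes, the following minimal sketch derives a scale and offset from an encoding min/max, then quantizes and immediately dequantizes a value. The helper names (`compute_encoding`, `fake_quantize`) and the asymmetric integer grid are assumptions made for this illustration; they are not AIMET's internal implementation.

```python
def compute_encoding(enc_min, enc_max, bitwidth=8):
    """Return (scale, offset) for an asymmetric integer grid covering [enc_min, enc_max]."""
    # Widen the range so that 0.0 is always exactly representable.
    enc_min = min(enc_min, 0.0)
    enc_max = max(enc_max, 0.0)
    num_steps = 2 ** bitwidth - 1
    scale = (enc_max - enc_min) / num_steps
    if scale == 0.0:
        scale = 1.0  # guard against a degenerate all-zero range
    offset = round(enc_min / scale)  # integer grid index of enc_min (<= 0)
    return scale, offset


def fake_quantize(x, scale, offset, bitwidth=8):
    """Quantize x to the integer grid, clip it, and map it back to floating point."""
    num_steps = 2 ** bitwidth - 1
    q = round(x / scale) - offset      # integer in [0, num_steps] after clipping
    q = max(0, min(num_steps, q))
    return (q + offset) * scale        # dequantized value, still a float


scale, offset = compute_encoding(-1.0, 1.0)
print(fake_quantize(0.3, scale, offset))  # close to 0.3, snapped to the 8-bit grid
print(fake_quantize(5.0, scale, offset))  # clipped to roughly the encoding max
```

The output stays in floating point, so a standard training loop can run on the model while the forward pass still sees quantization error.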

Overall flow

The example follows these high-level steps:

  1. Instantiate the example evaluation and training datasets

  2. Load the FP32 model and evaluate the model to find the baseline FP32 accuracy

  3. Create a quantization simulation model (with fake quantization ops) and evaluate the quantized simulation model

  4. Fine-tune the quantization simulation model and evaluate the fine-tuned simulation model, which should reflect the accuracy on a quantized ML platform

Note

This notebook does not show state-of-the-art results. For example, it uses a relatively quantization-friendly model (ResNet50). Also, some optimization parameters, like the number of fine-tuning epochs, are chosen to improve execution speed in the notebook.


Dataset

This example does image classification on the ImageNet dataset. If you already have a version of the dataset, use that. Otherwise download the dataset, for example from https://image-net.org/challenges/LSVRC/2012/index.

Note

To speed up the execution of this notebook, you can use a reduced subset of the ImageNet dataset. For example, the full ILSVRC2012 dataset has 1000 classes, with roughly 1000 training samples and 50 validation samples per class. To run this notebook, you can reduce the dataset to, say, two samples per class.
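
One way to build such a reduced subset is sketched below. It assumes the dataset directory contains one subfolder per class, which is the layout that inferred labels rely on. The helper name `make_imagenet_subset` and its parameters are hypothetical and are not part of AIMET or Keras.

```python
import os
import shutil


def make_imagenet_subset(src_dir, dst_dir, samples_per_class=2):
    """Copy the first `samples_per_class` files of every class folder from src_dir to dst_dir."""
    for class_name in sorted(os.listdir(src_dir)):
        class_src = os.path.join(src_dir, class_name)
        if not os.path.isdir(class_src):
            continue  # skip stray files that are not class folders
        class_dst = os.path.join(dst_dir, class_name)
        os.makedirs(class_dst, exist_ok=True)
        # Sorting makes the selection reproducible across runs.
        for file_name in sorted(os.listdir(class_src))[:samples_per_class]:
            shutil.copy2(os.path.join(class_src, file_name),
                         os.path.join(class_dst, file_name))
```

Call the helper once for the `train` folder and once for the `val` folder, then point DATASET_DIR at the directory that holds the reduced copies.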

Edit the cell below to specify the directory where the downloaded ImageNet dataset is saved.

[ ]:
DATASET_DIR = '/path/to/imagenet_dir'        # Replace this path with a real directory
BATCH_SIZE = 128
IMAGE_SIZE = (224, 224)

1. Instantiate the example evaluation and training datasets

Assign the training and validation datasets to dataset_train and dataset_valid respectively.

[ ]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf

dataset_train = tf.keras.preprocessing.image_dataset_from_directory(
    directory=os.path.join(DATASET_DIR, "train"),
    labels="inferred",
    label_mode="categorical",
    batch_size=BATCH_SIZE,
    shuffle=True,
    image_size=IMAGE_SIZE
)
dataset_valid = tf.keras.preprocessing.image_dataset_from_directory(
    directory=os.path.join(DATASET_DIR, "val"),
    labels="inferred",
    label_mode="categorical",
    batch_size=BATCH_SIZE,
    shuffle=False,
    image_size=IMAGE_SIZE
)

2. Load the model and evaluate to get a baseline FP32 accuracy score

2.1 Load a pretrained ResNet50 model from Keras.

You can load any pretrained Keras model instead.

[ ]:
from tensorflow.keras.applications.resnet import ResNet50

model = ResNet50(weights="imagenet")
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

2.2 Compute the floating point 32-bit (FP32) accuracy of this model using the evaluate() routine.

[ ]:
model.evaluate(dataset_valid)

3. Create a quantization simulation model and determine quantized accuracy

Fold Batch Normalization layers

Before calculating the simulated quantized accuracy using QuantizationSimModel, fold the BatchNormalization (BN) layers into adjacent Convolutional layers. The BN layers that cannot be folded are left as they are.

BN folding improves inference performance on quantized runtimes but can degrade accuracy on these platforms. This step simulates this on-target drop in accuracy.

3.1 Use the following code to call AIMET to fold the BN layers on the model.

Note

Folding returns a new model. Use the returned model for the rest of the pipeline.

[ ]:
from aimet_tensorflow.keras.batch_norm_fold import fold_all_batch_norms

_, model = fold_all_batch_norms(model)
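
To see why folding leaves the FP32 output unchanged, the following minimal sketch folds a batch normalization step into a single scalar weight and bias. It covers one channel with made-up values, and the helper name `fold_bn_into_linear` is hypothetical; it is not how AIMET implements folding.

```python
import math


def fold_bn_into_linear(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta into a single (w', b')."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta


# Made-up per-channel values, for illustration only.
w, b = 0.8, 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.05, 0.3
eps = 1e-5

x = 2.0
# Layer followed by batch normalization, computed as two separate steps.
unfolded = gamma * ((w * x + b) - mean) / math.sqrt(var + eps) + beta
# The same computation with the BN statistics folded into the weight and bias.
w_folded, b_folded = fold_bn_into_linear(w, b, gamma, beta, mean, var, eps)
folded = w_folded * x + b_folded

print(abs(unfolded - folded))  # differs only by floating-point rounding
```

The folded weights span a different range than the original weights, which is one reason quantized accuracy can change even though the FP32 output does not.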

Create the Quantization Sim Model

3.2 Use AIMET to create a QuantizationSimModel.

In this step, AIMET inserts fake quantization ops in the model graph and configures them.

Key parameters:

  • Setting default_output_bw to 8 performs all activation quantizations in the model using integer 8-bit precision

  • Setting default_param_bw to 8 performs all parameter quantizations in the model using integer 8-bit precision

See QuantizationSimModel in the AIMET API documentation for a full explanation of the parameters.

[ ]:
from aimet_tensorflow.keras.quantsim import QuantizationSimModel
from aimet_common.defs import QuantScheme

sim = QuantizationSimModel(model=model,
                           quant_scheme=QuantScheme.post_training_tf,
                           rounding_mode="nearest",
                           default_output_bw=8,
                           default_param_bw=8)

AIMET has added quantizer nodes to the model graph. Before the sim model can be used for inference or training, scale and offset quantization parameters must be calculated for each quantizer node. This is done by passing unlabeled data samples through the model to collect range statistics. This process is sometimes referred to as calibration. AIMET refers to it as “computing encodings”.
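
The idea of collecting range statistics can be sketched with a simple min/max observer, shown below. The class name `RangeObserver` and the plain min/max rule are assumptions for illustration; AIMET's quantization schemes may derive ranges differently.

```python
class RangeObserver:
    """Tracks the running min and max of every value it sees."""

    def __init__(self):
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, batch):
        # Widen the observed range with the values of one calibration batch.
        self.min = min(self.min, min(batch))
        self.max = max(self.max, max(batch))

    def encoding(self, bitwidth=8):
        """Turn the observed range into (scale, offset) for an integer grid."""
        lo = min(self.min, 0.0)  # keep 0.0 inside the range
        hi = max(self.max, 0.0)
        scale = (hi - lo) / (2 ** bitwidth - 1)
        if scale == 0.0:
            scale = 1.0  # guard against a degenerate all-zero range
        offset = round(lo / scale)
        return scale, offset


# Made-up activation values standing in for two calibration batches.
observer = RangeObserver()
for batch in ([-0.4, 0.1, 2.0], [0.3, -1.2, 0.9]):
    observer.update(batch)

print(observer.min, observer.max)
print(observer.encoding())
```

No labels are involved: only the values that flow through each quantizer matter.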

3.3 Create a routine to pass unlabeled data samples through the model.

The following code is one way to write a routine that passes unlabeled samples through the model to compute encodings. It uses the existing train or validation data loader to extract samples and pass them to the model. Since there is no need to compute loss metrics, it ignores the model output.

[ ]:
from tensorflow.keras.utils import Progbar
from tensorflow.keras.applications.resnet import preprocess_input

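def pass_calibration_data(sim_model, samples):
    """Run about `samples` unlabeled images through sim_model to collect range statistics."""
    dataset = dataset_valid

    progbar = Progbar(samples)

    batch_cntr = 0
    for inputs, _ in dataset:
        # Labels are ignored; only the forward pass matters for computing encodings.
        sim_model(preprocess_input(inputs))

        batch_cntr += 1
        # Cap the progress value so the bar never overshoots its target.
        progbar.update(min(batch_cntr * BATCH_SIZE, samples))
        # Stop once at least `samples` images have been processed.
        if batch_cntr * BATCH_SIZE >= samples:
            break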

A few notes regarding the data samples:

  • Only a very small fraction of the dataset is needed. For example, the training dataset for ImageNet has over 1M samples; 500 to 1000 suffice to compute encodings.

  • The samples should be reasonably well distributed. While it’s not necessary to cover all classes, avoid extreme scenarios like using only dark or only light samples. For example, using only pictures captured at night could skew the results.
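
One possible way to keep a small calibration set reasonably well distributed is to cycle over the classes when picking files, as in the sketch below. The helper name `pick_calibration_files` and the input format (a dict that maps class names to file lists) are hypothetical.

```python
import random


def pick_calibration_files(files_by_class, num_samples, seed=0):
    """Pick up to num_samples files, cycling over classes so that no class dominates."""
    rng = random.Random(seed)
    # Shuffle each class independently so the pick is not biased by file order.
    pools = {name: rng.sample(files, len(files))
             for name, files in files_by_class.items()}
    picked = []
    while len(picked) < num_samples and any(pools.values()):
        for name in sorted(pools):
            if pools[name] and len(picked) < num_samples:
                picked.append(pools[name].pop())
    return picked


# Made-up file names, for illustration only.
files_by_class = {
    "cat": ["c0.jpg", "c1.jpg", "c2.jpg"],
    "dog": ["d0.jpg", "d1.jpg", "d2.jpg"],
    "owl": ["o0.jpg"],
}
print(pick_calibration_files(files_by_class, 5))
```

Balancing by class does not guarantee varied lighting or scenery, so it is still worth spot-checking the selected images.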


3.4 Call AIMET to pass data through the model and compute the quantization encodings.

Encodings here refer to scale and offset quantization parameters.

[ ]:
sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=1000)

3.5 Compile the model.

[ ]:
sim.model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

The QuantizationSim model is now ready to be used for inference or training.

3.6 Pass the model to the evaluation routine to calculate a simulated quantized accuracy score.

[ ]:
sim.model.evaluate(dataset_valid)

4. Perform QAT

4.1 To perform quantization-aware training (QAT), train the model for a few more epochs (typically 15-20).

As with any training job, hyper-parameters need to be searched for optimal results. Good starting points are to use a learning rate on the same order as the ending learning rate when training the original model, and to drop the learning rate by a factor of 10 every 5 epochs or so.
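
The step-wise learning rate drop described above can be sketched as a small schedule function. The starting value of 1e-5 is a placeholder, not a recommendation from AIMET; substitute the ending learning rate of your original training run.

```python
def step_decay(epoch, initial_lr=1e-5, drop_factor=0.1, epochs_per_drop=5):
    """Return the learning rate for a zero-based epoch, dropping 10x every 5 epochs by default."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


for epoch in (0, 4, 5, 10):
    print(epoch, step_decay(epoch))

# With Keras, the function can be wrapped in a scheduler callback, for example:
#   tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: step_decay(epoch))
# and passed to fit() through the callbacks argument.
```

The lambda wrapper matters: the Keras callback also passes the current learning rate as a second argument, which would otherwise be bound to `initial_lr`.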

This example trains for only 1 epoch, but you can experiment with the parameters however you like.

[ ]:
quantized_callback = tf.keras.callbacks.TensorBoard(log_dir="./log/quantized")
history = sim.model.fit(dataset_train, epochs=1, validation_data=dataset_valid, callbacks=[quantized_callback])

4.2 After QAT finishes, run quantization simulation inference against the validation dataset to see improvements in accuracy.

[ ]:
sim.model.evaluate(dataset_valid)

Of course, there might be little gain in accuracy after only one epoch of training. Experiment with the hyper-parameters to get better results.

Next steps

The next step is to export this model for installation on the target.

Export the model and encodings.

  • Export the model with the updated weights but without the fake quant ops.

  • Export the encodings (scale and offset quantization parameters). AIMET QuantizationSimModel provides an export API for this purpose.

The following code performs these exports.

[ ]:
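# Recompute encodings against the fine-tuned weights before exporting.
sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=1000)
# Make sure the output directory exists before writing the exported files.
os.makedirs('./data', exist_ok=True)
sim.export('./data', 'model_after_qat')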

For more information

See the AIMET API docs for details about the AIMET APIs and optional parameters.

See the other example notebooks to learn how to use QAT with range-learning and other AIMET post-training quantization techniques.