Cross-Layer Equalization (CLE) and Bias Correction (BC)¶
This notebook showcases a working code example of how to use AIMET to apply Cross-Layer Equalization (CLE) and Bias Correction (BC). CLE and BC are post-training quantization techniques that aim to improve the quantized accuracy of a given model. CLE does not need any data samples. BC may optionally need unlabelled data samples. These techniques help recover quantized accuracy when the model is sensitive to parameter quantization rather than activation quantization.
To learn more about these techniques, please refer to the “Data-Free Quantization Through Weight Equalization and Bias Correction” paper from ICCV 2019 - https://arxiv.org/abs/1906.04721
Cross-Layer Equalization
AIMET performs the following steps when running CLE:
1. Batch Norm Folding: Folds BN layers into the Conv layers immediately before or after them.
2. Cross-Layer Scaling: Given a set of consecutive Conv layers, equalizes the range of tensor values per channel by scaling the per-channel weight tensor values of a layer up/down and scaling the corresponding per-channel weight tensor values of the subsequent layer down/up.
3. High Bias Folding: Cross-layer scaling may result in high bias parameter values for some layers. This technique folds some of the bias of a layer into the subsequent layer’s parameters.
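The cross-layer scaling step can be illustrated with a minimal NumPy sketch. It uses two dense layers with a ReLU in between instead of Conv layers, and the scale formula follows the paper referenced above; the function names are hypothetical and this is not AIMET's implementation.

```python
import numpy as np

def cross_layer_scale(w1, b1, w2):
    """Equalize per-channel ranges of two consecutive layers.

    w1: (out1, in1) weights, b1: (out1,) bias, w2: (out2, out1) weights.
    """
    r1 = np.abs(w1).max(axis=1)      # range of each output channel of layer 1
    r2 = np.abs(w2).max(axis=0)      # range of the matching input channel of layer 2
    s = np.sqrt(r1 * r2) / r2        # after scaling, both ranges equal sqrt(r1 * r2)
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

def forward(x, w1, b1, w2):
    # Two linear layers with a ReLU in between (layer 2 bias omitted for brevity).
    return w2 @ np.maximum(w1 @ x + b1, 0.0)

rng = np.random.default_rng(0)
# Layer 1 has output channels whose magnitudes differ by orders of magnitude.
w1 = rng.normal(size=(4, 3)) * np.array([0.01, 0.1, 1.0, 10.0])[:, None]
b1 = rng.normal(size=4)
w2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

w1_eq, b1_eq, w2_eq = cross_layer_scale(w1, b1, w2)
ranges_l1 = np.abs(w1_eq).max(axis=1)
ranges_l2 = np.abs(w2_eq).max(axis=0)
```

Because ReLU is positively homogeneous, dividing a channel by a positive scale in the first layer and multiplying it back in the second leaves the FP32 output unchanged, while the per-channel ranges of the two layers become equal.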
Overall flow¶
This notebook covers the following:
1. Instantiate the example evaluation and training pipeline
2. Load the FP32 model and evaluate the model to find the baseline FP32 accuracy
3. Create a quantization simulation model (with fake quantization ops inserted) and evaluate this simulation model to get a quantized accuracy score
4. Apply CLE and BC, and evaluate the simulation model to get the quantized accuracy score after each technique
What this notebook is not¶
- This notebook is not designed to show state-of-the-art results. For example, it uses a relatively quantization-friendly model like ResNet50. Also, some optimization parameters are deliberately chosen to have the notebook execute more quickly. 
Dataset¶
This notebook relies on the ImageNet dataset for the task of image classification. If you already have a version of the dataset readily available, please use that. Otherwise, please download the dataset from an appropriate location (e.g. https://image-net.org/challenges/LSVRC/2012/index.php#) and convert it into tfrecords.
Note1: The ImageNet tfrecords dataset typically has the following characteristics, and the dataloader provided in this example notebook relies on them:
- A folder containing tfrecords files starting with ‘train*’ for training files and ‘valid*’ for validation files.
- Each tfrecord file should have the features ‘image/encoded’ for image data and ‘image/class/label’ for its corresponding class.
Note2: To speed up the execution of this notebook, you may use a reduced subset of the ImageNet dataset. E.g. the entire ILSVRC2012 dataset has 1000 classes, 1000 training samples per class and 50 validation samples per class. For the purpose of running this notebook, you could reduce the dataset to, say, 2 samples per class and then convert it into tfrecords. This exercise is left up to the reader and is not necessary.
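One way to build such a reduced subset is to keep only the first few files of every class before converting to tfrecords. The helper below is a hypothetical sketch in plain Python; the file names are made up and the tfrecord conversion itself is not shown.

```python
from collections import defaultdict

def subsample_per_class(samples, per_class=2):
    """Keep at most `per_class` items for every label, preserving order.

    `samples` is an iterable of (file_path, label) pairs.
    """
    kept = []
    seen = defaultdict(int)
    for path, label in samples:
        if seen[label] < per_class:
            kept.append((path, label))
            seen[label] += 1
    return kept

# Hypothetical file names, for illustration only.
samples = [("cat_0.jpg", 0), ("cat_1.jpg", 0), ("cat_2.jpg", 0),
           ("dog_0.jpg", 1), ("dog_1.jpg", 1), ("dog_2.jpg", 1)]
subset = subsample_per_class(samples, per_class=2)
```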
Edit the cell below and specify the directory where the downloaded ImageNet dataset is saved.
[ ]:
TFRECORDS_DIR = '/path/to/tfrecords/dir/'        # Please replace this with a real directory
We disable logs at the INFO level and disable eager execution. We set verbosity to the level as displayed (ERROR), so TensorFlow will display all messages that have the label ERROR (or more critical).
[ ]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()
tf.logging.set_verbosity(tf.logging.ERROR)
1. Example evaluation and training pipeline¶
The following is an example training and validation loop for this image classification task.
- Does AIMET have any limitations on how the training, validation pipeline is written? Not really. We will see later that AIMET will modify the user’s model to create a QuantizationSim model which is still backed by a regular TensorFlow session. This QuantizationSim model can be used in place of the original model when doing inference or training. 
- Does AIMET put any limitation on the interface of the evaluate() or train() methods? Not really. You should be able to use your existing evaluate and train routines as-is. 
[ ]:
from typing import List
from Examples.common import image_net_config
from Examples.tensorflow.utils.image_net_evaluator import ImageNetDataLoader
from Examples.tensorflow.utils.image_net_evaluator import ImageNetEvaluator
from Examples.tensorflow.utils.image_net_trainer import ImageNetTrainer
class ImageNetDataPipeline:
    """
    Provides APIs for model evaluation and finetuning using ImageNet Dataset.
    """
    @staticmethod
    def get_val_dataloader():
        """
        Instantiates a validation dataloader for ImageNet dataset and returns it
        """
        data_loader = ImageNetDataLoader(TFRECORDS_DIR,
                                         image_size=image_net_config.dataset['image_size'],
                                         batch_size=image_net_config.evaluation['batch_size'],
                                         format_bgr=True)
        return data_loader
    @staticmethod
    def evaluate(sess: tf.Session) -> float:
        """
        Given a TF session, evaluates its Top-1 accuracy on the validation dataset
        :param sess: The sess graph to be evaluated.
        :return: Top-1 accuracy on the validation dataset.
        """
        evaluator = ImageNetEvaluator(TFRECORDS_DIR, training_inputs=['keras_learning_phase:0'],
                                      data_inputs=['input_1:0'], validation_inputs=['labels:0'],
                                      image_size=image_net_config.dataset['image_size'],
                                      batch_size=image_net_config.evaluation['batch_size'],
                                      format_bgr=True)
        return evaluator.evaluate(sess)
    @staticmethod
    def finetune(sess: tf.Session, update_ops_name: List[str], epochs: int, learning_rate: float, decay_steps: int):
        """
        Given a TF session, finetunes it to improve its accuracy
        :param sess: The sess graph to fine-tune.
        :param update_ops_name: list of name of update ops (mostly BatchNorms' moving averages).
                                tf.GraphKeys.UPDATE_OPS collections is always used
                                in addition to this list
        :param epochs: The number of epochs used during the finetuning step.
        :param learning_rate: The learning rate used during the finetuning step.
        :param decay_steps: A number used to adjust(decay) the learning rate after every decay_steps epochs in training.
        """
        trainer = ImageNetTrainer(TFRECORDS_DIR, training_inputs=['keras_learning_phase:0'],
                                  data_inputs=['input_1:0'], validation_inputs=['labels:0'],
                                  image_size=image_net_config.dataset['image_size'],
                                  batch_size=image_net_config.train['batch_size'],
                                  num_epochs=epochs, format_bgr=True)
        trainer.train(sess, update_ops_name=update_ops_name, learning_rate=learning_rate, decay_steps=decay_steps)
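For reference, the Top-1 accuracy that evaluate() reports can be sketched in plain Python as follows. The internals of ImageNetEvaluator are not shown in this notebook, so this is only an illustration of the metric, with made-up scores and labels.

```python
def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)

logits = [[0.1, 0.7, 0.2],   # predicts class 1
          [0.8, 0.1, 0.1],   # predicts class 0
          [0.3, 0.3, 0.4],   # predicts class 2
          [0.6, 0.3, 0.1]]   # predicts class 0
labels = [1, 0, 1, 2]
acc = top1_accuracy(logits, labels)   # two of four predictions match
```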
2. Load the model and evaluate to get a baseline FP32 accuracy score¶
For this example notebook, we are going to load a pretrained ResNet50 model from Keras and convert it to a TensorFlow session. Similarly, you can load any pretrained TensorFlow model instead.
Calling clear_session() releases the global state: this helps avoid clutter from old models and layers, especially when memory is limited.
By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. Since batchnorm ops are folded, these need to be ignored during training.
[ ]:
from tensorflow.compat.v1.keras.applications.resnet import ResNet50
tf.keras.backend.clear_session()
model = ResNet50(weights='imagenet', input_shape=(224, 224, 3))
update_ops_name = [op.name for op in model.updates] # Used for finetuning
The following utility method in AIMET sets BN layers in the model to eval mode. This allows AIMET to more easily read the BN parameters from the graph. Eventually we will fold BN layers into adjacent conv layers.
[ ]:
from aimet_tensorflow.utils.graph import update_keras_bn_ops_trainable_flag
model = update_keras_bn_ops_trainable_flag(model, load_save_path="./", trainable=False)
AIMET features currently support tensorflow sessions. add_image_net_computational_nodes_in_graph adds an output layer, softmax and loss functions to the Resnet50 model graph.
[ ]:
from Examples.tensorflow.utils.add_computational_nodes_in_graph import add_image_net_computational_nodes_in_graph
sess = tf.keras.backend.get_session()
# Creates the computation graph of ResNet within the tensorflow session.
add_image_net_computational_nodes_in_graph(sess, model.output.name, image_net_config.dataset['images_classes'])
Since all tensorflow input and output tensors have names, we identify the tensors needed by AIMET APIs here.
[ ]:
starting_op_names = [model.input.name.split(":")[0]]
output_op_names = [model.output.name.split(":")[0]]
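The split above turns a tensor name into an op name by dropping the output index after the colon. A tiny standalone illustration (the second name is a made-up example, not taken from the ResNet50 graph):

```python
def tensor_name_to_op_name(tensor_name: str) -> str:
    """'input_1:0' -> 'input_1' (drop the ':<output index>' suffix)."""
    return tensor_name.split(":")[0]

# 'input_1:0' appears in this notebook; the second name is hypothetical.
names = ["input_1:0", "example_scope/Softmax:0"]
op_names = [tensor_name_to_op_name(n) for n in names]
```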
We are checking if TensorFlow is using CPU or CUDA device. This example code will use CUDA if available in your current execution environment.
[ ]:
use_cuda = tf.test.is_gpu_available(cuda_only=True)
Let’s determine the FP32 (floating point 32-bit) accuracy of this model using the evaluate() routine
[ ]:
accuracy = ImageNetDataPipeline.evaluate(sess=sess)
print(accuracy)
3. Create a quantization simulation model and determine quantized accuracy¶
Fold Batch Normalization layers¶
Before we determine the simulated quantized accuracy using QuantizationSimModel, we will fold the BatchNormalization (BN) layers in the model. These layers get folded into adjacent Convolutional layers. The BN layers that cannot be folded are left as they are.
Why do we need to do this? On quantized runtimes (like TFLite, Snapdragon Neural Processing SDK, etc.), it is common practice to fold the BN layers. Doing so results in an inferences/sec speedup, since unnecessary computation is avoided. From a floating point compute perspective, a BN-folded model is mathematically equivalent at inference time to a model with BN layers, and produces the same accuracy. However, folding the BN layers can increase the range of the tensor values for the weight parameters of the adjacent layers. This can have a negative impact on the quantized accuracy of the model (especially when using INT8 or lower precision). So, we want to simulate that on-target behavior by doing BN folding here.
The following code calls AIMET to fold the BN layers in-place on the given model
[ ]:
from aimet_tensorflow.batch_norm_fold import fold_all_batch_norms
BN_folded_sess, _ = fold_all_batch_norms(sess,
                                         input_op_names=starting_op_names,
                                         output_op_names=output_op_names)
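To see why a BN-folded layer is equivalent in floating point yet can end up with a wider weight range, here is a minimal NumPy sketch. It uses a dense layer for brevity (Conv layers fold the same way, per output channel), and the function name and values are illustrative assumptions rather than AIMET code.

```python
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (w @ x + b - mean) / sqrt(var + eps) + beta into (w, b)."""
    scale = gamma / np.sqrt(var + eps)          # one factor per output channel
    return w * scale[:, None], (b - mean) * scale + beta

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 5))
b = rng.normal(size=3)
gamma = np.array([0.5, 1.0, 8.0])               # the last channel has a large BN scale
beta = rng.normal(size=3)
mean = rng.normal(size=3)
var = np.array([1.0, 1.0, 0.01])                # and a small running variance
x = rng.normal(size=5)

eps = 1e-5
y_bn = gamma * (w @ x + b - mean) / np.sqrt(var + eps) + beta
w_f, b_f = fold_batch_norm(w, b, gamma, beta, mean, var, eps)
y_folded = w_f @ x + b_f

range_before = np.abs(w).max()
range_after = np.abs(w_f).max()
```

The two outputs agree, but the folded weights of the last channel are multiplied by roughly 80, so a single per-tensor quantization grid has to cover a much larger range.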
Create Quantization Sim Model¶
Now we use AIMET to create a QuantizationSimModel. This basically means that AIMET will insert fake quantization ops in the model graph and will configure them. A few of the parameters are explained here:
- quant_scheme: We set this to “QuantScheme.post_training_tf_enhanced”. Supported options are ‘tf_enhanced’ or ‘tf’, or the Quant Scheme Enum values QuantScheme.post_training_tf or QuantScheme.post_training_tf_enhanced.
- default_output_bw: Setting this to 8 means that we are asking AIMET to perform all activation quantizations in the model using integer 8-bit precision.
- default_param_bw: Setting this to 8 means that we are asking AIMET to perform all parameter quantizations in the model using integer 8-bit precision.
- num_batches: The number of batches used for computing the quantization encodings. Only 5 batches are used here to speed up the process. The number of images in these 5 batches should be sufficient for computing encodings.
- rounding_mode: The rounding mode used for quantization. There are two possible choices here: ‘nearest’ or ‘stochastic’. We will use ‘nearest’.
There are other parameters that are set to default values in this example. Please check the AIMET API documentation of QuantizationSimModel to see reference documentation for all the parameters.
The next cell sets up the quantizer and quantizes the model. The new session that contains all the changes to the graph is sim.session, and this is then evaluated on the dataset. Note that the quantizer uses the same evaluate function as the one defined in our data pipeline when computing the new weights.
[ ]:
from aimet_common.defs import QuantScheme
from aimet_tensorflow.quantsim import QuantizationSimModel
sim = QuantizationSimModel(session=BN_folded_sess,
                           starting_op_names=starting_op_names,
                           output_op_names=output_op_names,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           rounding_mode="nearest",
                           default_output_bw=8,
                           default_param_bw=8,
                           use_cuda=use_cuda)
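What a fake quantization op does, and how the two rounding modes differ, can be sketched in plain Python. The scale/offset convention used here (x is approximated by scale * (q + offset), with q an unsigned integer) is an assumption chosen for illustration, and the function is hypothetical rather than AIMET's internal op.

```python
import math
import random

def fake_quantize(x, scale, offset, bitwidth=8, rounding="nearest", rng=random):
    """Quantize x to an integer grid and map it back to floating point."""
    q = x / scale - offset
    if rounding == "nearest":
        q = math.floor(q + 0.5)
    elif rounding == "stochastic":
        q = math.floor(q + rng.random())   # rounds up with probability frac(q)
    else:
        raise ValueError(f"unknown rounding mode: {rounding}")
    q = min(max(q, 0), 2 ** bitwidth - 1)  # clamp to the representable range
    return (q + offset) * scale

scale, offset = 0.1, -128                  # grid covers roughly [-12.8, 12.7]
nearest = fake_quantize(0.26, scale, offset)

rng = random.Random(0)
draws = [fake_quantize(0.26, scale, offset, rounding="stochastic", rng=rng)
         for _ in range(10000)]
stochastic_mean = sum(draws) / len(draws)
```

Nearest rounding always maps 0.26 to the grid point 0.3, while stochastic rounding returns 0.2 or 0.3 at random so that the average stays close to the original value.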
Compute Encodings¶
Even though AIMET has added ‘quantizer’ nodes to the model graph, the model is not ready to be used yet. Before we can use the sim model for inference or training, we need to find appropriate scale/offset quantization parameters for each ‘quantizer’ node. For activation quantization nodes, we need to pass unlabeled data samples through the model to collect range statistics which will then let AIMET calculate appropriate scale/offset quantization parameters. This process is sometimes referred to as calibration. AIMET simply refers to it as ‘computing encodings’.
So we create a routine to pass unlabeled data samples through the model. This should be fairly simple - use the existing train or validation data loader to extract some samples and pass them to the model. We don’t need to compute any loss metric etc. So we can just ignore the model output for this purpose. A few pointers regarding the data samples
- In practice, we need a very small percentage of the overall data samples for computing encodings. For example, the training dataset for ImageNet has 1M samples. For computing encodings we only need 500 or 1000 samples.
- It may be beneficial if the samples used for computing encodings are well distributed. Not all classes need to be covered, since we are only looking at the range of values at every layer activation. However, we definitely want to avoid an extreme scenario where all samples are ‘dark’ or ‘light’, e.g. only using pictures captured at night might not give ideal results.
The following shows an example of a routine that passes unlabeled samples through the model for computing encodings. This routine can be written in many different ways; this is just an example.
[ ]:
def pass_calibration_data(session: tf.Session, _):
    data_loader = ImageNetDataPipeline.get_val_dataloader()
    batch_size = data_loader.batch_size
    input_label_tensors = [session.graph.get_tensor_by_name('input_1:0'),
                           session.graph.get_tensor_by_name('labels:0')]
    train_tensors = [session.graph.get_tensor_by_name('keras_learning_phase:0')]
    train_tensors_dict = dict.fromkeys(train_tensors, False)
    eval_outputs = [session.graph.get_operation_by_name('top1-acc').outputs[0]]
    samples = 500
    batch_cntr = 0
    for input_label in data_loader:
        input_label_tensors_dict = dict(zip(input_label_tensors, input_label))
        feed_dict = {**input_label_tensors_dict, **train_tensors_dict}
        with session.graph.as_default():
            _ = session.run(eval_outputs, feed_dict=feed_dict)
        batch_cntr += 1
        if (batch_cntr * batch_size) > samples:
            break
Now we call AIMET to use the above routine to pass data through the model and then subsequently compute the quantization encodings. Encodings here refer to scale/offset quantization parameters.
[ ]:
sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=None)
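As a rough illustration of how range statistics turn into scale/offset encodings, the sketch below tracks the min/max seen over a few calibration batches and derives an 8-bit encoding. It is a plain min/max scheme under assumed conventions (x is approximated by scale * (q + offset)); the enhanced schemes offered by AIMET choose the range differently, and the activation values here are made up.

```python
def compute_encoding(batches, bitwidth=8):
    """Derive scale/offset from the min/max observed over calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    lo, hi = min(lo, 0.0), max(hi, 0.0)          # make sure 0.0 is representable
    num_steps = 2 ** bitwidth - 1
    scale = (hi - lo) / num_steps
    offset = round(lo / scale)                   # integer grid position of `lo`
    return scale, offset

# Activations collected from three hypothetical calibration batches.
batches = [[-1.0, 0.2, 0.9], [0.1, 1.5, -0.3], [0.0, 4.1, 2.2]]
scale, offset = compute_encoding(batches)
```

With an observed range of [-1.0, 4.1], the 255 steps of an 8-bit grid give a scale of about 0.02 and an offset of -50.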
Now the QuantizationSim model is ready to be used for inference or training. First we can pass this model to the same evaluation routine we used before. The evaluation routine will now give us a simulated quantized accuracy score for INT8 quantization instead of the FP32 accuracy score we saw before.
[ ]:
accuracy = ImageNetDataPipeline.evaluate(sim.session)
print(accuracy)
4.1 Cross-Layer Equalization¶
The next cell performs cross-layer equalization on the model. As noted before, the function folds batch norms, applies cross-layer scaling, and then folds high biases.
Note: Interestingly, CLE needs BN statistics for its procedure. If a BN folded model is provided, CLE will run the CLS (cross-layer scaling) optimization step but will skip the HBA (high-bias absorption) step. To avoid this, we simply load the original model again before running CLE.
Note: CLE equalizes the model in-place
[ ]:
from aimet_tensorflow import cross_layer_equalization as aimet_cle
cle_applied_sess = aimet_cle.equalize_model(sess,
                                            start_op_names=starting_op_names,
                                            output_op_names=output_op_names)
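The high-bias step mentioned in the note above relies on BN statistics. A minimal NumPy sketch of the idea, following the paper referenced at the top of this notebook, is shown below; the function names and numbers are illustrative assumptions, not AIMET's implementation.

```python
import numpy as np

def absorb_high_bias(b1, w2, b2, bn_beta, bn_gamma):
    """Move the part of layer 1's bias that is 'always on' into layer 2.

    c = max(0, beta - 3 * gamma): if the pre-activations are roughly
    N(beta, gamma^2), almost all of them exceed c.
    """
    c = np.maximum(0.0, bn_beta - 3.0 * bn_gamma)
    return b1 - c, b2 + w2 @ c, c

def forward(z_without_bias, b1, w2, b2):
    # z_without_bias stands for w1 @ x; a ReLU sits between the two layers.
    return w2 @ np.maximum(z_without_bias + b1, 0.0) + b2

b1 = np.array([4.0, 0.5])
w2 = np.array([[1.0, -2.0], [0.5, 3.0]])
b2 = np.array([0.1, -0.1])
bn_beta, bn_gamma = np.array([4.0, 0.5]), np.array([1.0, 1.0])

b1_new, b2_new, c = absorb_high_bias(b1, w2, b2, bn_beta, bn_gamma)

z = np.array([-2.0, -1.0])   # w1 @ x for one sample; z + b1 = [2.0, -0.5]
y_before = forward(z, b1, w2, b2)
y_after = forward(z, b1_new, w2, b2_new)
```

The output is unchanged whenever the pre-activation of an absorbed channel is at least c, which is why the step needs the BN mean and scale to pick a safe value for c.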
Now, we can determine the simulated quantized accuracy of the equalized model. We again create a simulation model like before and evaluate to determine simulated quantized accuracy.
[ ]:
sim = QuantizationSimModel(session=cle_applied_sess,
                           starting_op_names=starting_op_names,
                           output_op_names=output_op_names,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           rounding_mode="nearest",
                           default_output_bw=8,
                           default_param_bw=8,
                           use_cuda=use_cuda)
sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=None)
accuracy = ImageNetDataPipeline.evaluate(sim.session)
print(accuracy)
4.2 Bias Correction¶
This section shows how we can apply AIMET Bias Correction on top of the already equalized model from the previous step. Bias correction under the hood uses a reference FP32 model and a QuantizationSimModel to perform its procedure. More details are explained in the AIMET User Guide documentation.
For the correct_bias API, we pass the following parameters
- num_quant_samples: Number of samples used for computing encodings. We are setting this to a low number here to speed up execution. A typical number would be 500-1000. 
- num_bias_correct_samples: Number of samples used for bias correction. We are setting this to a low number here to speed up execution. A typical number would be 1000-2000. 
- data_loader: BC uses unlabeled data samples from this data loader. 
[ ]:
from aimet_tensorflow import bias_correction as aimet_bc
quant_params = aimet_bc.QuantParams(quant_mode= QuantScheme.post_training_tf_enhanced, round_mode="nearest",
                                    use_cuda=use_cuda, ops_to_ignore=[])
bias_correction_params = aimet_bc.BiasCorrectionParams(batch_size=56,
                                                       num_quant_samples=16,
                                                       num_bias_correct_samples=16,
                                                       input_op_names=starting_op_names,
                                                       output_op_names=output_op_names)
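# Unlabeled samples for bias correction come from the validation data loader.
data_loader = ImageNetDataPipeline.get_val_dataloader()
# Apply bias correction on top of the equalized session from the previous step.
after_bc_sess = aimet_bc.BiasCorrection.correct_bias(cle_applied_sess, bias_correct_params=bias_correction_params,
                                                     quant_params=quant_params,
                                                     data_set=data_loader.dataset)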
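The idea behind bias correction can be sketched for a single dense layer: quantizing the weights shifts the expected layer output, and that shift is subtracted from the bias. The NumPy code below is an illustration under simplifying assumptions (symmetric per-tensor weight quantization, no activation quantization, hypothetical function names); AIMET's procedure works on the full model.

```python
import numpy as np

def quantize_weights(w, bitwidth=8):
    """Symmetric per-tensor quantize-dequantize (an illustrative choice)."""
    scale = np.abs(w).max() / (2 ** (bitwidth - 1) - 1)
    return np.round(w / scale) * scale

def empirical_bias_correction(w, w_q, b, calibration_inputs):
    """Subtract the expected output shift caused by weight quantization."""
    mean_x = calibration_inputs.mean(axis=0)
    return b - (w_q - w) @ mean_x

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
b = rng.normal(size=4)
x = rng.normal(loc=1.0, size=(16, 8))      # 16 unlabeled calibration samples

w_q = quantize_weights(w, bitwidth=4)       # coarse grid to make the shift visible
b_corrected = empirical_bias_correction(w, w_q, b, x)

mean_fp32 = (x @ w.T + b).mean(axis=0)
mean_quant = (x @ w_q.T + b).mean(axis=0)
mean_fixed = (x @ w_q.T + b_corrected).mean(axis=0)
```

After correction, the mean output of the quantized layer over the calibration samples matches the FP32 mean, which is the bias that quantization had introduced.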
Now, we can determine the simulated quantized accuracy of the bias-corrected model. We again create a simulation model like before and evaluate to determine simulated quantized accuracy.
[ ]:
sim = QuantizationSimModel(session=after_bc_sess,
                           starting_op_names=starting_op_names,
                           output_op_names=output_op_names,
                           quant_scheme=QuantScheme.post_training_tf_enhanced,
                           rounding_mode="nearest",
                           default_output_bw=8,
                           default_param_bw=8,
                           use_cuda=use_cuda)
sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=None)
accuracy = ImageNetDataPipeline.evaluate(sim.session)
print(accuracy)
Depending on your settings, you may have observed a slight gain in accuracy after applying CLE and BC. Of course, this was just an example. Please try this against the model of your choice and play with the hyper-parameters to get the best results.
So we should have an improved model after CLE and BC. Now the next step would be to actually take this model to target. For this purpose, we need to export the model with the updated weights without the fake quant ops. AIMET QuantizationSimModel provides an export API for this purpose. This API would save the model as #TODO
[ ]:
os.makedirs('./output/', exist_ok=True)
sim.export(path='./output/', filename_prefix='resnet50_after_cle_bc')
Summary¶
Hope this notebook was useful for you to understand how to use AIMET for performing Cross Layer Equalization (CLE) and Bias Correction (BC).
A few additional resources:
- Refer to the AIMET API docs for more details on the APIs and optional parameters
- Refer to the other example notebooks to understand how to use AIMET post-training quantization techniques and QAT techniques