Automatic Mixed-Precision (AMP)

This notebook shows a working code example of how to use AIMET to perform Automatic Mixed Precision (AMP). AMP is a technique where, given a quantized accuracy target, AIMET finds a per-layer bit precision that meets the accuracy target while trying to optimize the model for inference speed.

As an example, say a particular model does not meet the desired accuracy target when run in INT8. The Automatic Mixed Precision feature will find a minimal set of layers that need to run at, say, INT16 to reach the desired accuracy. Note that choosing higher precision for some layers necessarily involves a trade-off: fewer inferences/sec in exchange for higher accuracy, and vice versa.

Alternatively, the AMP feature can be used to generate a Pareto curve (accuracy vs. bit-ops) that can help the user decide the right operating point for this trade-off.

This notebook specifically shows a working code example for the above.

Overall flow

This notebook covers the following

  1. Instantiate the example evaluation method

  2. Load the FP32 model and evaluate the model to find the baseline FP32 accuracy

  3. Create a quantization simulation model (with fake quantization ops inserted)

  4. Run AMP algorithm on the quantized model

    1. Using the Regular AMP method

    2. Using the Fast AMP Method (AMP 2.0)

What this notebook is not

  • This notebook is not designed to show state-of-the-art AMP results. For example, it uses a relatively quantization-friendly model like ResNet50. Also, some optimization parameters, such as the number of samples used for evaluation, are deliberately chosen so that the notebook executes quickly.


Dataset

This notebook relies on the ImageNet dataset for the task of image classification. If you already have a version of the dataset readily available, please use that. Otherwise, please download the dataset from an appropriate location (e.g. https://image-net.org/challenges/LSVRC/2012/index.php#).

[ ]:
DATASET_DIR = '/path/to/dataset'        # Please replace this with a real directory
BATCH_SIZE = 32

We reduce TensorFlow logging noise: setting TF_CPP_MIN_LOG_LEVEL to "2" filters out INFO and WARNING messages from the C++ backend, and we set the Python-side verbosity to ERROR so that only messages at the ERROR level (or more critical) are displayed.

[ ]:
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = "2"
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

1. Instantiate the example evaluation method

The following is an example evaluation function which we will use to evaluate the accuracy of the model as well as to perform a forward pass on the model. AIMET needs the forward pass to calculate the range of values at the activations of each layer. We will use this one function for both evaluation and the forward pass.

[ ]:
import numpy as np

from tensorflow.keras.applications.resnet import preprocess_input, decode_predictions

def center_crop(image):
    """Center-crop a batch of 256x256 images to 224x224."""
    img_height = 256
    img_width = 256
    crop_length = 224

    start_x = (img_height - crop_length) // 2
    start_y = (img_width - crop_length) // 2
    cropped_image = image[:, start_x:(img_width - start_x), start_y:(img_height - start_y), :]

    return cropped_image


def get_eval_func(dataset_dir, batch_size, num_iterations=50000, debug=False, get_top5_acc=False):

    def func_wrapper(model, iterations=None, use_cuda=True):

        validation_ds = tf.keras.preprocessing.image_dataset_from_directory(
            directory=dataset_dir,
            labels='inferred',
            label_mode='categorical',
            batch_size=batch_size,
            shuffle = False,
            image_size=(256, 256))
        # If no iterations specified, set to full validation set
        if not iterations:
            iterations = num_iterations
        else:
            iterations = iterations * batch_size
        top1 = 0
        top5 = 0
        total = 0
        for (img,label) in validation_ds:
            img = center_crop(img)
            x = preprocess_input(img)
            preds = model.predict(x,batch_size = batch_size)
            label = np.where(label)[1]
            label = [validation_ds.class_names[int(i)] for i in label]
            cnt = sum([1 for a, b in zip(label, decode_predictions(preds, top=1)) if str(a) == b[0][0]])
            top1 += cnt
            cnt = sum([1 for a, b in zip(label, decode_predictions(preds, top=5)) if str(a) in [i[0] for i in b]])
            top5 += cnt
            total += len(label)
            if total >= iterations:
                break
        if get_top5_acc:
            return top1/total, top5/total
        else:
            return top1/total
    return func_wrapper



# Instantiate the evaluation function
eval_func = get_eval_func(DATASET_DIR, BATCH_SIZE)

2. Load the FP32 model and evaluate the model to find the baseline FP32 accuracy

For this example notebook, we are going to load a pretrained ResNet50 model from Keras. Similarly, you can load any other pretrained TensorFlow model instead.

[ ]:
from tensorflow.keras.applications.resnet import ResNet50
from aimet_tensorflow.keras.batch_norm_fold import fold_all_batch_norms

def get_model():
    model = ResNet50(
        include_top=True,
        weights="imagenet",
        input_tensor=None,
        input_shape=None,
        pooling=None,
        classes=1000)

    return model

model = get_model()

# Perform batch norm folding on the loaded model
_ = fold_all_batch_norms(model)

# Calculate the FP32 model accuracy
fp32_accuracy = eval_func(model, None)


3. Create a quantization simulation model (with fake quantization ops inserted)

Now we use AIMET to create a QuantizationSimModel. This basically means that AIMET will insert fake quantization ops in the model graph and will configure them.

A few of the parameters are explained here

  • quant_scheme: We set this to “QuantScheme.post_training_tf_enhanced”

    • Supported options are ‘tf_enhanced’ or ‘tf’ or using Quant Scheme Enum QuantScheme.post_training_tf or QuantScheme.post_training_tf_enhanced

  • default_output_bw: Setting this to 8 essentially means that we are asking AIMET to perform all activation quantizations in the model using integer 8-bit precision

  • default_param_bw: Setting this to 8 essentially means that we are asking AIMET to perform all parameter quantizations in the model using integer 8-bit precision

There are other parameters that are set to default values in this example. Please check the AIMET API documentation of QuantizationSimModel to see reference documentation for all the parameters.

[ ]:
from aimet_common.defs import QuantScheme
from aimet_tensorflow.keras.quantsim import QuantizationSimModel

sim = QuantizationSimModel(
        model=model,
        quant_scheme=QuantScheme.post_training_tf_enhanced,
        rounding_mode="nearest",
        default_output_bw=8,
        default_param_bw=8
    )

Compute Encodings

Even though AIMET has added ‘quantizer’ nodes to the model graph, the model is not ready to be used yet. Before we can use the sim model for inference or training, we need to find appropriate scale/offset quantization parameters for each ‘quantizer’ node. For activation quantization nodes, we need to pass unlabeled data samples through the model to collect range statistics, which then let AIMET calculate appropriate scale/offset quantization parameters. This process is sometimes referred to as calibration. AIMET simply refers to it as ‘computing encodings’.

The following shows an example of computing encodings by passing unlabeled samples through the model; here we simply reuse the evaluation function defined earlier as the forward-pass callback. This routine can be written in many different ways; this is just one example.

[ ]:
sim.compute_encodings(eval_func, forward_pass_callback_args=500)
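
Alternatively, a dedicated, label-free forward-pass routine can be used instead of reusing the evaluation function. Below is a minimal sketch of such a routine (illustrative only); it assumes the same validation dataset and preprocessing as the evaluation function above, and the helper name pass_calibration_data is our own.

[ ]:
def pass_calibration_data(model, num_samples=500):
    # Pass unlabeled samples through the model purely to collect range
    # statistics; labels and model outputs are ignored.
    dataset = tf.keras.preprocessing.image_dataset_from_directory(
        directory=DATASET_DIR,
        labels='inferred',
        label_mode='categorical',
        batch_size=BATCH_SIZE,
        shuffle=False,
        image_size=(256, 256))

    samples_seen = 0
    for img, _ in dataset:
        model.predict(preprocess_input(center_crop(img)), batch_size=BATCH_SIZE)
        samples_seen += img.shape[0]
        if samples_seen >= num_samples:
            break

# Example usage (equivalent in spirit to the call above):
# sim.compute_encodings(pass_calibration_data, forward_pass_callback_args=500)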

4. Run AMP algorithm on the quantized model

The AMP algorithm runs in three phases (phase 3 is optional). Phase 1 calculates the sensitivity of each layer. Phase 2 greedily selects which layers should have which bitwidth, based on the options provided by the user. Phase 3 derives a set of mixed-precision solutions with less bitwidth convert-op overhead than the original phase-2 solution. Phases 1 and 2 require passing data through the model.
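
Conceptually, the greedy selection in phase 2 can be pictured as follows. The sketch below is a highly simplified, hypothetical illustration of the idea only (it is not AIMET's implementation): starting from the higher-precision candidate everywhere, the least sensitive layers are tried at the lower-precision candidate first, and a change is kept only if the accuracy drop stays within budget.

[ ]:
# Hypothetical sketch of the greedy phase-2 idea (NOT AIMET's implementation)
def greedy_mixed_precision_sketch(layers, sensitivity, evaluate,
                                  baseline_acc, allowed_drop,
                                  high_candidate, low_candidate):
    # Start with every layer at the higher-precision candidate
    precision = {layer: high_candidate for layer in layers}
    # Try demoting the least sensitive layers first
    for layer in sorted(layers, key=lambda l: sensitivity[l]):
        trial = {**precision, layer: low_candidate}
        if baseline_acc - evaluate(trial) <= allowed_drop:
            precision = trial  # keep the lower precision for this layer
    return precision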

So we create a routine to pass unlabeled data samples through the model. This should be fairly simple: use the existing train or validation data loader to extract some samples and pass them to the model. We don't need to compute any loss metric, so we can just ignore the model output for this purpose. A few pointers regarding the data samples:

  • In practice, we only need a very small percentage of the overall data samples. For example, the training dataset for ImageNet has about 1M samples, but for phase 1 we only need 500 or 1000 of them, whereas for phase 2 it is recommended to use all of the validation data. This is what keeps AMP execution fast, and it is why we define two separate callbacks for phase 1 and phase 2.

  • For phase 2, if a large-enough subset of the samples provides a meaningful accuracy score, that subset can be used instead to speed up the AMP algorithm (an example is shown after the Regular AMP callbacks below).

  • It is beneficial if the samples used for the forward pass are well distributed. It is not necessary that all classes be covered, since we are only looking at the range of values at every layer activation. However, we definitely want to avoid an extreme scenario, e.g. using only 'dark' or 'light' samples, such as pictures captured at night, which might not give ideal results.

There are two methods for doing AMP in Keras; you can opt for either one:

    1. Regular AMP

    2. Fast AMP (AMP 2.0)

Parameters for AMP algorithm

A few of the parameters required for AMP are explained below

  • candidates : A list of tuples of all possible bitwidth and data-type combinations for activations and parameters. For example, if the possible combinations are ((activation bitwidth 8, activation data type int), (parameter bitwidth 16, parameter data type int)) and ((activation bitwidth 16, activation data type float), (parameter bitwidth 16, parameter data type float)), then candidates will be [((8, QuantizationDataType.int), (16, QuantizationDataType.int)), ((16, QuantizationDataType.float), (16, QuantizationDataType.float))]

  • allowed_accuracy_drop : Maximum allowed drop in accuracy from the FP32 baseline. The Pareto front curve is plotted only up to the point where the allowed accuracy drop is met. To get a complete plot for picking points on the curve, set the allowed accuracy drop to None.

  • results_dir : Path to save results and cache intermediate results

  • clean_start : If true, any cached information from previous runs will be deleted prior to starting the mixed-precision analysis. If false, prior cached information will be used if applicable. Note it is the user’s responsibility to set this flag to true if anything in the model or quantization parameters changes compared to the previous run.

[ ]:
from aimet_common.defs import QuantizationDataType

candidates = [((16, QuantizationDataType.int), (8, QuantizationDataType.int)),
              ((8, QuantizationDataType.int), (8, QuantizationDataType.int))]

allowed_accuracy_drop = 0.01

results_dir = '/path/to/where/we/want/to/store/intermediate/and/final/results'

Regular AMP

For this method we need the regular evaluation function discussed above, wrapped in CallbackFunc objects.

[ ]:
from aimet_common.defs import CallbackFunc

eval_callback_phase1 = CallbackFunc(eval_func, 500)
eval_callback_phase2 = CallbackFunc(eval_func, None)
forward_pass_call_back = CallbackFunc(eval_func, 500)
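
As mentioned above, if a large-enough subset of the validation data gives a meaningful accuracy score, phase 2 can be sped up by passing an iteration count instead of None. The value below is only a hypothetical choice for illustration (note that eval_func multiplies it by the batch size).

[ ]:
# Optional: use a subset of the validation data for phase 2 (hypothetical value)
# eval_callback_phase2 = CallbackFunc(eval_func, 1000)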

API Call for Regular AMP

[ ]:
from aimet_tensorflow.keras.mixed_precision import choose_mixed_precision
from aimet_tensorflow.keras.amp.mixed_precision_algo import GreedyMixedPrecisionAlgo

# Enable phase-3 (optional)
# GreedyMixedPrecisionAlgo.ENABLE_CONVERT_OP_REDUCTION = True
# Note: supported candidates ((8,int), (8,int)) & ((16,int), (8,int))

choose_mixed_precision(sim, candidates, eval_callback_phase1, eval_callback_phase2, allowed_accuracy_drop,
                       results_dir, True, forward_pass_call_back)
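
After choose_mixed_precision returns, the quantization simulation model has been updated in place with the selected per-layer candidates. Optionally, the mixed-precision accuracy can be checked by re-running the evaluation function on the simulated model; here we assume sim.model exposes the quantized Keras model, as used elsewhere in this notebook.

[ ]:
# Optionally re-evaluate the mixed-precision model on the validation set
mixed_precision_accuracy = eval_func(sim.model, None)
print(f"FP32 accuracy: {fp32_accuracy:.4f}, mixed-precision accuracy: {mixed_precision_accuracy:.4f}")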

Fast AMP (AMP 2.0)

In this method of AMP, instead of using the accuracy score for evaluation in phase 1, we use the SQNR score. This speeds up the phase-1 computation of AMP and saves some time. To use this version we need a data loader wrapper instead of the phase-1 evaluation callback. Below is sample code for the wrapper, followed by a sample call to the fast AMP API.

[ ]:
def get_data_loader_wrapper(dataset_dir, batch_size, is_training=False):

    def dataloader_wrapper():
        dataloader = tf.keras.preprocessing.image_dataset_from_directory(
            directory=dataset_dir,
            labels='inferred',
            label_mode='categorical',
            batch_size=batch_size,
            shuffle = is_training,
            image_size=(256, 256))

        return dataloader.map(lambda x, y: preprocess_input(center_crop(x)))

    return dataloader_wrapper

data_loader_wrapper = get_data_loader_wrapper(DATASET_DIR, BATCH_SIZE)
[ ]:
from aimet_tensorflow.keras.mixed_precision import choose_fast_mixed_precision
from aimet_tensorflow.keras.amp.mixed_precision_algo import GreedyMixedPrecisionAlgo

# Enable phase-3 (optional)
# GreedyMixedPrecisionAlgo.ENABLE_CONVERT_OP_REDUCTION = True
# Note: supported candidates ((8,int), (8,int)) & ((16,int), (8,int))

choose_fast_mixed_precision(sim, candidates, data_loader_wrapper, eval_callback_phase2, allowed_accuracy_drop,
                            results_dir, True, forward_pass_call_back)

We now have a mixed-precision model after AMP. The next step is to actually take this model to target, and for that we need to export the model.

[ ]:
os.makedirs('./output/', exist_ok=True)
sim.export(path='./output/', filename_prefix='resnet50_after_amp')

Summary

We hope this notebook was useful for understanding how to use Automatic Mixed Precision in Keras. For more details about the parameters and configuration, please refer to the AIMET API documentation for mixed precision.
