Operator and Datatype support¶
Operator support¶
In this section we will discuss the operator support available for the device across various frameworks.
To determine the list of operators supported by the device for a given framework, run the following command.
Example command to generate the list of supported operators for onnx:
/opt/qti-aic/exec/qaic-exec -operators-supported=onnx
This command generates a file, OnnxSupportedOperators.txt, which contains a comprehensive list of supported operators.
It is important to note that operator support keeps expanding with the release of new SDK versions.
Info
-operators-supported accepts only onnx, tensorflow, and pytorch.
Info
onnx is the preferred format for compiling models for the device.
Handling Unsupported Operators¶
In some cases, you might encounter errors related to unsupported operations while compiling the model for the device.
For instance, certain operations like einsum present in the model file might not be directly supported by the device.
In such scenarios, the Model Preparator tool can be employed to modify the model and substitute these unsupported operations with mathematically equivalent subgraphs.
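Before turning to the Model Preparator tool, it can help to identify exactly which operators in a model are unsupported. The sketch below is illustrative only and not part of the SDK: it assumes the file generated by -operators-supported lists one operator name per line, and the paths shown are placeholders.
# Illustrative sketch: cross-check a model's operators against the list
# generated by qaic-exec -operators-supported=onnx. Assumes one operator
# name per line in the generated file; adjust parsing if your SDK differs.
import onnx

def find_unsupported_ops(model_path: str, supported_ops_path: str) -> set:
    with open(supported_ops_path) as f:
        supported = {line.strip() for line in f if line.strip()}
    model = onnx.load(model_path)
    ops_in_model = {node.op_type for node in model.graph.node}
    return ops_in_model - supported

# Example usage (placeholder paths):
# print(find_unsupported_ops("model.onnx", "OnnxSupportedOperators.txt"))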
Datatype Support¶
FP32 (Single Precision Floating Point)¶
Models can be executed in FP32 for use cases where accuracy is critical and computational efficiency is not a primary concern. Note that FP32 models tend to be larger and exhibit lower throughput. FP32 execution is supported but not recommended. To keep a model in FP32, do not pass the -convert-to-fp16 flag to qaic-exec (the compiler CLI) during compilation.
FP16 (Half Precision Floating Point)¶
FP16 strikes a balance between accuracy and efficiency, making it suitable for most deep learning workloads.
If a model is originally trained in FP32, it can be down-converted to FP16 during compilation using the -convert-to-fp16 flag. However, some models contain constants outside the FP16 range. In such cases, it is recommended to clip those values to the FP16 range (as demonstrated in the fix_onnx_fp16 function in the NLP tutorials of the Cloud-ai-sdk repo).
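For reference, the following is a simplified sketch of that clipping step, not the SDK's fix_onnx_fp16 function itself; it only handles FP32 initializers, and the file paths are placeholders.
# Simplified sketch: clip out-of-range FP32 initializers before FP16 conversion.
# This is not the SDK's fix_onnx_fp16; constants produced by Constant nodes and
# other corner cases are not handled here.
import numpy as np
import onnx
from onnx import numpy_helper

FP16_MAX = float(np.finfo(np.float16).max)  # ~65504

def clip_initializers_to_fp16(model_path: str, out_path: str) -> None:
    model = onnx.load(model_path)
    for init in model.graph.initializer:
        if init.data_type != onnx.TensorProto.FLOAT:
            continue  # only FP32 initializers are considered in this sketch
        weights = numpy_helper.to_array(init)
        if np.abs(weights).max() > FP16_MAX:
            clipped = np.clip(weights, -FP16_MAX, FP16_MAX)
            init.CopyFrom(numpy_helper.from_array(clipped, init.name))
    onnx.save(model, out_path)

# Example usage (placeholder paths):
# clip_initializers_to_fp16("model.onnx", "model_fp16_safe.onnx")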
FP8 (8-bit Floating Point)¶
Models in FP8 format are supported through Qualcomm Efficient-Transformers.
Shared Micro-exponents (Narrow Precision Format)¶
The shared micro-exponent spec is here. The Cloud AI compiler will support all of the formats in the spec over time; currently, MXFP6 is supported.
Models compiled in MXFP6 format store FP32/FP16 weights using a 6-bit representation, which significantly reduces model size. This format is particularly beneficial for models that require high data bandwidth, such as large language models (LLMs).
AI 100 stores MatMul weights in MXFP6 format while keeping the rest of the weights in FP16 format. Computation/activations on the NSP still occur in FP16.
LLMs see up to 2x higher throughput with minimal accuracy loss when using MXFP6. Within a constant memory footprint, a larger model can be supported with MXFP6.
FP32 models can be compiled into MXFP6 format using the -mxfp6-matmul compiler flag. For FP16 execution with MXFP6 weights, pass both the -convert-to-fp16 and -mxfp6-matmul flags to the qaic-exec compiler CLI.
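For illustration, such a compile step can be driven from Python as below. Only -convert-to-fp16 and -mxfp6-matmul come from this page; the model-path flag (-m=) and the paths shown are assumptions and may differ across SDK versions.
# Illustrative sketch: invoke the qaic-exec compiler CLI with MXFP6 MatMul
# weights and FP16 activations. The -m= model-path flag and the paths are
# assumptions; check your SDK version's qaic-exec help for exact options.
import subprocess

cmd = [
    "/opt/qti-aic/exec/qaic-exec",
    "-m=model.onnx",        # assumed model-path flag; placeholder model file
    "-convert-to-fp16",     # execute activations in FP16
    "-mxfp6-matmul",        # store MatMul weights in MXFP6
]
subprocess.run(cmd, check=True)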
INT8 (8-bit Integer)¶
AI 100 supports INT8 quantized models, especially relevant for Natural Language Processing (NLP) and Computer Vision (CV) tasks.
Quantization Methods Supported:
Quantization Schema for Weights and Activations: Both symmetric and asymmetric.
Quantization Calibration: Options include KLMinimization, KLMinimizationV2, MSE, SQNR, and Percentile (with percentile calibration values: 99.9, 99.99, 99.999, 99.9999).
The SDKs provide tools to run a profile-guided quantization (PGQ) sweep, allowing you to identify the optimal quantization parameters for your specific requirements.
INT4 (4-bit Integer)¶
AWQ and GPTQ quantized models with 4-bit weights are supported through Qualcomm Efficient-Transformers.
BF16 (BFloat16)¶
If a model is trained in BF16 (bfloat16), ensure that the weights are scaled down using an appropriate scaling factor. This prevents intermediate activations from overflowing the FP16 range. Qualcomm can provide a script that identifies scaling factors for scaling down the model weights such that intermediate activations do not overflow FP16.