On-target inference¶
To run an AIMET quantized model on a target device, you need the following two things:
an exported model,
an encodings JSON file containing quantization parameters (such as encoding min/max/scale/offset) associated with each quantizer.
AIMET QuantizationSimModel provides the QuantizationSimModel.export() API to generate both items. The exported model format differs based on the framework used:
Framework | Format
---|---
PyTorch | .onnx
ONNX | .onnx
TensorFlow | .h5 or .pb
Qualcomm® AI Hub¶
Qualcomm® AI Hub simplifies deploying AI models on devices with runtimes such as Qualcomm® AI Engine Direct, TensorFlow Lite, and ONNX Runtime.
Once the AIMET-exported model and encodings JSON file have been obtained, these artifacts can be passed to Qualcomm® AI Hub for compilation, profiling, and inference.
Follow these instructions to compile the AIMET quantized model and then submit an inference job on the selected device.
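As a rough illustration of that flow, the qai_hub Python client can be used along these lines. This is a minimal sketch: the device name, file paths, and input name are placeholders; consult the AI Hub documentation for how the AIMET encodings are consumed during compilation.

import numpy as np
import qai_hub as hub

device = hub.Device("Samsung Galaxy S24 (Family)")  # placeholder device name

# Compile the AIMET-exported model for the chosen device.
compile_job = hub.submit_compile_job(
    model="exported/quantized_model.onnx",  # placeholder path to the exported model
    device=device,
)

# Run an inference job against the compiled target model.
inference_job = hub.submit_inference_job(
    model=compile_job.get_target_model(),
    device=device,
    inputs={"input": [np.random.rand(1, 3, 224, 224).astype(np.float32)]},  # placeholder input name/shape
)
outputs = inference_job.download_output_data()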
Qualcomm® AI Engine Direct SDK¶
Qualcomm® AI Engine Direct also enables running AI model inference on a device.
Once the AIMET-exported model and encodings JSON file have been obtained, the artifacts can be passed to the Qualcomm® AI Engine Direct tools for conversion, quantization, compilation, and execution.
Conversion¶
The Qualcomm® AI Engine Direct SDK qairt-converter tool converts a PyTorch/ONNX/TensorFlow model to an equivalent DLC (*.dlc) graph representation.
The encodings JSON file generated by the AIMET workflow is provided as an input to this step via the --quantization_overrides option.
Basic command line usage looks like:
qairt-converter --input_network <AIMET_exported_model_path> \
    --quantization_overrides <AIMET_exported_model.encodings> \
    --output_path <non-quantized_dlc>
arguments:
--input_network <AIMET_exported_model_path>
Path to the AIMET exported (PyTorch/ONNX/TensorFlow) model
--quantization_overrides <AIMET_exported_model.encodings>
Path to the AIMET exported encodings JSON file containing quantization parameters
--output_path <non-quantized_dlc>
Path where the converted non-quantized DLC (*.dlc) should be saved.
This step generates a DLC (*.dlc) file that represents the model as a series of QAIRT API calls.
Please refer to the Qualcomm® AI Engine Direct documentation for more details.
Quantization¶
The Qualcomm® AI Engine Direct SDK qairt-quantizer tool converts a non-quantized DLC (*.dlc) model into a quantized DLC (*.dlc) model.
Basic command line usage looks like:
qairt-quantizer --input_dlc <non-quantized_dlc> --output_dlc <quantized_dlc> \
    --float_fallback
arguments:
--input_dlc <non-quantized_dlc>
Path to the non-quantized DLC (*.dlc) container containing the model
--output_dlc <quantized_dlc>
Path at which the quantized DLC (*.dlc) container will be saved.
--float_fallback
Enables falling back to FP32 for ops whose quantization parameters are missing from the provided encodings JSON file.
Please refer to the Qualcomm® AI Engine Direct documentation for more details.
Compilation¶
The Qualcomm® AI Engine Direct SDK qnn-context-binary-generator tool compiles the quantized DLC (*.dlc) from the previous step into a QNN serialized context binary for the Qualcomm® AI Engine Direct HTP backend.
Basic command line usage looks like:
qnn-context-binary-generator --model <libQnnModelDlc.so> --backend <libQnnHtp.so> \
    --dlc_path <quantized_dlc> \
    --output_dir <output_dir_path> \
    --binary_file <binary_file_name>
arguments:
--model <libQnnModelDlc.so>
Path to QNN <libQnnModelDlc.so> file.
--backend <libQnnHtp.so>
Path to a QNN backend <libQnnHtp.so> library to create the context binary.
--dlc_path <quantized_dlc>
Path to the quantized DLC (*.dlc) from which to load the model.
--output_dir <output_dir_path>
The directory to save output to.
--binary_file <binary_file_name>
Name of the binary file (saved with a .bin extension) in which to store the serialized context binary.
Upon completion of this step, the QNN context binary for the model is available at <output_dir_path>/<binary_file_name>.bin.
Please refer to the Qualcomm® AI Engine Direct documentation for additional HTP backend-specific optional arguments.
Execution¶
The Qualcomm® AI Engine Direct SDK qnn-net-run tool executes the model (represented as a serialized context binary) on the desired target.
Basic command line usage looks like:
qnn-net-run --backend <libQnnHtp.so> --retrieve_context <binary_file_name> \
    --input_list <input_list.txt> --output_dir <output_dir_path>
arguments:
--backend <libQnnHtp.so>
Path to a QNN backend <libQnnHtp.so> library to execute the model.
--retrieve_context <binary_file_name>
Path to the serialized context binary from which to load the saved context.
--input_list <input_list.txt>
Path to a text file listing the raw input files for the model; a minimal sketch for preparing such a file follows this list.
--output_dir <output_dir_path>
The directory to save output to.
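As an illustration, a raw input file and a single-entry input list could be prepared along these lines. This is a minimal sketch assuming one float32 input of shape 1x3x224x224; the exact raw layout and input-list syntax expected by qnn-net-run are described in the Qualcomm® AI Engine Direct documentation.

import numpy as np

# Serialize a sample input tensor to a raw binary file (placeholder shape and dtype).
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
x.tofile("input_0.raw")

# Each line of the input list corresponds to one inference.
with open("input_list.txt", "w") as f:
    f.write("input_0.raw\n")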
Please refer to the Qualcomm® AI Engine Direct documentation for additional HTP backend-specific optional arguments.