
Inference workflow on Cloud AI

Developers can deploy a pre-trained model for inference on any Cloud AI platform in three steps:

  1. Export the model in ONNX format and prepare it for compilation
  2. Compile the model to generate a QPC (Qaic Program Container)
  3. Execute, integrate, and deploy into the production pipeline

[Figure: AIC Workflow]

The Cloud AI SDK provides tools for each of these steps, as shown below.

[Figure: AIC Workflow Details]

Export and prepare the Model

Exporting and preparing the pre-trained model as described in this section is required to extract the best performance and accuracy. Exporting the model to ONNX format is strongly recommended because of the SDK's broad ONNX operator support. The ONNX file is then run through a few checks (for model accuracy and performance) before it is compiled in the next step.
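As a concrete sketch, the export step for a PyTorch model might look like the following. The ResNet-50 model, input shape, tensor names, and opset choice here are illustrative assumptions, not SDK requirements; substitute your own model and shapes.

```python
import torch
import torchvision

# Load a pre-trained model (illustrative choice) and switch to inference mode.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.eval()

# A dummy input fixes the shapes traced during export.
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX; the input/output names and opset version are
# illustrative choices, not values mandated by the Cloud AI SDK.
torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```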

Compile the Model

Compiling the prepared ONNX file generates a QPC, which can be loaded and executed on Cloud AI devices. To get the best performance from the device for a given throughput and latency requirement, the model needs to be compiled with the right parameters. The Tune the performance section goes over how to derive the best compilation parameters.
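A representative compilation call using the SDK's qaic-exec tool is sketched below, driven from Python for consistency with the export step. The flag values (precision, core count, output directory) are illustrative tuning choices, and exact flag names can vary across SDK versions; consult `qaic-exec -h` on your installation.

```python
import subprocess

# Representative qaic-exec invocation; flag names follow the Cloud AI SDK
# documentation but may differ across SDK versions -- check qaic-exec -h.
cmd = [
    "/opt/qti-aic/exec/qaic-exec",
    "-m=resnet50.onnx",     # prepared ONNX file from the previous step
    "-aic-hw",              # compile for Cloud AI hardware
    "-convert-to-fp16",     # illustrative precision choice
    "-aic-num-cores=4",     # cores per activation; a key tuning parameter
    "-compile-only",        # stop after generating the QPC
    "-aic-binary-dir=qpc",  # output directory for the QPC
]
subprocess.run(cmd, check=True)
```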

Execute, integrate, and deploy in the production pipeline

This section goes over the different ways (CLI, C++, Python, etc.) for developers to execute inferences on Cloud AI platforms, as well as the tools and utilities available for integrating them into an inference application and deploying it in a production pipeline.
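As a sketch of the Python path, running one inference against the compiled QPC might look like the following. The `qaic.Session` parameter and attribute names here follow the SDK's Python examples and should be treated as assumptions that may vary across SDK versions; the input name "input" and the QPC path match the illustrative export and compile steps above.

```python
import numpy as np
import qaic  # Python API shipped with the Cloud AI SDK (assumed available)

# Load the compiled QPC onto a Cloud AI device.
sess = qaic.Session(model_path="qpc/programqpc.bin", num_activations=1)
sess.setup()

# Query the expected input shape/dtype, then run one inference
# on random data as a placeholder for real preprocessed input.
input_shape, input_type = sess.model_input_shape_dict["input"]
data = np.random.rand(*input_shape).astype(input_type)
output = sess.run({"input": data})
```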