
qaic package


qaic - The qaic package provides a way to run inference on a Qualcomm Cloud AI 100 card.

Description


A user can create a session object by passing either:

  1. a .onnx file, or

  2. a precompiled QPC as model_path, in case the user already has a compiled QPC. The full path to qpc.bin should be passed when using a precompiled binary.

Note

QPC : Qualcomm Program Container

Example on how to run an inference

option 1 : compiles the onnx file to generate a QPC and sets up the session for inference
import qaic
import numpy as np
sess = qaic.Session('/path/to/model/model.onnx') 
input_dict = {'input_name': input_data}
output = sess.run(input_dict)
option 2 : uses a pre-generated QPC and sets up the session for inference
import qaic
import numpy as np
sess = qaic.Session('/path/to/model/qpc.bin') # option 2 : session uses the compiled QPC file
input_dict = {'input_name': input_data}
output = sess.run(input_dict)
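
The snippets above leave input_data undefined. As a minimal sketch, assuming a single image-like input, it could be prepared with numpy as follows; the input name, shape, and dtype are placeholders, so use the values from your own model (for example via model_input_shape_dict(), described later on this page):

import qaic
import numpy as np

sess = qaic.Session('/path/to/model/model.onnx')
# Placeholder input name, shape, and dtype for illustration only;
# query the real ones with sess.model_input_shape_dict().
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
input_dict = {'input_name': input_data}
output = sess.run(input_dict)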

Example for benchmarking

import qaic
sess = qaic.Session(model_path='/path/to/model', backend='aic', options_path = '/path/to/yaml') # model_path can be either onnx or precompiled qpc
inf_completed, inf_rate, inf_time, batch_size = sess.run_benchmark()

Limitations

  • Currently only the QAic backend is supported. Support for the QNN backend is planned for future releases.
  • These APIs are compatible only with Python 3.8
  • These APIs are supported only on x86-64 platforms

class Session


Session is the entry point of these APIs. Session is a factory method that the user calls to create a session instance. A model is compiled by default when a session is created.

Session(model_path, **kwargs)

Session creates a session object based on the qpc provided.

Parameters

Parameter Type Description
model_path str path to .onnx file or .bin file i.e. the compiled model QPC
**kwargs refer to the keyword arguments listed below
Keyword Arguments Type Description
dev_id int Device on which to run the inference. Default is 0
num_activations int Number of instances of the network to be activated.
set_size int Number of ExecObj to be created.
mos int Effort level to reduce the on-chip memory.
ols int Factor to increase the splitting of the network for parallelism
aic_num_cores int Number of aic cores to be used for inference
convert_to_fp16 bool Run all floating-point operations in fp16
onnx_define_symbol (list[tuple(str, int)]) Define an onnx symbol with its value
output_dir str Stores model binaries at directory location provided
output_node_names (list[str]) Output node names, given in the order in which they appear in the model file. This option is mandatory for TF models
model_inputs (list[dict]) Provide input node names with their data types and shapes. Each dict must contain the keys 'input_name', 'input_type', 'input_shape'. This is mandatory for pytorch models
allocator_dealloc_delay int Option to increase the lifetime of buffers to reduce false dependencies
size_split_granularity int Option to specify a maximum tile size target for operations that may be too large to execute out of fast memory. Tile size in KiB between 512 - 2048
vtcm_working_set_limit_ratio float Option to specify the maximum ratio of fast memory to DDR that any single instruction is allowed to use
execute_nodes_in_fp16 (list[str]) Run all instances of the operators in this list in FP16
node_precision_info_file str Load a model-loader precision file containing the first output name of the operator instances required to be executed in FP16 or FP32.
keep_original_precision_for_nodes (list[str]) Run all instances of the operators in this list with their original precision during generation of the quantized model, even if the operator is supported in Int8 precision
custom_io_list_file str Custom I/O config file in yaml format containing layout, precision, scale and offset for each input and output of the model.
dump_custom_io_config_template_file str Dumps the yaml template for Custom I/O configuration
external_quantization_file str Load the externally generated quantization profile
quantization_schema_activations str Specify which quantization schema to use for activations. Valid options: asymmetric, symmetric, symmetric_with_uint8 (default), symmetric_with_power2_scale
quantization_schema_constants str Specify which quantization schema to use for constants. Valid options: asymmetric, symmetric, symmetric_with_uint8
quantization_calibration str Specify which quantization calibration to use. Default is None (MinMax calibration is applied). Valid options: None (default), KLMinimization, KLMinimizationV2, Percentile, MSE and SQNR.
percentile_calibration_value float Specify the percentile value to be used with the Percentile calibration method. The specified float value must lie between 90 and 100, default: 100.
num_histogram_bins int Sets the num of histogram bins that will be used in profiling every node. Default value is 512
quantization_precision str Specify which quantization precision to use. Int8 (default) is the only supported precision for now.
quantization_precision_bias str Specify which quantization precision to use for bias. Valid options: Int8, Int32 (default)
enable_rowwise bool Enable rowwise quantization of FullyConnected and SparseLengthsSum ops.
enable_channelwise bool Enable channelwise quantization of Convolution op.
dump_profile str Perform quantization profiling for a given graph and dump the result to the file. Unlike qaic-exec, compilation is performed after the profile is dumped.
load_profile str Load quantization profile file and quantize the graph. The profile file to be loaded is the one which is dumped through option -dump-profile.
convert_to_quantize bool If the load_profile option is not provided, the input data is profiled and the model is run in quantized mode. Default is off. Also set the quantization_* options as required. Do not use this option along with dump_profile or load_profile.
load_embedding_tables str Load embedding tables from this zip file for DLRM and RecSys models.
dump_embedding_tables str Extract embedding tables from pytorch model and dump them in the zip file specified.
mdp_load_partition_config str Load config file for partitioning a graph across devices.
mdp_dump_partition_config str Dump config file for partitioning a graph across devices.
host_preproc bool Enable all pre-/post-processing on host
aic_preproc bool Disable all pre-/post-processing on host. Operations are performed on AI 100 instead.
aic_enable_depth_first bool Enables DFS with default memory size
aic_depth_first_mem int Sets DFS memory size. The number must be chosen from [8, 32]
stats_batchsize int This option is used to normalize performance statistics to be per inference
always_expand_onnx_functions bool This option forces the expansion of ONNX functions.
enable_debug bool Enables debug mode during model compilation
time_passes bool Enables printing of compile-time statistics
io_crc bool Enables CRC check for inputs and outputs of the network.
io_crc_stride int Specifies size of stride to calculate CRC in the stride section
sdp_cluster_sizes (list[int]) Enables single device partitioning and sets the cluster configuration
profiling_threads int This option is used to assign the number of threads to use for quantization profile generation
compile_threads int Sets the number of parallel threads used for compilation.
use_producer_dma bool Initiate NSP DMAs from the thread that produces data being transferred
aic_perf_warnings bool Print performance warning messages
aic_perf_metrics bool Print compiler performance metrics
aic_pmu_recipe str Enable the PMU selection based on built-in recipe: AxiRd, AxiWr, AxiRdWr, KernelUtil, HmxMacs
aic_pmu_events str Track events in NSP cores. Up to 8 events are supported
dynamic_shape_input (list[str]) Inform the compiler which inputs should be treated as having dynamic shape
multicast_weights bool Reduce DDR bandwidth by loading weights used on multiple-cores only once and multicasting to other cores.
ddr_stats bool Used to collect DDR traffic details at per core level.
combine_inputs bool When enabled combines inputs into fewer buffers for transfer to device.
combine_outputs bool When enabled combines outputs into a single buffer for transfer to host.
enable_metrics bool Set to True to collect performance metrics for inference runs on a session. (Cannot be used if enable_profiling is set to True.)
enable_profiling bool Set to True to profile the inferences and get performance metrics for inference runs on a session. (Cannot be used if enable_metrics is set to True.)

Returns

Session object.

Example

using options_path yaml file
sess = qaic.Session('/path/to/model', options_path = '/path/to/options.yaml') 
input_dict = {'input_name': input_data}
output = sess.run(input_dict)
sample contents of yaml file
aic_num_cores: 4
num_activations: 1
convert_to_fp16: true
onnx_define_symbol:
  batch: 1
output_dir: './resnet_qpc'
using keyword args
sess = qaic.Session('/path/to/model_qpc/*.bin', num_activations=4, set_size=10)
input_dict = {'input_name': input_data}
output = sess.run(input_dict)
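
using model_inputs for a pytorch model (a hedged sketch; per the keyword-argument table this option is mandatory for pytorch models, and the model path, input name, type string, and shape below are placeholders)
# Placeholder values for illustration; use your model's actual input metadata.
model_inputs = [{'input_name': 'input', 'input_type': 'float32', 'input_shape': [1, 3, 224, 224]}]
sess = qaic.Session('/path/to/model', model_inputs=model_inputs)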

API List (functions available on the session object)

The Session class has the following methods.

backend_options()

Returns

A dict of options that can be configured after creating session

Usage example
backend_options_dict = session.backend_options()
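
Since the returned value is a plain dict, it can be inspected directly, for example:

for option, value in session.backend_options().items():
    print(option, value)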

get_metrics()

Returns

A dictionary containing the following metrics:

- num_of_inferences (int): The number of inferences.
- min_latency (float): The minimum inference time.
- max_latency (float): The maximum inference time.
- P25 (float): The 25th percentile latency.
- P50 (float): The 50th percentile latency (median).
- P75 (float): The 75th percentile latency.
- P90 (float): The 90th percentile latency.
- P99 (float): The 99th percentile latency.
- P999 (float): The 99.9th percentile latency.
- total_inference_time (float): The sum of individual inference times.
- avg_latency (float): The average latency.
Usage example
metrics_dict = session.get_metrics()
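
A minimal end-to-end sketch, assuming the session was created with enable_metrics=True; the model path, input name, shape, and iteration count are placeholders:

import qaic
import numpy as np

sess = qaic.Session('/path/to/model/model.onnx', enable_metrics=True)
input_dict = {'input_name': np.random.rand(1, 3, 224, 224).astype(np.float32)}  # placeholder input

for _ in range(100):
    sess.run(input_dict)

metrics = sess.get_metrics()
print('average latency:', metrics['avg_latency'])
print('P99 latency:', metrics['P99'])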

model_input_shape_dict()

Returns

A dict with input_name as key and input_shape, input_type as values

Usage example
input_shape_dict = session.model_input_shape_dict() 

model_output_shape_dict()

Returns

A dict with output_name as key and output_shape, output_type as values

Usage example
output_shape_dict = session.model_output_shape_dict()
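
These two helpers make it easy to build a random input dictionary for a quick smoke test. The sketch below assumes each value unpacks into a shape and a numpy dtype; adapt it to the exact structure your session returns:

import numpy as np

input_dict = {}
for name, (shape, dtype) in session.model_input_shape_dict().items():  # assumed value layout
    input_dict[name] = np.random.rand(*shape).astype(dtype)
output = session.run(input_dict)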

print_metrics()

Returns

None

Usage example
session.print_metrics() 
Note

This method assumes that either the 'enable_profiling' or 'enable_metrics' attribute is set to True.

Sample Output:

Number of inferences utilized for calculation are 999
Minimum latency observed 0.0009578340000000001 s
Maximum latency observed 0.002209001 s
Average latency / inference time observed is 0.0012380756316316324 s
P25 / 25% of inferences observed latency less than 0.001095435 s
P50 / 50% of inferences observed latency less than 0.0012522870000000001 s
P75 / 75% of inferences observed latency less than 0.001299786 s
P90 / 90% of inferences observed latency less than 0.002209001 s
P99 / 99% of inferences observed latency less than 0.0016082370000000002 s
Sum of all the inference times 1.2368375560000007 s
Average latency / inference time observed is 0.0012380756316316324 s

print_profile_data(n)

Returns

None

Usage example
session.print_profile_data(n)   

Print profiling data for the first n iterations

Note

This function only works when 'enable_profiling' is set to True for the Session.

  • This method assumes that the 'enable_profiling' attribute is set to True, and the 'profiling_results' attribute contains the profiling data for each iteration.
  • The method prints the profiling data in a tabular format, including the file, line, function, number of calls, function time (seconds), and total time (seconds) for each function.

Sample Output:

|  File-Line-Function  | |  num calls  | |  func time  | |  tot time  |

('~', 0, "<method 'astype' of 'numpy.ndarray' objects>") 1 0.000149101 0.000149101

('~', 0, '<built-in method numpy.empty>') 1 2.38e-06 2.38e-06

('~', 0, '<built-in method numpy.frombuffer>') 1 4.22e-06 4.22e-06
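
A hedged sketch of the full profiling flow; the model path, input name, shape, and iteration counts are placeholders:

import qaic
import numpy as np

sess = qaic.Session('/path/to/model/model.onnx', enable_profiling=True)
input_dict = {'input_name': np.random.rand(1, 3, 224, 224).astype(np.float32)}  # placeholder input

for _ in range(10):
    sess.run(input_dict)

sess.print_profile_data(5)  # profiling data for the first 5 iterations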

reset()

Returns

None

Usage example
session.reset() 

Releases all the device resources acquired by the session
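
For example, once the session is no longer needed (input_dict as prepared earlier):

output = session.run(input_dict)  # last inference with this session
session.reset()                   # device resources are released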

setup()

Returns

None

Usage example
session.setup() 

Loads the network to the device.

The network is usually loaded during the first call to run. If setup() is called before that, the network will already be loaded when run is first called.
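
A short sketch of using setup() to front-load the network load so that the first run() does not pay that cost; the path and input_dict are placeholders:

sess = qaic.Session('/path/to/model/qpc.bin')
sess.setup()                   # network is loaded to the device here
output = sess.run(input_dict)  # first run no longer includes the load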

run(input_dict)

Returns

A dict with output_name and output_value of inference

Usage example
output = session.run(input_dict)    

input_dict should have the input_name as key, and each value should be a numpy array

run_benchmark()

Returns

inf_completed: total number of inferences run
inf_rate: inferences/sec of the model
inf_time: total time taken to run the inferences
batch_size: batch size used by the model

Usage example
inf_completed, inf_rate, inf_time, batch_size = session.run_benchmark() 

It accepts the following arguments:

num_inferences: number of inferences to run in benchmarking. Default 40
inf_time: duration in seconds for which inference is to be run. Default None
input_dict: input to be used in inference. Default random

Note

num_inferences and inf_time cannot be used together.

This API uses C++ benchmarking APIs and does not take Python overheads into account
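
For example, a hedged sketch limiting the benchmark to a fixed number of inferences (only one of num_inferences and inf_time should be passed):

inf_completed, inf_rate, inf_time, batch_size = sess.run_benchmark(num_inferences=1000)
print(f'{inf_completed} inferences at {inf_rate} inf/s with batch size {batch_size}')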

update_backend_options(**kwargs)

Returns

None

Usage example
session.update_backend_options(num_activations = 2)     

Updates the options specified in kwargs

For example:

num_activations, dev_id and set_size can be configured with this API.
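
A sketch that reconfigures an existing session; the specific values are illustrative:

session.update_backend_options(dev_id=1, set_size=20)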