qaic package

qaic - the qaic package provides a way to run inference on the Qualcomm Cloud AI 100 card.

Description

A user can create a session object in one of two ways:

  1. with a .onnx file, or

  2. with a precompiled QPC as model_path, in case the user already has a compiled QPC. The full path to qpc.bin should be passed when using a precompiled binary.

Info

QPC : Qualcomm Program Container

Example of how to run an inference

Option 1: Compile the .onnx file to generate a QPC and set up a session for inference

import qaic
import numpy as np
sess = qaic.Session('/path/to/model/model.onnx')
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32) # example input; shape and dtype depend on your model
input_dict = {'input_name': input_data}
output = sess.run(input_dict)

Option 2: Use a pre-generated QPC and set up a session for inference

import qaic
import numpy as np
sess = qaic.Session('/path/to/model/qpc.bin') # Session uses the compiled QPC file to set up the session
input_dict = {'input_name': input_data}
output = sess.run(input_dict)
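
The two options combine naturally: compile once with the output_dir keyword argument (listed under Keyword Arguments below) so the QPC is stored on disk, then reuse the generated binary for later sessions. A minimal sketch, assuming the compiled binary is written as qpc.bin inside the given directory:

import qaic
# Compile once and keep the generated QPC on disk.
sess = qaic.Session('/path/to/model/model.onnx', output_dir='./my_qpc')

# Later: skip compilation by loading the precompiled binary
# (assumes the QPC is written as qpc.bin under output_dir).
sess2 = qaic.Session('./my_qpc/qpc.bin')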

Example for benchmarking

import qaic
sess = qaic.Session(model_path='/path/to/model', backend='aic', options_path = '/path/to/yaml') # model_path can be either onnx or precompiled qpc
inf_completed, inf_rate, inf_time, batch_size = sess.run_benchmark()

Limitations

  • Currently only the QAic backend is supported. Support for the QNN backend is planned for future releases.

  • The APIs are compatible only with Python 3.8.

  • These APIs are supported only on x86-64 platforms.

class Session

Session is the entry point of these APIs. Session is a factory method that the user calls to create a session instance. A model is compiled by default when a session is created.

Session(model_path, **kwargs)

Session creates a session object based on the model or QPC provided.

Parameters

model_path (str)
    Path to the .onnx file or .bin file, i.e. the compiled model QPC.

**kwargs
    Refer to the keyword arguments listed below.

Keyword Arguments

dev_id (int)
    Device on which to run the inference. Default is 0.

num_activations (int)
    Number of instances of the network to be activated.

set_size (int)
    Number of ExecObj to be created.

mos (int)
    Effort level to reduce the on-chip memory.

ols (int)
    Factor to increase splitting of the network for parallelism.

aic_num_cores (int)
    Number of AIC cores to be used for inference.

convert_to_fp16 (bool)
    Run all floating-point operations in FP16.

onnx_define_symbol (list[tuple(str, int)])
    Define an ONNX symbol with its value.

output_dir (str)
    Stores model binaries at the directory location provided.

output_node_names (list[str])
    Output node names, in the same order as present in the model file. This option is mandatory for TF models.

model_inputs_list (dict)
    Provide the input node name with its data type and shape. The dict must contain the keys 'input_name', 'input_type', and 'input_shape'. This is mandatory for PyTorch models.

allocator_dealloc_delay (int)
    Option to increase the lifetime of buffers to reduce false dependencies.

size_split_granularity (int)
    Option to specify a maximum tile size target for operations that may be too large to execute out of fast memory. Tile size in KiB, between 512 and 2048.

vtcm_working_set_limit_ratio (float)
    Option to specify the maximum amount of fast memory (VTCM), relative to DDR, that any single instruction is allowed to use; a value of 1.0 allows use of all available fast memory.

execute_nodes_in_fp16 (list[str])
    Run all instances of the operators in this list in FP16.

node_precision_info_file (str)
    Load a precision file that contains the first output name of the operator instances required to be executed in FP16 or FP32.

keep_original_precision_for_nodes (list[str])
    Run all instances of the operators in this list with their original precision during generation of the quantized precision model, even if the operator is supported in Int8 precision.

custom_io_list_file (str)
    Custom I/O config file in YAML format containing layout, precision, scale, and offset for each input and output of the model.

dump_custom_io_config_template_file (str)
    Dumps the YAML template for Custom I/O configuration.

external_quantization_file (str)
    Load the externally generated quantization profile.

quantization_schema_activations (str)
    Specify which quantization schema to use for activations. Valid options: asymmetric, symmetric, symmetric_with_uint8 (default), symmetric_with_power2_scale.

quantization_schema_constants (str)
    Specify which quantization schema to use for constants. Valid options: asymmetric, symmetric, symmetric_with_uint8.

quantization_calibration (str)
    Specify which quantization calibration to use. Valid options: None (default; MinMax calibration is applied), KLMinimization, KLMinimizationV2, Percentile, MSE, and SQNR.

percentile_calibration_value (float)
    Specify the percentile value to be used with the Percentile calibration method. The specified float value must lie between 90 and 100. Default: 100.

num_histogram_bins (int)
    Sets the number of histogram bins used in profiling every node. Default is 512.

quantization_precision (str)
    Specify which quantization precision to use. Int8 (default) is the only supported precision for now.

quantization_precision_bias (str)
    Specify which quantization precision to use for the bias. Valid options: Int8, Int32 (default).

enable_rowwise (bool)
    Enable rowwise quantization of FullyConnected and SparseLengthsSum ops.

enable_channelwise (bool)
    Enable channelwise quantization of the Convolution op.

dump_profile (str)
    Perform quantization profiling for a given graph and dump the result to the file. Unlike qaic-exec, compilation proceeds after the profile is dumped.

load_profile (str)
    Load a quantization profile file and quantize the graph. The profile file to load is the one dumped through the dump_profile option.

convert_to_quantize (bool)
    If the load_profile option is not provided, the input data is profiled and run in quantized mode. Default is off. Also set the quantization_* options as required. Do not use this option along with dump_profile or load_profile.

load_embedding_tables (str)
    Load embedding tables from this zip file for DLRM and RecSys models.

dump_embedding_tables (str)
    Extract embedding tables from the PyTorch model and dump them into the specified zip file.

mdp_load_partition_config (str)
    Load a config file for partitioning a graph across devices.

mdp_dump_partition_config (str)
    Dump a config file for partitioning a graph across devices.

host_preproc (bool)
    Enable all pre-/post-processing on the host.

aic_preproc (bool)
    Disable all pre-/post-processing on the host. Operations are performed on the AI 100 instead.

aic_enable_depth_first (bool)
    Enables DFS with the default memory size.

aic_depth_first_mem (int)
    Sets the DFS memory size. The number must be chosen from [8, 32].

stats_batchsize (int)
    Used to normalize performance statistics to be per inference.

always_expand_onnx_functions (bool)
    Forces the expansion of ONNX functions.

enable_debug (bool)
    Enables debug mode during model compilation.

time_passes (bool)
    Enables printing of compile-time statistics.

io_crc (bool)
    Enables CRC check for inputs and outputs of the network.

io_crc_stride (int)
    Specifies the size of the stride used to calculate the CRC in each stride section.

sdp_cluster_sizes (list[int])
    Enables single-device partitioning and sets the cluster configuration.

profiling_threads (int)
    Assigns the number of threads to use for quantization profile generation.

compile_threads (int)
    Sets the number of parallel threads used for compilation.

use_producer_dma (bool)
    Initiate NSP DMAs from the thread that produces the data being transferred.

aic_perf_warnings (bool)
    Print performance warning messages.

aic_perf_metrics (bool)
    Print compiler performance metrics.

aic_pmu_recipe (str)
    Enable PMU selection based on a built-in recipe: AxiRd, AxiWr, AxiRdWr, KernelUtil, HmxMacs.

aic_pmu_events (str)
    Track events in NSP cores. Up to 8 events are supported.

dynamic_shape_input (list[str])
    Inform the compiler which inputs should be treated as having dynamic shape.

multicast_weights (bool)
    Reduce DDR bandwidth by loading weights used on multiple cores only once and multicasting them to the other cores.

ddr_stats (bool)
    Collect DDR traffic details at a per-core level.

combine_inputs (bool)
    When enabled, combines inputs into fewer buffers for transfer to the device.

combine_outputs (bool)
    When enabled, combines outputs into a single buffer for transfer to the host.

enable_metrics (bool)
    Set to True to collect performance metrics for inference runs on a session. (Cannot be used if enable_profiling is set to True.)

enable_profiling (bool)
    Set to True to profile the inferences and get performance metrics for inference runs on a session. (Cannot be used if enable_metrics is set to True.)

Returns

Session object.

Example

Using an options_path YAML file

sess = qaic.Session('/path/to/model', options_path = '/path/to/options.yaml')
input_dict = {'input_name': input_data}
output = sess.run(input_dict)

Sample contents of the YAML file

aic_num_cores: 4
num_activations: 1
convert_to_fp16: true
onnx_define_symbol:
  batch: 1
output_dir: './resnet_qpc'

Using keyword arguments

sess = qaic.Session('/path/to/model_qpc/*.bin', num_activations=4, set_size=10)
input_dict = {'input_name': input_data}
output = sess.run(input_dict)

API List (methods of the Session object)

The Session class has the following methods.

backend_options()

Returns

A dict of options that can be configured after creating a session.

Usage example

backend_options_dict = session.backend_options()
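
For instance, a quick way to inspect what is currently configurable (a sketch; the exact option names depend on the installed release):

backend_options_dict = session.backend_options()
for option, value in backend_options_dict.items():
    # print each configurable option with its current value
    print(option, '=', value)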

get_metrics()

Returns

A dictionary containing the following metrics:

- num_of_inferences (int): The number of inferences.
- min_latency (float): The minimum inference time.
- max_latency (float): The maximum inference time.
- P25 (float): The 25th percentile latency.
- P50 (float): The 50th percentile latency (median).
- P75 (float): The 75th percentile latency.
- P90 (float): The 90th percentile latency.
- P99 (float): The 99th percentile latency.
- P999 (float): The 99.9th percentile latency.
- total_inference_time (float): The sum of individual inference times.
- avg_latency (float): The average latency.

Usage example

metrics_dict = session.get_metrics()
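
As a sketch, the keys listed above can be read directly; latencies are in seconds, matching the print_metrics() sample output below:

metrics = session.get_metrics()
print(f"{metrics['num_of_inferences']} inferences, "
      f"avg latency {metrics['avg_latency'] * 1e3:.3f} ms, "
      f"P99 {metrics['P99'] * 1e3:.3f} ms")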

model_input_shape_dict()

Returns

A dict with input_name as key and input_shape, input_type as values

Usage example

input_shape_dict = session.model_input_shape_dict()
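
One use of this dict is generating random inputs for a quick smoke test. The exact value layout is not specified above, so this sketch assumes each entry maps input_name to an (input_shape, input_type) pair, with input_type being a numpy dtype:

import numpy as np

input_dict = {}
for name, (shape, dtype) in session.model_input_shape_dict().items():
    # random data shaped like the model input (assumes (shape, dtype) values)
    input_dict[name] = np.random.rand(*shape).astype(dtype)
output = session.run(input_dict)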

model_output_shape_dict()

Returns

A dict with output_name as key and output_shape, output_type as values

Usage example

output_shape_dict = session.model_output_shape_dict()
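
Under the same (shape, dtype) assumption as the input sketch above, this dict can be used to check the results of run():

for name, (shape, dtype) in session.model_output_shape_dict().items():
    # verify each output matches its declared shape
    assert output[name].shape == tuple(shape)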

print_metrics()

Returns

None

Usage example

session.print_metrics()

Info

This method assumes that either the ‘enable_profiling’ or ‘enable_metrics’ attribute is set to True.

Sample Output:

Number of inferences utilized for calculation are 999
Minimum latency observed 0.0009578340000000001 s
Maximum latency observed 0.002209001 s
Average latency / inference time observed is 0.0012380756316316324 s
P25 / 25% of inferences observed latency less than 0.001095435 s
P50 / 50% of inferences observed latency less than 0.0012522870000000001 s
P75 / 75% of inferences observed latency less than 0.001299786 s
P90 / 90% of inferences observed latency less than 0.002209001 s
P99 / 99% of inferences observed latency less than 0.0016082370000000002 s
Sum of all the inference times 1.2368375560000007 s
Average latency / inference time observed is 0.0012380756316316324 s

print_profile_data(n)

Returns

None

Usage example

session.print_profile_data(n)

Prints profiling data for the first n iterations.

Info

This function only works when ‘enable_profiling’ is set to True for the Session.

  • This method assumes that the ‘enable_profiling’ attribute is set to True, and the ‘profiling_results’ attribute contains the profiling data for each iteration.

  • The method prints the profiling data in a tabular format, including the file, line, function, number of calls, function time (seconds), and total time (seconds) for each function.

Sample Output:

|  File-Line-Function  |  num calls  |  func time  |  tot time  |

('~', 0, "<method 'astype' of 'numpy.ndarray' objects>")  1  0.000149101  0.000149101
('~', 0, '<built-in method numpy.empty>')  1  2.38e-06  2.38e-06
('~', 0, '<built-in method numpy.frombuffer>')  1  4.22e-06  4.22e-06

reset()

Returns

None

Usage example

session.reset()

Releases all the device resources acquired by the session.

setup()

Returns

None

Usage example

session.setup()

Loads the network onto the device.

The network is normally loaded during the first call of run(). If setup() is called before that, the network will already be loaded when run() is first called.
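
For example, calling setup() ahead of time moves the one-time load cost out of the first inference (a minimal sketch; input_dict is built as in the other examples):

import time

session.setup()                        # network is loaded here
start = time.perf_counter()
output = session.run(input_dict)       # first run no longer pays the load cost
print(f'first-run latency: {time.perf_counter() - start:.6f} s')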

run(input_dict)

Returns

A dict with output_name as key and the output_value of the inference as value.

Usage example

output = session.run(input_dict)

input_dict should have input_name as the key and a numpy array as the value.

run_benchmark()

Returns

inf_completed: total number of inferences run
inf_rate: inferences/sec of the model
inf_time: total time taken to run the inferences
batch_size: batch size used by the model

Usage example

inf_completed, inf_rate, inf_time, batch_size = session.run_benchmark()

It accepts the following args:

num_inferences: number of inferences to run in benchmarking. Default: 40
inf_time: duration in seconds for which inference is to be run. Default: None
input_dict: input to be used in inference. Default: random

Note

num_inferences and inf_time cannot be used together.

This API uses C++ benchmarking APIs and does not take Python overheads into account.
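
For instance, benchmarking with a fixed inference count and a real input (num_inferences and input_dict as listed above):

input_dict = {'input_name': input_data}  # same numpy input as in the run() examples
inf_completed, inf_rate, inf_time, batch_size = session.run_benchmark(
    num_inferences=100, input_dict=input_dict)
print(f"{inf_completed} inferences in {inf_time:.3f} s -> {inf_rate:.1f} inf/s "
      f"(batch size {batch_size})")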

update_backend_options(**kwargs)

Returns

None

Usage example

session.update_backend_options(num_activations = 2)

Updates the options specified in kwargs.

For example:

num_activations, dev_id, and set_size can be configured with this API.