End-to-End Examples - qaic-compile and qaic-runner (CV + LLM)¶
This guide provides end-to-end examples demonstrating qaic-compile and
qaic-runner using the following reference models:
CV: ResNet50-v1-7
LLM: Llama-3.2-1B-Instruct
Workflows¶
Note
--aic-batch-json-input is not generally required at compile time.
It becomes relevant at compile time only in specific workflows where the
compiler needs representative inputs (e.g., quantization / calibration).
CV (ResNet50-v1-7)¶
Download the ResNet50-v1-7 model¶
wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.tar.gz
tar -xzf resnet50-v1-7.tar.gz
Compile with qaic-compile¶
/opt/qti-aic/exec/qaic-compile \
-m=./resnet50-v1-7/resnet50-v1-7.onnx \
-onnx-define-symbol=N,1 \
-convert-to-fp16 \
-aic-hw \
-aic-num-cores=1 \
-aic-binary-dir=./compiler_output
Note
The downloaded model does not include a definition for the ONNX symbol N.
As a result, the batch dimension N must be specified explicitly.
Create the aic batch JSON file¶
To create the --aic-batch-json-input file, you need input raw files and the
qaic-qpc validate command output to identify runtime input/output binding
names and expected buffer sizes for the compiled model. For more details on
qaic-qpc, refer to qaic-qpc.
Generate input raw image¶
Input raw files can be generated using img2raw.py located at:
/opt/qti-aic/scripts/qaic-model-configurator/
Download an image:
wget -O cat_285.png https://github.com/pytorch/glow/raw/master/tests/images/imagenet/cat_285.png
Generate raw input:
python3 /opt/qti-aic/scripts/qaic-model-configurator/img2raw.py \
-image-dir ./ \
-image-type png \
-height 224 \
-width 224 \
-batchsize 1 \
-reuse-single-file \
-output ./
This typically generates ./batch_size_1/img_0.raw.
Identify runtime binding names and expected buffer sizes¶
sudo /opt/qti-aic/tools/qaic-qpc validate --qpc ./compiler_output/programqpc.bin
Record the runtime binding names and expected buffer sizes. For example (ResNet50):
Input binding name: data
Output binding name: resnetv17_dense0_fwd
Data type: Float (FP32 -> 4 bytes). Refer to Supported Data Types for element sizes.
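The expected buffer sizes can be cross-checked arithmetically. A minimal sketch, assuming the usual ResNet50-v1-7 shapes (NCHW 1x3x224x224 FP32 input, 1x1000 FP32 output); verify both against the sizes reported by qaic-qpc validate:

```python
import numpy as np

# Hypothetical shapes for ResNet50-v1-7 at batch size 1; verify both
# against the buffer sizes reported by `qaic-qpc validate`.
input_shape = (1, 3, 224, 224)   # binding "data", FP32
output_shape = (1, 1000)         # binding "resnetv17_dense0_fwd", FP32
elem_size = 4                    # FP32 -> 4 bytes

input_bytes = int(np.prod(input_shape)) * elem_size
output_bytes = int(np.prod(output_shape)) * elem_size
print(input_bytes, output_bytes)  # 602112 4000
```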
Parameters¶
The following parameters must be specified for each input/output entry:
path: Relative path to the raw data file (input raw file for inputs; output destination for outputs).
io-direction: in for inputs, out for outputs.
elem-size: Element size in bytes, matching the binding data type. Refer to Supported Data Types for element sizes.
map-to: Runtime binding name to map this buffer to.
Example: resnet50_ios.json¶
{
"IO-files": [
[
{
"path": "./batch_size_1/img_0.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "data"
},
{
"path": "resnet50_output_0.raw",
"io-direction": "out",
"elem-size": 4,
"map-to": "resnetv17_dense0_fwd"
}
]
]
}
For the full JSON format usage, refer to --aic-batch-json-input JSON Format.
Execute the network binary¶
/opt/qti-aic/exec/qaic-runner -t compiler_output \
--aic-batch-json-input ./resnet50_ios.json \
--write-output-start-iter 0 \
--write-output-num-samples 1 \
--write-output-dir ./outputs \
-n 10 -a 1 -d 0
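The raw output written to ./outputs can then be decoded in Python. A sketch, assuming the output is a flat FP32 logits vector of 1000 ImageNet classes (the demo below uses synthetic logits; in practice point it at the file written by qaic-runner):

```python
import numpy as np

def top_k_classes(raw_path, k=5, num_classes=1000):
    """Decode a raw FP32 logits file and return the top-k class indices."""
    logits = np.fromfile(raw_path, dtype=np.float32)[:num_classes]
    return np.argsort(logits)[::-1][:k].tolist()

# Demo with synthetic logits; in practice point this at
# ./outputs/resnet50_output_0.raw.
rng = np.random.default_rng(0)
fake = rng.random(1000).astype(np.float32)
fake[285] = 10.0  # 285 is the ImageNet class of the sample cat image
fake.tofile("resnet50_output_0.raw")
print(top_k_classes("resnet50_output_0.raw"))
```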
LLM (Llama-3.2-1B-Instruct)¶
Model preparation¶
To compile an LLM using qaic-compile you need:
model.onnx
specializations.json
custom_io.yaml
These can be generated using either of the following:
Standard vLLM commands (pull/run) with the pre-built Docker container. For more details, refer to vLLM Inference Server.
The QEfficient libraries.
Reference model used here: meta-llama/Llama-3.2-1B-Instruct.
Compile with qaic-compile¶
Example compile command:
/opt/qti-aic/exec/qaic-compile \
-m=/workspace/qeff_cache/LlamaForCausalLM/LlamaForCausalLM-810a822c8d59f4f7/LlamaForCausalLM.onnx \
-aic-hw \
-aic-hw-version=ai100 \
-network-specialization-config=/workspace/qeff_cache/LlamaForCausalLM/LlamaForCausalLM-810a822c8d59f4f7/qpc-891fe35443591dac/specializations.json \
-retained-state \
-convert-to-fp16 \
-aic-num-cores=16 \
-custom-IO-list-file=/workspace/qeff_cache/LlamaForCausalLM/LlamaForCausalLM-810a822c8d59f4f7/qpc-891fe35443591dac/custom_io.yaml \
-compile-only \
-aic-binary-dir=qpc/Llama-11kcl-bs1-16c
This generates the network binary under qpc/Llama-11kcl-bs1-16c.
Create the aic batch JSON file¶
To create the --aic-batch-json-input file, you need input raw files and the
qaic-qpc validate output to identify runtime binding names and expected
buffer sizes for the compiled model. For more details on qaic-qpc, refer to
qaic-qpc.
Generate input raw files¶
Example (Python):
import numpy as np
# Generates: input_ids.raw, position_ids.raw, batch_index.raw
np.array([1], dtype=np.int32).tofile("input_ids.raw")
np.array([0], dtype=np.int32).tofile("position_ids.raw")
np.array([0], dtype=np.int32).tofile("batch_index.raw")
KV cache inputs (first token)¶
For the initial inference, create zero-initialized raw files:
import numpy as np
shape = (1, 8, 11264, 66)
size = np.prod(shape)
for i in range(16):
np.zeros(size, dtype=np.int8).tofile(f"past_key.{i}.raw")
np.zeros(size, dtype=np.int8).tofile(f"past_value.{i}.raw")
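The loop above writes a nontrivial amount of data to disk; a quick size check from the same shape:

```python
import numpy as np

# Size check for the zero-initialized cache files above: each tensor is
# 1*8*11264*66 int8 values (i.e. bytes), and there are 16 layers of
# keys plus 16 layers of values.
shape = (1, 8, 11264, 66)
per_tensor = int(np.prod(shape))
total = per_tensor * 16 * 2
print(per_tensor, total)  # 5947392 190316544
```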
Identify runtime binding names and expected buffer sizes¶
sudo /opt/qti-aic/tools/qaic-qpc validate --qpc /workspace/qpc/Llama-11kcl-bs1-16c/programqpc.bin
Parameters¶
The following parameters must be specified for each input/output entry:
path: Relative path to the raw data file.
io-direction: in for inputs, out for outputs.
elem-size: Element size in bytes, matching the binding data type. Refer to Supported Data Types for element sizes.
map-to: Runtime binding name to map this buffer to.
Example: llm_ios.json (single token)¶
Below is a minimal example for a single-token decode, including KV-cache handling.
{
"IO-files": [
[
{
"path": "input_ids.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "input_ids"
},
{
"path": "position_ids.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "position_ids"
},
{
"path": "batch_index.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "batch_index"
},
{
"path": "past_key.0.raw",
"io-direction": "in",
"elem-size": 1,
"map-to": "past_key.0"
},
{
"path": "past_value.0.raw",
"io-direction": "in",
"elem-size": 1,
"map-to": "past_value.0"
},
{
"path": "logits.raw",
"io-direction": "out",
"elem-size": 4,
"map-to": "logits"
},
{
"path": "past_key.0.next.raw",
"io-direction": "out",
"elem-size": 1,
"map-to": "past_key.0_RetainedState"
},
{
"path": "past_value.0.next.raw",
"io-direction": "out",
"elem-size": 1,
"map-to": "past_value.0_RetainedState"
}
]
]
}
Note
Update the file to include the full set of past key/value inputs and retained-state outputs:
Repeat past_key.N / past_value.N input entries for N = 1..15
Repeat past_key.N_RetainedState / past_value.N_RetainedState output entries for N = 1..15
For the full JSON format usage, refer to --aic-batch-json-input JSON Format.
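Rather than writing all 16 layers by hand, the JSON can be generated. A sketch that follows the binding names from the example above; confirm them against your own qaic-qpc validate output:

```python
import json

# Build llm_ios.json with all 16 KV-cache layers. Binding names follow
# the hand-written example; verify them with `qaic-qpc validate`.
entries = [
    {"path": "input_ids.raw", "io-direction": "in", "elem-size": 4, "map-to": "input_ids"},
    {"path": "position_ids.raw", "io-direction": "in", "elem-size": 4, "map-to": "position_ids"},
    {"path": "batch_index.raw", "io-direction": "in", "elem-size": 4, "map-to": "batch_index"},
]
for n in range(16):
    for kv in ("key", "value"):
        entries.append({"path": f"past_{kv}.{n}.raw", "io-direction": "in",
                        "elem-size": 1, "map-to": f"past_{kv}.{n}"})
entries.append({"path": "logits.raw", "io-direction": "out", "elem-size": 4, "map-to": "logits"})
for n in range(16):
    for kv in ("key", "value"):
        entries.append({"path": f"past_{kv}.{n}.next.raw", "io-direction": "out",
                        "elem-size": 1, "map-to": f"past_{kv}.{n}_RetainedState"})

with open("llm_ios.json", "w") as f:
    json.dump({"IO-files": [entries]}, f, indent=2)
```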
Multi-token decoding (multiple IO sets)¶
To decode multiple tokens:
Add multiple IO sets inside "IO-files"
Each IO set uses the previous step's _RetainedState outputs as past_key.* / past_value.* inputs
Update input_ids and position_ids per token
Execute with qaic-runner¶
/opt/qti-aic/exec/qaic-runner \
-t /workspace/qpc/Llama-11kcl-bs1-16c/ \
--aic-batch-json-input ./llm_ios.json \
--write-output-start-iter 0 \
--write-output-num-samples 1 \
--write-output-dir ./outputs \
-S 1 -n 10 -d 0
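The logits written to ./outputs can be greedy-decoded into the next token id. A sketch, assuming FP32 logits and a vocabulary size of 128256 (the Llama-3.2 vocabulary; adjust to the buffer size reported by qaic-qpc validate). The demo uses synthetic logits; in practice point it at the file written by qaic-runner:

```python
import numpy as np

def greedy_next_token(raw_path, vocab_size=128256):
    """Greedy-decode one token id from an FP32 logits raw file."""
    logits = np.fromfile(raw_path, dtype=np.float32)[:vocab_size]
    return int(np.argmax(logits))

# Demo with synthetic logits; in practice point this at ./outputs/logits.raw.
fake = np.zeros(128256, dtype=np.float32)
fake[42] = 1.0  # hypothetical winning token id, for illustration only
fake.tofile("logits_demo.raw")
print(greedy_next_token("logits_demo.raw"))  # -> 42
```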