End-to-End Examples - qaic-compile and qaic-runner (CV + LLM)¶
This guide provides end-to-end examples demonstrating qaic-compile and
qaic-runner using the following reference models:
CV: ResNet50-v1-7
LLM: Llama-3.2-1B-Instruct
Workflows¶
Note
--aic-batch-json-input is not generally required at compile time.
It becomes relevant at compile time only in specific workflows where the
compiler needs representative inputs (e.g., quantization / calibration).
CV (ResNet50-v1-7)¶
Download the ResNet50-v1-7 model¶
wget https://github.com/onnx/models/raw/main/validated/vision/classification/resnet/model/resnet50-v1-7.tar.gz
tar -xzf resnet50-v1-7.tar.gz
Compile with qaic-compile¶
/opt/qti-aic/exec/qaic-compile \
-m=./resnet50-v1-7/resnet50-v1-7.onnx \
-onnx-define-symbol=N,1 \
-convert-to-fp16 \
-aic-hw \
-aic-num-cores=1 \
-aic-binary-dir=./compiler_output
Note
The downloaded model does not include a definition for the ONNX symbol N.
As a result, the batch dimension N must be specified explicitly.
Create the aic batch JSON file¶
To create the --aic-batch-json-input file, you need input raw files and the
qaic-qpc validate command output to identify runtime input/output binding
names and expected buffer sizes for the compiled model. For more details on
qaic-qpc, refer to qaic-qpc.
Generate input raw image¶
Input raw files can be generated using img2raw.py located at:
/opt/qti-aic/scripts/qaic-model-configurator/
Download an image:
wget -O cat_285.png https://github.com/pytorch/glow/raw/master/tests/images/imagenet/cat_285.png
Generate raw input:
python3 /opt/qti-aic/scripts/qaic-model-configurator/img2raw.py \
-image-dir ./ \
-image-type png \
-height 224 \
-width 224 \
-batchsize 1 \
-reuse-single-file \
-output ./
This typically generates ./batch_size_1/img_0.raw.
Identify runtime binding names and expected buffer sizes¶
sudo /opt/qti-aic/tools/qaic-qpc validate --qpc ./compiler_output/programqpc.bin
Record the runtime binding names and expected buffer sizes. For example (ResNet50):
Input binding name: data
Output binding name: resnetv17_dense0_fwd
Data type: Float (FP32 -> 4 bytes). Refer to Supported Data Types for element sizes.
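The expected buffer sizes can be cross-checked arithmetically. A minimal sketch, assuming the usual ResNet50-v1-7 shapes (NCHW 1x3x224x224 FP32 input, 1x1000 FP32 output); verify both against the sizes reported by qaic-qpc validate:

```python
import numpy as np

# Hypothetical shapes for ResNet50-v1-7 at batch size 1; verify both
# against the buffer sizes reported by `qaic-qpc validate`.
input_shape = (1, 3, 224, 224)   # binding "data", FP32
output_shape = (1, 1000)         # binding "resnetv17_dense0_fwd", FP32
elem_size = 4                    # FP32 -> 4 bytes

input_bytes = int(np.prod(input_shape)) * elem_size
output_bytes = int(np.prod(output_shape)) * elem_size
print(input_bytes, output_bytes)  # 602112 4000
```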
Parameters¶
The following parameters must be specified for each input/output entry:
path: Relative path to the raw data file (input raw file for inputs; output destination for outputs).
io-direction: in for inputs, out for outputs.
elem-size: Element size in bytes, matching the binding data type. Refer to Supported Data Types for element sizes.
map-to: Runtime binding name to map this buffer to.
Example: resnet50_ios.json¶
{
"IO-files": [
[
{
"path": "./batch_size_1/img_0.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "data"
},
{
"path": "resnet50_output_0.raw",
"io-direction": "out",
"elem-size": 4,
"map-to": "resnetv17_dense0_fwd"
}
]
]
}
For the full JSON format usage, refer to --aic-batch-json-input JSON Format.
Execute the network binary¶
/opt/qti-aic/exec/qaic-runner -t compiler_output \
--aic-batch-json-input ./resnet50_ios.json \
--write-output-start-iter 0 \
--write-output-num-samples 1 \
--write-output-dir ./outputs \
-n 10 -a 1 -d 0
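The raw output written to ./outputs can then be decoded in Python. A sketch, assuming the output is a flat FP32 logits vector of 1000 ImageNet classes (the demo below uses synthetic logits; in practice point it at the file written by qaic-runner):

```python
import numpy as np

def top_k_classes(raw_path, k=5, num_classes=1000):
    """Decode a raw FP32 logits file and return the top-k class indices."""
    logits = np.fromfile(raw_path, dtype=np.float32)[:num_classes]
    return np.argsort(logits)[::-1][:k].tolist()

# Demo with synthetic logits; in practice point this at
# ./outputs/resnet50_output_0.raw.
rng = np.random.default_rng(0)
fake = rng.random(1000).astype(np.float32)
fake[285] = 10.0  # 285 is the ImageNet class of the sample cat image
fake.tofile("resnet50_output_0.raw")
print(top_k_classes("resnet50_output_0.raw"))
```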
LLM (Llama-3.2-1B-Instruct)¶
Model preparation¶
To compile an LLM using qaic-compile you need:
model.onnx
specializations.json
custom_io.yaml
These can be generated using either of the following:
Standard vLLM commands (pull/run) with the pre-built Docker container. For more details, refer to vLLM Inference Server.
The QEfficient libraries.
Reference model used here: meta-llama/Llama-3.2-1B-Instruct.
Compile with qaic-compile¶
Example compile command:
/opt/qti-aic/exec/qaic-compile \
-m=/workspace/qeff_cache/LlamaForCausalLM/LlamaForCausalLM-810a822c8d59f4f7/LlamaForCausalLM.onnx \
-aic-hw \
-aic-hw-version=ai100 \
-network-specialization-config=/workspace/qeff_cache/LlamaForCausalLM/LlamaForCausalLM-810a822c8d59f4f7/qpc-891fe35443591dac/specializations.json \
-retained-state \
-convert-to-fp16 \
-aic-num-cores=16 \
-custom-IO-list-file=/workspace/qeff_cache/LlamaForCausalLM/LlamaForCausalLM-810a822c8d59f4f7/qpc-891fe35443591dac/custom_io.yaml \
-compile-only \
-aic-binary-dir=qpc/Llama-11kcl-bs1-16c
This generates the network binary under qpc/Llama-11kcl-bs1-16c.
Create the aic batch JSON file¶
To create the --aic-batch-json-input file, you need input raw files and the
qaic-qpc validate output to identify runtime binding names and expected
buffer sizes for the compiled model. For more details on qaic-qpc, refer to
qaic-qpc.
Generate input raw files¶
Example (Python):
import numpy as np
# Generates: input_ids.raw, position_ids.raw, batch_index.raw
np.array([1], dtype=np.int32).tofile("input_ids.raw")
np.array([0], dtype=np.int32).tofile("position_ids.raw")
np.array([0], dtype=np.int32).tofile("batch_index.raw")
KV cache inputs (first token)¶
For the initial inference, create zero-initialized raw files:
import numpy as np
shape = (1, 8, 11264, 66)
size = np.prod(shape)
for i in range(16):
np.zeros(size, dtype=np.int8).tofile(f"past_key.{i}.raw")
np.zeros(size, dtype=np.int8).tofile(f"past_value.{i}.raw")
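The loop above writes a nontrivial amount of data to disk; a quick size check from the same shape:

```python
import numpy as np

# Size check for the zero-initialized cache files above: each tensor is
# 1*8*11264*66 int8 values (i.e. bytes), and there are 16 layers of
# keys plus 16 layers of values.
shape = (1, 8, 11264, 66)
per_tensor = int(np.prod(shape))
total = per_tensor * 16 * 2
print(per_tensor, total)  # 5947392 190316544
```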
Identify runtime binding names and expected buffer sizes¶
sudo /opt/qti-aic/tools/qaic-qpc validate --qpc /workspace/qpc/Llama-11kcl-bs1-16c/programqpc.bin
Parameters¶
The following parameters must be specified for each input/output entry:
path: Relative path to the raw data file.
io-direction: in for inputs, out for outputs.
elem-size: Element size in bytes, matching the binding data type. Refer to Supported Data Types for element sizes.
map-to: Runtime binding name to map this buffer to.
Example: llm_ios.json (single token)¶
Below is a minimal example for a single-token decode, including KV-cache handling.
{
"IO-files": [
[
{
"path": "input_ids.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "input_ids"
},
{
"path": "position_ids.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "position_ids"
},
{
"path": "batch_index.raw",
"io-direction": "in",
"elem-size": 4,
"map-to": "batch_index"
},
{
"path": "past_key.0.raw",
"io-direction": "in",
"elem-size": 1,
"map-to": "past_key.0"
},
{
"path": "past_value.0.raw",
"io-direction": "in",
"elem-size": 1,
"map-to": "past_value.0"
},
{
"path": "logits.raw",
"io-direction": "out",
"elem-size": 4,
"map-to": "logits"
},
{
"path": "past_key.0.next.raw",
"io-direction": "out",
"elem-size": 1,
"map-to": "past_key.0_RetainedState"
},
{
"path": "past_value.0.next.raw",
"io-direction": "out",
"elem-size": 1,
"map-to": "past_value.0_RetainedState"
}
]
]
}
Note
Update the file to include the full set of past key/value inputs and retained-state outputs:
Repeat past_key.N / past_value.N input entries for N = 1..15
Repeat past_key.N_RetainedState / past_value.N_RetainedState output entries for N = 1..15
For the full JSON format usage, refer to --aic-batch-json-input JSON Format.
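Rather than writing all 16 layers by hand, the JSON can be generated. A sketch that follows the binding names from the example above; confirm them against your own qaic-qpc validate output:

```python
import json

# Build llm_ios.json with all 16 KV-cache layers. Binding names follow
# the hand-written example; verify them with `qaic-qpc validate`.
entries = [
    {"path": "input_ids.raw", "io-direction": "in", "elem-size": 4, "map-to": "input_ids"},
    {"path": "position_ids.raw", "io-direction": "in", "elem-size": 4, "map-to": "position_ids"},
    {"path": "batch_index.raw", "io-direction": "in", "elem-size": 4, "map-to": "batch_index"},
]
for n in range(16):
    for kv in ("key", "value"):
        entries.append({"path": f"past_{kv}.{n}.raw", "io-direction": "in",
                        "elem-size": 1, "map-to": f"past_{kv}.{n}"})
entries.append({"path": "logits.raw", "io-direction": "out", "elem-size": 4, "map-to": "logits"})
for n in range(16):
    for kv in ("key", "value"):
        entries.append({"path": f"past_{kv}.{n}.next.raw", "io-direction": "out",
                        "elem-size": 1, "map-to": f"past_{kv}.{n}_RetainedState"})

with open("llm_ios.json", "w") as f:
    json.dump({"IO-files": [entries]}, f, indent=2)
```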
Multi-token decoding (multiple IO sets)¶
To decode multiple tokens:
Add multiple IO sets inside "IO-files"
Each IO set uses the previous step's _RetainedState outputs as past_key.* / past_value.* inputs
Update input_ids and position_ids per token
Execute with qaic-runner¶
/opt/qti-aic/exec/qaic-runner \
-t /workspace/qpc/Llama-11kcl-bs1-16c/ \
--aic-batch-json-input ./llm_ios.json \
--write-output-start-iter 0 \
--write-output-num-samples 1 \
--write-output-dir ./outputs \
-S 1 -n 10 -d 0
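The logits written to ./outputs can be greedy-decoded into the next token id. A sketch, assuming FP32 logits and a vocabulary size of 128256 (the Llama-3.2 vocabulary; adjust to the buffer size reported by qaic-qpc validate). The demo uses synthetic logits; in practice point it at the file written by qaic-runner:

```python
import numpy as np

def greedy_next_token(raw_path, vocab_size=128256):
    """Greedy-decode one token id from an FP32 logits raw file."""
    logits = np.fromfile(raw_path, dtype=np.float32)[:vocab_size]
    return int(np.argmax(logits))

# Demo with synthetic logits; in practice point this at ./outputs/logits.raw.
fake = np.zeros(128256, dtype=np.float32)
fake[42] = 1.0  # hypothetical winning token id, for illustration only
fake.tofile("logits_demo.raw")
print(greedy_next_token("logits_demo.raw"))  # -> 42
```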