Tutorial

Example of how to compile and execute a non-LLM model

Below is an example of compiling and executing a non-LLM model, in this case the BERT-Large model, with fp16 precision

  • Convert the input network BERT_MLCommons_Flexible_BS_SL.onnx with:

    • sequence length as 128

    • batch size as 4

    • bitwidth as 16 for float tensors

    • float bias bitwidth as 32

    • the output written to model.dlc

qairt-converter --input_network BERT_MLCommons_Flexible_BS_SL.onnx --output_path model.dlc --onnx_define_symbol batch_size 4 --onnx_define_symbol seg_length 128 --onnx_skip_simplification --float_bias_bitwidth 32 --preserve_io_datatype --onnx_defer_loading --float_bitwidth 16
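
For sweeping batch sizes or sequence lengths it can be convenient to drive the converter from a script. Below is a minimal sketch that wraps the exact command above using Python's subprocess module; it assumes qairt-converter is available on the PATH and uses only the flags shown in this tutorial.

    convert_bert.py (hypothetical helper)
    # Illustrative wrapper around the qairt-converter command shown above.
    # Flags mirror the tutorial exactly; qairt-converter is assumed to be on PATH.
    import subprocess

    def convert_bert(onnx_path="BERT_MLCommons_Flexible_BS_SL.onnx",
                     dlc_path="model.dlc", batch_size=4, seq_len=128):
        cmd = [
            "qairt-converter",
            "--input_network", onnx_path,
            "--output_path", dlc_path,
            "--onnx_define_symbol", "batch_size", str(batch_size),
            "--onnx_define_symbol", "seg_length", str(seq_len),
            "--onnx_skip_simplification",
            "--float_bias_bitwidth", "32",
            "--preserve_io_datatype",
            "--onnx_defer_loading",
            "--float_bitwidth", "16",
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        convert_bert()
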
  • Create a context binary from the model.dlc produced in the converter stage

    • with qnnGraphDLC.bin as the output

qnn-context-binary-generator --binary_file qnnGraphDLC --model libQnnModelDlc.so --backend libQnnAic.so --output_dir OUTPUT_PATH --config_file qnn_config.json --dlc_path model.dlc --log_level debug
  • qnn_config.json contents:

    qnn_config.json
    {
      "backend_extensions": {
        "config_file_path": "qnn_aic_map.json",
        "shared_library_path": "libQnnAicNetRunExtensions.so"
      }
    }
    
  • qnn_aic_map.json contents:

    qnn_aic_map.json
    {
      "compiler_VTCM_working_set_limit_ratio": 1.0,
      "compiler_hardware_version": "2.0",
      "compiler_num_of_cores": 2,
      "compiler_perfWarnings": true,
      "compiler_printPerfMetrics": false,
      "compiler_stat_level": 40,
      "compiler_stats_batch_size": 4,
      "graph_names": [
          "model"
      ]
    }
    
  • Execute the qnnGraphDLC.bin generated in the context-binary stage to run inference on the Cloud AI backend (a sketch for reading the output tensors follows the runner config files below)

qnn-net-run --backend libQnnAic.so --input_list qnn_list.txt --log_level error --profiling_level basic --retrieve_context qnnGraphDLC.bin --config_file qnn_net_runner_config.json --use_native_input_files
  • qnn_list.txt contents (a sketch for generating these raw input files follows the listing):

    qnn_list.txt
    input_ids:=input_ids.raw
    input_mask:=input_mask.raw
    segment_ids:=segment_ids.raw
    
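The .raw files referenced in qnn_list.txt are flat binary dumps of the input tensors. Below is a minimal sketch that writes dummy inputs; it assumes int64 tensors of shape (4, 128), matching the batch size and sequence length used at conversion and the datatypes preserved from the ONNX model. Verify the exact dtypes and shapes against your converted DLC before running.

    generate_bert_inputs.py (hypothetical helper)
    # Writes dummy BERT inputs as flat binary .raw files for qnn-net-run.
    # Assumes int64 tensors of shape (batch_size=4, seq_len=128); adjust to your model.
    import numpy as np

    batch_size, seq_len = 4, 128

    inputs = {
        "input_ids.raw":   np.random.randint(0, 30522, (batch_size, seq_len), dtype=np.int64),  # 30522 = BERT vocab size (illustrative)
        "input_mask.raw":  np.ones((batch_size, seq_len), dtype=np.int64),
        "segment_ids.raw": np.zeros((batch_size, seq_len), dtype=np.int64),
    }

    for name, tensor in inputs.items():
        tensor.tofile(name)  # referenced by name in qnn_list.txt
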
  • qnn_net_runner_config.json contents:

    qnn_net_runner_config.json
    {
      "backend_extensions": {
        "config_file_path": "qnn_net_runner_backend_options.json",
        "shared_library_path": "libQnnAicNetRunExtensions.so"
      }
    }
    
  • qnn_net_runner_backend_options.json contents:

    qnn_net_runner_backend_options.json
    {
     "runtime_num_activations": 7,
     "runtime_threads_per_queue": 4,
     "runtime_device_ids": [5]
    }
    
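After qnn-net-run completes, output tensors are typically written as flat .raw files under the run's output directory (by default ./output/Result_<n>/). Below is a minimal sketch for reading one such output; the directory layout, tensor name, dtype, and shape used here are assumptions, so confirm them against your run and model.

    read_output.py (hypothetical helper)
    # Reads one qnn-net-run output tensor back into numpy.
    # Path, tensor name, dtype, and shape are assumptions; verify against your run.
    import numpy as np

    out_path = "output/Result_0/logits.raw"           # hypothetical output tensor name
    logits = np.fromfile(out_path, dtype=np.float32)  # dtype follows the preserved ONNX output dtype
    logits = logits.reshape(4, 128, -1)               # (batch_size, seq_len, ...)
    print(logits.shape, logits.dtype)
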

Example of how to compile and execute an LLM

Below is an example of compiling and executing an LLM, in this case the gpt-j-6b model

  • Convert the model gpt-j-6b.onnx with:

    • batch size as 1

    • sequence length as 128

    • context length as 256

    • the output written to model.dlc

qairt-converter --input_network gpt-j-6b.onnx --output_path model.dlc --io_config qnn_customIO_config.yaml --float_bitwidth 16 --preserve_io_datatype --onnx_skip_simplification --float_bias_bitwidth 32 --onnx_defer_loading
  • qnn_customIO_config.yaml sample schema (a shape-arithmetic sketch follows the schema):

    qnn_customIO_config.yaml
    Input Tensor Configuration:
    - Name: input_ids
      Desired Model Parameters:
        DataType: int64
        Shape: (1, 128),(1, 1)  # (prefill_batch_size, prefill_seq_len), (decode_batch_size, decode_seq_len)
    - Name: position_ids
      Desired Model Parameters:
        DataType: int64
        Shape: (1, 128),(1, 1)  # (prefill_batch_size, prefill_seq_len), (decode_batch_size, decode_seq_len)

    - Name: past_key.0
      Desired Model Parameters:
        DataType: uint8
        Shape: (1, 16, 256, 256)  # (batch_size, 16, seq_len, 256)
    - Name: past_value.0
      Desired Model Parameters:
        DataType: uint8
        Shape: (1, 16, 256, 256)

    Output Tensor Configuration:
    - Name: logits
      Desired Model Parameters:
        DataType: float32
    - Name: past_key.0_RetainedState
      Desired Model Parameters:
        DataType: float16
    - Name: past_value.0_RetainedState
      Desired Model Parameters:
        DataType: float16
    
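The schema above defines two graph specializations, prefill (sequence length 128) and decode (sequence length 1), plus KV-cache tensors sized by the 256-token context length. The shape arithmetic is sketched below; the head count of 16 and head dimension of 256 are read off the past_key.0 shape in the schema and are assumptions about the exported gpt-j-6b graph.

    kv_shapes.py (illustrative shape arithmetic)
    # Shapes behind the custom IO schema above.
    batch_size  = 1
    prefill_len = 128   # sequence length of the prefill graph
    decode_len  = 1     # sequence length of the decode graph
    ctx_len     = 256   # context length; sizes the KV cache
    num_heads   = 16    # assumption, taken from the past_key.0 shape
    head_dim    = 256   # assumption, taken from the past_key.0 shape

    print("prefill input_ids:", (batch_size, prefill_len))                   # (1, 128)
    print("decode  input_ids:", (batch_size, decode_len))                    # (1, 1)
    print("past_key.0       :", (batch_size, num_heads, ctx_len, head_dim))  # (1, 16, 256, 256)
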
  • Generate the context binary with the model.dlc created at the converter stage as input

    • qnngraph.serialized.bin and programqpc.bin are generated as output

qnn-context-binary-generator --binary_file qnngraph.serialized --backend libQnnAic.so --output_dir OUTPUT_PATH --config_file qnn_config.json --data_format_config data_format_config.json --backend_binary programqpc.bin --model libQnnModelDlc.so --dlc_path model.dlc
  • qnn_config.json contents:

    qnn_config.json
    {
      "backend_extensions": {
       "config_file_path": "qnn_aic_map.json",
       "shared_library_path": "libQnnAicNetRunExtensions.so"
      }
    }
    
  • qnn_aic_map.json contents:

    qnn_aic_map.json
    {
      "compiler_PMU_recipe_opt": "KernelUtil",
      "compiler_compilation_target": "hardware",
      "compiler_convert_to_FP16": true,
      "compiler_do_DDR_to_multicast": false,
      "compiler_enable_depth_first": true,
      "compiler_hardware_version": "2.0",
      "compiler_max_out_channel_split": "1",
      "compiler_mdp_load_partition_config": tensor_slicing.json",
      "compiler_mxfp6_matmul_weights": true,
      "compiler_eval_gathernd_consts" : true,
      "compiler_mxint8_mdp_io": true,
      "compiler_num_of_cores": 16,
      "compiler_perfWarnings": true,
      "compiler_printDDRStats": true,
      "compiler_printPerfMetrics": true,
      "compiler_retained_state": true,
      "compiler_stat_level": 50,
      "compiler_stats_batch_size": 1,
      "compiler_time_passes": true,
      "graph_names": [
        "model_configuration_1",
        "model_configuration_2"
      ]
    }
    
  • tensor_slicing.json below partitions the model across 4 devices when using the multi-device partitioning feature; a sketch for generating this file programmatically follows the example:

    tensor_slicing.json
    {
      "connections": [
        {
            "devices": [
                0,
                1,
                2,
                3
            ],
            "type": "p2p"
        }
      ],
      "partitions": [
        {
            "name": "Partition0",
            "devices": [
                {
                    "deviceId": 0
                },
                {
                    "deviceId": 1
                },
                {
                    "deviceId": 2
                },
                {
                    "deviceId": 3
                }
            ]
        }
      ]
    }
    
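When targeting a different device count, the same structure can be emitted programmatically rather than written by hand. Below is a minimal sketch, assuming the p2p connection type and the single-partition layout shown in the example above.

    make_tensor_slicing.py (hypothetical helper)
    # Generates a tensor_slicing.json mirroring the 4-device example above.
    import json

    def make_tensor_slicing(num_devices=4, path="tensor_slicing.json"):
        devices = list(range(num_devices))
        config = {
            "connections": [
                {"devices": devices, "type": "p2p"}
            ],
            "partitions": [
                {
                    "name": "Partition0",
                    "devices": [{"deviceId": d} for d in devices],
                }
            ],
        }
        with open(path, "w") as f:
            json.dump(config, f, indent=2)

    make_tensor_slicing(4)
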
  • data_format_config.json below is needed for setting the KV cache precision to MXINT8; a sketch for generating it across all layers follows the example:

    data_format_config.json
    {
      "graphs": [
        {
          "graph_name": "model_configuration_1",
          "tensors": [
            {
              "tensor_name": "past_key_0",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            },
            {
              "tensor_name": "past_value_0",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            },
            {
              "tensor_name": "past_key_0_RetainedState",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            },
            {
              "tensor_name": "past_value_0_RetainedState",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            }
          ]
        },
        {
          "graph_name": "model_configuration_2",
          "tensors": [
            {
              "tensor_name": "past_key_0",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            },
            {
              "tensor_name": "past_value_0",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            },
            {
              "tensor_name": "past_key_0_RetainedState",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            },
            {
              "tensor_name": "past_value_0_RetainedState",
              "dataFormat": "QNN_TENSOR_DATA_FORMAT_MX"
            }
          ]
        }
      ]
    }
    
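gpt-j-6b has one past_key/past_value pair per transformer layer, so in practice this file is generated rather than written out layer by layer. Below is a minimal sketch, assuming 28 decoder layers for gpt-j-6b, the tensor naming shown above, and the two graph names used in qnn_aic_map.json.

    make_data_format_config.py (hypothetical helper)
    # Extends the data_format_config.json example above to every layer.
    # NUM_LAYERS = 28 is an assumption for gpt-j-6b; adjust to your export.
    import json

    NUM_LAYERS = 28
    GRAPHS = ["model_configuration_1", "model_configuration_2"]
    MX = "QNN_TENSOR_DATA_FORMAT_MX"

    def kv_tensors(layer):
        names = [f"past_key_{layer}", f"past_value_{layer}"]
        names += [n + "_RetainedState" for n in names]
        return [{"tensor_name": n, "dataFormat": MX} for n in names]

    config = {
        "graphs": [
            {
                "graph_name": g,
                "tensors": [t for layer in range(NUM_LAYERS) for t in kv_tensors(layer)],
            }
            for g in GRAPHS
        ]
    }

    with open("data_format_config.json", "w") as f:
        json.dump(config, f, indent=2)
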
  • Using the programqpc.bin generated above, execute the LLM with vLLM