vLLM arguments for QAIC

vLLM input arguments

| Input Arg | Default Value | Setting Required for QAIC Runs |
|---|---|---|
| model | None | Hugging Face model name or model path |
| max-num-seqs | 256 | Decode batch size |
| max-model-len | 2048 | Context length |
| max-seq-len-to-capture | None | Sequence length |
| device | "auto" | "auto" or "qaic": Qualcomm AI Cloud devices are used if vLLM is installed correctly for QAIC. Note: this is only applicable to vLLM v0.8.5. |
| device-group | [0] | List of device IDs to use for execution. Ultra: 0,1,2,3; Ultra+: 0,1,2,3,4,5,6,7 |
| quantization | "auto" | "auto": no weight quantization (FP16); "mxfp6": weights are quantized with MXFP6 |
| kv-cache-dtype | "auto" | "auto": no KV cache compression (FP16); "mxint8": KV cache compressed using the MXINT8 format |
| disable-log-stats | True | True: performance stats logging is disabled; False: performance stats are printed |
| num-gpu-blocks-override | "max-num-seqs" | User-configurable; controls how much KV cache memory is allocated |
| tensor-parallel-size | 1 | vLLM's tensor-slicing implementation using a collective communication library is not supported; tensor slicing is instead supported inherently through the QAIC AOT approach. To use TS > 1, provide the right set of device IDs in the device-group argument. It is recommended not to enable vLLM's default TS implementation via the "--tensor-parallel-size" argument. |
| enable-chunked-prefill | False | Chunked prefill is supported by default in the QAIC model-runner implementation rather than through the default chunking logic in the vLLM scheduler class, so it is recommended not to enable chunking via the "--enable-chunked-prefill" argument. |
| enable-prefix-caching | False | Set to True to enable prefix caching for QAIC |
| override-qaic-config | None | Initializes a non-default QAIC config, or overrides default config values specific to QAIC devices. For a speculative draft model, this argument configures the QAIC settings that cannot be fully derived from the vLLM arguments. |
| speculative-config | | Configuration for speculative decoding. |
| task | | The task to use the model for. Each vLLM instance supports only one task, even if the same model can be used for multiple tasks. When the model supports only one task, "auto" can be used to select it; otherwise, you must specify explicitly which task to use. |
| override-pooler-config | | Initializes a non-default pooling config, or overrides the default pooling config, for the pooling model. |
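Several of the arguments above are typically combined on one command line. The sketch below is illustrative only: the model name and device IDs are placeholders, and the flag spellings follow the table above; adjust them for your installation.

```shell
# Hypothetical QAIC serving launch (model name and device IDs are placeholders).
vllm serve Qwen/Qwen2-7B-Instruct \
    --device qaic \
    --device-group 0,1,2,3 \
    --max-num-seqs 16 \
    --max-model-len 4096 \
    --quantization mxfp6 \
    --kv-cache-dtype mxint8
```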

Override Arguments that can be modified

override-qaic-config = <Compiler cfg for target model>

Using this interface, users can override default attributes such as num_cores, dfs, mos, device_group, qpc_path, mxfp6_matmul, mxint8_kv_cache, and other compiler options.

CLI inferencing

When running from the command line, use a single space between attribute=value pairs and no spaces within a pair.

Example

--override-qaic-config="num_cores=4 mxfp6_matmul=True mos=1 device_group=0,1,2,3"

Note: Provide only the attributes that need to be overridden.

Python object inferencing

Override arguments can also be passed as input during LLM object creation.

Example

override_qaic_config = {'num_cores': 8, 'mxfp6_matmul': True, 'mos': 1}

Note: Provide only the attributes that need to be overridden.
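The CLI string form and the Python dict form carry the same information. A small helper like the sketch below (hypothetical, not part of vLLM) shows how the space-separated attribute=value string maps onto the dict form:

```python
# Hypothetical helper (not part of vLLM): parse the space-separated
# "attr=value" override string into the dict form used at LLM() creation.
def parse_override_string(s):
    out = {}
    for pair in s.split():
        key, _, value = pair.partition("=")
        if value in ("True", "False"):       # booleans, e.g. mxfp6_matmul=True
            out[key] = value == "True"
        elif "," in value:                   # lists, e.g. device_group=0,1,2,3
            out[key] = [int(v) for v in value.split(",")]
        else:
            try:
                out[key] = int(value)        # integers, e.g. num_cores=4
            except ValueError:
                out[key] = value             # strings, e.g. qpc_path=/some/dir

    return out

cfg = parse_override_string("num_cores=4 mxfp6_matmul=True mos=1 device_group=0,1,2,3")
print(cfg)  # {'num_cores': 4, 'mxfp6_matmul': True, 'mos': 1, 'device_group': [0, 1, 2, 3]}
```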

All qaic-compile arguments can be passed as input parameters. The list below describes the supported arguments and their corresponding descriptions.

| Input Argument | Default Value | Description |
|---|---|---|
| num_cores / aic_num_cores | 16 | Specifies the number of NSP cores to use. Defaults to 8 for SpD draft models when the speculative configuration uses the same device group as the target model. |
| dfs / aic_enable_depth_first | True | Enables depth-first scheduling. Set "dfs=false" to disable. |
| mos | 1 | Degree of weight splitting across cores to reduce on-chip memory usage. |
| num_devices | None | Number of devices to use. Specify either "num_devices" or "device_group". In auto-device mode, "device_group" is not required. |
| mdts_mos | None | Degree of weight splitting across multi-device tensor slices to improve memory usage and compute efficiency. |
| mxint8 / mxint8_en / mxint8_kv_cache | False | Enables MXINT8 quantization for KV cache or MDP IO traffic compression. Recommended to use the vLLM argument "--kv-cache-dtype" instead when available. |
| mxfp6 / mxfp6_matmul / mxfp6_en | False | Enables MXFP6 (E2M3) quantization for constant MatMul weights to reduce memory traffic at the cost of slightly more compute. Prefer the vLLM argument "--quantization=mxfp6". |
| device_group | None | List of device IDs used for execution. Ultra: "0,1,2,3". Use either "device_group" or "num_devices". |
| embed_seq_len | None | List of model lengths. The compiler generates a single QPC supporting multiple lengths, allowing vLLM to switch QPCs dynamically. |
| comp_ctx_lengths_prefill | None | List of context lengths for the prefill stage. Enables multi-length binaries and CCL support for higher performance. |
| comp_ctx_lengths_decode | None | List of context lengths for the decode stage. Enables multi-length binaries and CCL support for higher performance. |
| ccl_enabled | False | Explicitly enables CCL. If not specified, optimized CCL lists are automatically generated when context-length lists are not provided. |
| num_patches | None | Used to compile vision-language models based on the number of image patches. |
| height | None | Height of the image for which vision and language binaries are compiled. |
| width | None | Width of the image for which vision and language binaries are compiled. |
| aic_include_sampler | False | Enables on-device sampling. |
| max_top_k_ids | 512 | Maximum top-k value compiled into the sampler binary. |
| aic_include_guided_decoding | False | Enables guided decoding. Applicable only when on-device sampling is enabled. |
| kv_offload | False | Enables KV cache offload. |
| skip_lang | False | Used in dual-QPC compilation. When set, language/text binary compilation is skipped. |
| skip_vision | False | Used in dual-QPC compilation. When set, vision binary compilation is skipped. |
| pooling_device | None | Device on which pooling runs. Must be "qaic" or "cpu". Required for pooled outputs. |
| pooling_method | None | Pooling method when "pooling_device=qaic". Supported: "mean", "avg", "cls", "max", or custom poolers. |
| normalize | False | Normalizes pooled outputs when using "pooling_device=qaic". |
| softmax | False | Applies softmax to pooled outputs when using "pooling_device=qaic". |
| prefill_only | None | Used for disaggregated serving. "True": compile the prefill QPC only; "False": compile the decode QPC only; "None": a single QPC for both stages. |
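In dict form, several of the compiler arguments above are typically combined. The sketch below is hypothetical (the helper is not part of vLLM; the values are placeholders) and encodes one documented constraint: "device_group" and "num_devices" are mutually exclusive.

```python
# Hypothetical sketch: assemble an override_qaic_config dict from the
# compiler arguments above and check the documented mutual-exclusion rule.
def build_qaic_override(**kwargs):
    # Per the table, specify either "device_group" or "num_devices", not both.
    if "device_group" in kwargs and "num_devices" in kwargs:
        raise ValueError("Specify either device_group or num_devices, not both")
    return kwargs

cfg = build_qaic_override(
    num_cores=16,                                  # NSP cores (default 16)
    device_group=[0, 1, 2, 3],                     # Ultra card, four devices
    comp_ctx_lengths_prefill=[512, 1024, 2048],    # multi-length prefill binaries
    comp_ctx_lengths_decode=[2048, 4096],          # multi-length decode binaries
)
```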

vLLM flags and environment variables

| Input Arg | Default Value | Setting Required for QAIC Runs |
|---|---|---|
| VLLM_QAIC_QPC_PATH | None | Set this variable to the path of a precompiled QPC. vLLM loads the QPC directly from the given path and does not compile the model. |
| VLLM_QAIC_MOS | None | Sets the MOS value. |
| VLLM_QAIC_DFS_EN | None | Enables compiler depth-first scheduling. |
| VLLM_QAIC_QID | None | Manually sets the QID for QAIC devices. |
| VLLM_QAIC_NUM_CORES | None | Sets num_cores, e.g. 14 or 16. |
| VLLM_QAIC_COMPILER_ARGS | None | Passes additional compiler arguments. |
| VLLM_QAIC_MAX_CPU_THREADS | None | Avoids oversubscription of CPU threads during multi-instance execution. By default there is no limit; when VLLM_QAIC_MAX_CPU_THREADS is set, the number of CPU threads running PyTorch sampling on the CPU is limited to avoid oversubscription. The contention is amplified when running in a container, where CPU limits can cause throttling. |

Avoiding CPU oversubscription via VLLM_QAIC_MAX_CPU_THREADS

CPU oversubscription is a situation where the total number of CPUs allocated across processes exceeds the number of CPUs available on the hardware, leading to severe contention for CPU resources. Frequent switching between processes increases context-switching overhead and decreases overall system efficiency. In containers, where multiple vLLM instances can cause oversubscription, limiting the number of concurrent CPU threads is a good way to avoid it.

Example

export VLLM_QAIC_MAX_CPU_THREADS=8
export OMP_NUM_THREADS=8
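The effect of such a cap can be sketched as below. This is a hypothetical illustration, not vLLM's actual implementation: the effective thread count is the smaller of the host CPU count and the user-provided limit.

```python
import os

# Hypothetical sketch of how a runtime could honor VLLM_QAIC_MAX_CPU_THREADS:
# cap the CPU sampling thread count at min(host CPUs, user-provided limit).
def effective_cpu_threads(env=os.environ):
    available = os.cpu_count() or 1
    limit = env.get("VLLM_QAIC_MAX_CPU_THREADS")
    if limit is None:
        return available                    # default: no limit beyond host CPUs
    return max(1, min(available, int(limit)))

print(effective_cpu_threads({"VLLM_QAIC_MAX_CPU_THREADS": "8"}))
```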