# vLLM arguments for QAIC

## vLLM input arguments
| Input Arg | Default Value | Setting Required for Qaic runs |
|---|---|---|
| model | None | Hugging Face model name or model path |
| max-num-seqs | 256 | Decode batch size |
| max-model-len | 2048 | Context length |
| max-seq-len-to-capture | None | Sequence length |
| device | "auto" | "auto" or "qaic": Qualcomm AI Cloud devices are used if vLLM is installed correctly for qaic. Note: this is only applicable to vLLM v0.8.5 |
| device-group | [0] | List of device IDs to be used for execution. Ultra: 0,1,2,3; Ultra+: 0,1,2,3,4,5,6,7 |
| quantization | "auto" | "auto": no weight quantization (FP16); "mxfp6": weights are quantized with MXFP6 |
| kv-cache-dtype | "auto" | "auto": no KV cache compression (FP16); "mxint8": KV cache compressed using the MXINT8 format |
| disable-log-stats | True | True: disable logging of performance stats; False: log performance stats |
| num-gpu-blocks-override | "max-num-seqs" | User-configurable; controls how much KV cache memory is allocated |
| tensor-parallel-size | 1 | The vLLM tensor-slicing implementation based on a collective communication library is not supported; tensor slicing is instead supported inherently through the QAIC AOT approach. To use TS > 1, provide the right set of device IDs in the device-group argument. It is recommended not to enable the default vLLM TS implementation via the "--tensor-parallel-size" argument |
| enable-chunked-prefill | False | Chunked prefill is supported by default in the QAIC model runner implementation rather than through the default chunking logic in the vLLM scheduler class, so it is recommended not to enable chunking via the "--enable-chunked-prefill" argument |
| enable-prefix-caching | False | Set this flag to True to enable prefix caching for qaic |
| override-qaic-config | None | Initialize a non-default qaic config or override default qaic config values specific to Qaic devices. For a speculative draft model, this argument configures the qaic settings that cannot be fully derived from the vLLM arguments |
| speculative-config | | Configuration for speculative decoding |
| task | | The task to use the model for. Each vLLM instance supports only one task, even if the same model can be used for multiple tasks. When the model supports only one task, "auto" can be used to select it; otherwise, you must specify explicitly which task to use |
| override-pooler-config | | Initialize a non-default pooling config or override the default pooling config for the pooling model |
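Taken together, the arguments above can be combined into a single serving invocation. The sketch below assembles such a command line in Python; the model name and all chosen values are illustrative assumptions, not requirements.

```python
# Hypothetical sketch: assemble a QAIC serving command from the table
# above. The model name and all values here are illustrative.
args = {
    "--device": "qaic",            # use Qualcomm AI Cloud devices
    "--device-group": "0,1,2,3",   # Ultra card device IDs
    "--max-num-seqs": "16",        # decode batch size
    "--max-model-len": "2048",     # context length
    "--quantization": "mxfp6",     # MXFP6 weight quantization
    "--kv-cache-dtype": "mxint8",  # MXINT8 KV cache compression
}
cmd = ["vllm", "serve", "meta-llama/Llama-3.1-8B-Instruct"]
for flag, value in args.items():
    cmd += [flag, value]
print(" ".join(cmd))
```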
## Override Arguments that can be modified

```
override-qaic-config = <Compiler cfg for target model>
```

Using this interface, users can override default attributes such as num_cores, dfs, mos, device_group, qpc_path, mxfp6_matmul, mxint8_kv_cache, and other compiler options.
### CLI inferencing

Use a single space between attributes, and no spaces within an attribute=value pair, when passing overrides on the command line.

Example:

```
--override-qaic-config = "num_cores=4 mxfp6_matmul=True mos=1 device_group=0,1,2,3"
```

Note: only provide attributes that need to be overridden.
### Python object inferencing

Override arguments can also be passed as input during LLM object creation.

Example:

```python
override_qaic_config = {'num_cores': 8, 'mxfp6_matmul': True, 'mos': 1}
```

Note: only provide attributes that need to be overridden.
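The CLI string and the Python dict carry the same information. Below is a minimal sketch of how an override dict could be flattened into the space-separated CLI form; the helper name `to_cli_override` is hypothetical and not part of vLLM.

```python
# Hypothetical helper (not part of vLLM) showing how a Python override
# dict maps onto the space-separated string accepted on the CLI.
def to_cli_override(cfg: dict) -> str:
    parts = []
    for key, value in cfg.items():
        if isinstance(value, (list, tuple)):
            # lists such as device_group become comma-separated values
            value = ",".join(str(v) for v in value)
        parts.append(f"{key}={value}")
    return " ".join(parts)

override_qaic_config = {'num_cores': 8, 'mxfp6_matmul': True, 'mos': 1}
print(to_cli_override(override_qaic_config))
# num_cores=8 mxfp6_matmul=True mos=1
```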
All qaic-compile arguments can be passed as input parameters. The list below describes the supported arguments and their corresponding descriptions.
| Input Argument | Default Value | Description |
|---|---|---|
| num_cores / aic_num_cores | 16 | Specifies the number of NSP cores to use. Defaults to 8 for SpD draft models when the speculative configuration uses the same device group as the target model |
| dfs / aic_enable_depth_first | True | Enables depth-first scheduling. Set "dfs=false" to disable |
| mos | 1 | Degree of weight splitting across cores to reduce on-chip memory usage |
| num_devices | None | Number of devices to use. Specify either "num_devices" or "device_group". In auto-device mode, "device_group" is not required |
| mdts_mos | None | Degree of weight splitting across multi-device tensor slices to improve memory usage and compute efficiency |
| mxint8 / mxint8_en / mxint8_kv_cache | False | Enables MXINT8 quantization for KV cache or MDP IO traffic compression. Prefer the vLLM argument "--kv-cache-dtype" when available |
| mxfp6 / mxfp6_matmul / mxfp6_en | False | Enables MXFP6 (E2M3) quantization for constant MatMul weights to reduce memory traffic at the cost of slightly more compute. Prefer the vLLM argument "--quantization=mxfp6" |
| device_group | None | List of device IDs used for execution. Ultra: "0,1,2,3". Use either "device_group" or "num_devices" |
| embed_seq_len | None | List of model lengths. The compiler generates a single QPC supporting multiple lengths, allowing vLLM to switch QPCs dynamically |
| comp_ctx_lengths_prefill | None | List of context lengths for the prefill stage. Enables multi-length binaries and CCL support for higher performance |
| comp_ctx_lengths_decode | None | List of context lengths for the decode stage. Enables multi-length binaries and CCL support for higher performance |
| ccl_enabled | False | Explicitly enables CCL. If not specified, optimized CCL lists are generated automatically when context length lists are not provided |
| num_patches | None | Used to compile Vision-Language models based on the number of image patches |
| height | None | Height of the image for which vision and language binaries are compiled |
| width | None | Width of the image for which vision and language binaries are compiled |
| aic_include_sampler | False | Enables on-device sampling |
| max_top_k_ids | 512 | Maximum top-k value compiled into the sampler binary |
| aic_include_guided_decoding | False | Enables guided decoding. Applicable only when on-device sampling is enabled |
| kv_offload | False | Enables KV cache offload |
| skip_lang | False | Used in dual-QPC compilation. When set, language/text binary compilation is skipped |
| skip_vision | False | Used in dual-QPC compilation. When set, vision binary compilation is skipped |
| pooling_device | None | Device on which pooling runs. Must be "qaic" or "cpu". Required for pooled outputs |
| pooling_method | None | Pooling method when "pooling_device=qaic". Supported: "mean", "avg", "cls", "max", or custom poolers |
| normalize | False | Normalizes pooled outputs when using "pooling_device=qaic" |
| softmax | False | Applies softmax to pooled outputs when using "pooling_device=qaic" |
| prefill_only | None | Used for disaggregated serving. "True": compile the prefill QPC only; "False": compile the decode QPC only; "None": a single QPC for both stages |
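As an illustration of how these compiler arguments combine, here is a hedged example of an override dict for a pooling model; every value is an illustrative assumption, not a recommended setting.

```python
# Illustrative override dict built from the compiler arguments above,
# e.g. for running a pooling model on device. All values are assumptions.
override_qaic_config = {
    "num_cores": 16,               # NSP cores
    "device_group": [0, 1, 2, 3],  # Ultra card device IDs
    "pooling_device": "qaic",      # required for pooled outputs
    "pooling_method": "mean",
    "normalize": True,
}
# Per the table, use either device_group or num_devices, not both:
assert "num_devices" not in override_qaic_config
```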
## vLLM flags and environment variables

| Input Arg | Default Value | Setting Required for Qaic runs |
|---|---|---|
| VLLM_QAIC_QPC_PATH | None | Set this variable to the path of a pre-compiled QPC; vLLM loads the QPC directly from that path and will not compile the model |
| VLLM_QAIC_MOS | None | Set the MOS value |
| VLLM_QAIC_DFS_EN | None | Enable compiler depth-first scheduling |
| VLLM_QAIC_QID | None | Manually set the QID for qaic devices |
| VLLM_QAIC_NUM_CORES | None | Set num_cores, for example 14 or 16 |
| VLLM_QAIC_COMPILER_ARGS | None | Pass additional compiler arguments through this environment variable |
| VLLM_QAIC_MAX_CPU_THREADS | None | Avoids oversubscription of CPU threads during multi-instance execution. By default there is no limit; when VLLM_QAIC_MAX_CPU_THREADS is set, the number of CPU threads running PyTorch sampling on the CPU is limited to avoid oversubscription. Contention is amplified when running in a container, where CPU limits can cause throttling |
## Avoiding CPU oversubscription via VLLM_QAIC_MAX_CPU_THREADS

CPU oversubscription occurs when the total number of CPUs allocated to workloads exceeds the number of CPUs available on the hardware, leading to severe contention for CPU resources. Frequent switching between processes increases context-switching overhead and decreases overall system efficiency. In containers, where multiple vLLM instances can cause oversubscription, limiting the number of concurrent CPU threads is a good way to avoid it.

Example:

```shell
export VLLM_QAIC_MAX_CPU_THREADS=8
export OMP_NUM_THREADS=8
```
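The intended effect of the variable can be sketched as a simple clamp. The helper below is hypothetical; it only models the behaviour described above, not vLLM's actual implementation.

```python
# Hypothetical sketch of the capping behaviour described above: clamp
# the sampling thread count to VLLM_QAIC_MAX_CPU_THREADS when it is
# set, otherwise use all available CPUs.
def effective_cpu_threads(available: int, env: dict) -> int:
    limit = env.get("VLLM_QAIC_MAX_CPU_THREADS")
    if limit is None:
        return available  # no limit set: default behaviour
    return min(available, int(limit))

print(effective_cpu_threads(64, {"VLLM_QAIC_MAX_CPU_THREADS": "8"}))  # 8
print(effective_cpu_threads(4, {}))                                   # 4
```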