Prefix Caching¶
Overview¶
Prefix caching reduces time-to-first-token (TTFT) and compute cost for requests that share a common leading sequence of tokens. When multiple requests begin with the same prefix — such as a system prompt, a few-shot example set, or a long document — the KV cache computed for that prefix is stored and reused rather than recomputed for each new request.
Use cases:
System prompt reuse — chatbots and assistants that prepend the same system prompt to every request benefit immediately; the prompt is processed once and its KV cache is reused across all subsequent turns and sessions.
Few-shot and RAG workloads — retrieval-augmented generation pipelines that prepend the same retrieved context or few-shot examples to many queries avoid redundant prefill computation.
Multi-turn conversations — the growing conversation history shared across turns is cached, so only the new user message requires fresh prefill work.
Document Q&A — when many questions are asked against the same document, the document tokens are encoded once and reused for every question.
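The reuse pattern above can be sketched in plain Python. The class below is a simplified illustration of the prefix-caching idea, not vLLM's or QAIC's implementation; `compute_kv` is a hypothetical stand-in for the expensive attention prefill step.

```python
def compute_kv(tokens):
    """Stand-in for attention prefill; returns one fake KV entry per token."""
    return [("kv", t) for t in tokens]

class PrefixCache:
    """Toy prefix cache: reuse KV state for the longest cached token prefix."""

    def __init__(self):
        self.cache = {}   # token-prefix tuple -> its (toy) KV cache
        self.hits = 0
        self.misses = 0

    def prefill(self, tokens):
        # Find the longest prefix of `tokens` we have already prefilled.
        best = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cache:
                best = n
                break
        if best:
            self.hits += 1
            kv = list(self.cache[tuple(tokens[:best])])
        else:
            self.misses += 1
            kv = []
        # Only the uncached tail needs fresh prefill work.
        kv += compute_kv(tokens[best:])
        # Remember every prefix so later requests can hit partway in.
        for i in range(1, len(tokens) + 1):
            self.cache[tuple(tokens[:i])] = kv[:i]
        return kv

# Two requests sharing a system-prompt prefix (token IDs are illustrative).
system_prompt = [1, 2, 3, 4]
cache = PrefixCache()
cache.prefill(system_prompt + [10, 11])   # first request: full prefill (miss)
cache.prefill(system_prompt + [20])       # second request: prefix hit
```

After the second call, only the single new token `20` required prefill; the four system-prompt tokens were served from the cache.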
On Qualcomm Cloud AI accelerators, prefix caching operates over a fixed KV cache allocated at
server startup. The num-gpu-blocks-override parameter allows the cache to hold
more concurrent user states than the active decode batch size, improving hit rates
for multi-user workloads.
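Assuming a vLLM-style OpenAI-compatible server entry point, a launch that enables prefix caching and sizes the fixed cache beyond the active decode batch might look like the fragment below; the model path and block count are illustrative placeholders, not recommended values.

```shell
# Illustrative launch: the block override lets the cache retain KV state
# for more user sessions than the 8-sequence decode batch.
python -m vllm.entrypoints.openai.api_server \
    --model <model-path> \
    --enable-prefix-caching \
    --max-num-seqs 8 \
    --num-gpu-blocks-override 4096
```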
Limitations¶
QAIC vLLM does not currently support PagedAttention, so QAIC prefix caching cannot share a block between two sequences running simultaneously in the same decode batch.
QAIC prefix caching cannot currently be enabled together with any of the following features: multimodality, LoRAX, embedding, disaggregated serving, or on-device sampling.
Parameters¶
| Input Arg | Setting required for QAIC runs |
|---|---|
| `enable-prefix-caching` | Set this flag to `True` to enable prefix caching on QAIC. |
| `num-gpu-blocks-override` | Set this flag to retain the KV cache of more users than the decode batch size (`max-num-seqs`). |
| `use_v2_block_manager` | Set to `True` for prefix caching. |
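The effect of `num-gpu-blocks-override` can be sketched with simple capacity arithmetic. The block size, block count, and context length below are illustrative assumptions, not QAIC defaults:

```python
# Illustrative capacity math (all values are assumptions for the sketch).
block_size = 16                  # tokens per KV cache block (assumed)
num_gpu_blocks_override = 4096   # total blocks allocated at server startup
max_num_seqs = 8                 # active decode batch size
avg_context_tokens = 2048        # typical per-user context length (assumed)

blocks_per_user = avg_context_tokens // block_size   # 128 blocks per user
cached_users = num_gpu_blocks_override // blocks_per_user

# The cache can retain state for far more users than the decode batch,
# which is what improves hit rates for multi-user workloads.
print(cached_users)
```

Here 32 user contexts fit in the cache even though only 8 sequences decode at once, so returning users can hit their cached prefixes instead of re-running prefill.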
Example¶
```
python examples/offline_inference/qaic_prefix_caching.py
```