Prefix Caching

Overview

Prefix caching reduces time-to-first-token (TTFT) and compute cost for requests that share a common leading sequence of tokens. When multiple requests begin with the same prefix — such as a system prompt, a few-shot example set, or a long document — the KV cache computed for that prefix is stored and reused rather than recomputed for each new request.
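
The reuse mechanism can be sketched as a lookup keyed by prefix hashes: token IDs are grouped into fixed-size blocks, and each block is keyed by a hash of the entire prefix up to and including that block, so a hit means the whole prefix ending there was seen before. This is a toy illustration of the idea, not vLLM's actual implementation:

```python
# Toy sketch of hash-based prefix caching (illustration only, not vLLM's
# actual implementation). A block is keyed by the hash of ALL tokens up to
# and including that block, so a cache hit implies the full prefix matches.

BLOCK_SIZE = 4  # tokens per KV-cache block (real systems use larger blocks)

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # prefix hash -> simulated KV-cache block

    def prefill(self, tokens):
        """Return (cached_blocks, computed_blocks) for one prompt's prefill."""
        cached = computed = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            key = hash(tuple(tokens[:end]))  # hash covers the whole prefix
            if key in self.blocks:
                cached += 1                  # reuse KV from an earlier request
            else:
                self.blocks[key] = object()  # stand-in for the KV tensors
                computed += 1
        return cached, computed

cache = PrefixCache()
system_prompt = list(range(8))              # 2 blocks shared by both requests
req_a = system_prompt + [100, 101, 102, 103]
req_b = system_prompt + [200, 201, 202, 203]

print(cache.prefill(req_a))  # (0, 3): first request computes all 3 blocks
print(cache.prefill(req_b))  # (2, 1): the 2 shared prefix blocks are reused
```

Only the blocks after the longest previously-seen prefix require fresh prefill compute, which is where the TTFT savings come from.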

Use cases:

  • System prompt reuse — chatbots and assistants that prepend the same system prompt to every request benefit immediately; the prompt is processed once and its KV cache is reused across all subsequent turns and sessions.

  • Few-shot and RAG workloads — retrieval-augmented generation pipelines that prepend the same retrieved context or few-shot examples to many queries avoid redundant prefill computation.

  • Multi-turn conversations — the growing conversation history shared across turns is cached, so only the new user message requires fresh prefill work.

  • Document Q&A — when many questions are asked against the same document, the document tokens are encoded once and reused for every question.
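
For the document Q&A case, the potential savings can be estimated directly: the document prefix is prefilled once, and each subsequent question only pays for its own tokens. The token counts below are illustrative, not measured:

```python
# Back-of-envelope estimate of prefill savings for document Q&A.
# Token counts are illustrative; real values depend on the tokenizer.

doc_tokens = 4000         # tokens in the shared document prefix
question_tokens = 30      # tokens in each new question
num_questions = 50

# Without caching, every question re-prefills the full document.
without_cache = num_questions * (doc_tokens + question_tokens)
# With caching, the document is prefilled once and then reused.
with_cache = doc_tokens + num_questions * question_tokens

print(without_cache, with_cache)  # 201500 5500
savings = 1 - with_cache / without_cache
print(f"{savings:.1%}")           # 97.3% of prefill compute avoided
```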

On Qualcomm Cloud AI accelerators, prefix caching operates over a fixed KV cache allocated at server startup. The num-gpu-blocks-override parameter allows the cache to hold more concurrent user states than the active decode batch size, improving hit rates for multi-user workloads.
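
As a sketch, a server for a multi-user chat workload might be launched as follows. The flag names follow vLLM's standard CLI; the model name and block count are placeholders, and the right value for num-gpu-blocks-override depends on block size, sequence lengths, and how many user states you want to keep resident:

```shell
# Hypothetical launch command; model and block count are placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --max-num-seqs 8 \
    --num-gpu-blocks-override 4096
```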

Limitations

  • QAIC vLLM does not currently support PagedAttention, so QAIC prefix caching cannot share a cache block between two sequences running simultaneously in the same decode batch.

  • QAIC prefix caching cannot currently be enabled together with any of the following features: multimodality, LoRAX, embedding, disaggregated serving, or on-device sampling.

Parameters

  • enable-prefix-caching — Set this flag to True to enable prefix caching for QAIC.

  • num-gpu-blocks-override — Set this flag to keep the KV cache of more users than the decode batch size (max-num-seqs).

  • use_v2_block_manager — Must be set to True for prefix caching.

Example

python examples/offline_inference/qaic_prefix_caching.py