Compute Context Length (CCL)
Compute Context Length (CCL) is a compilation and runtime feature that improves performance for large-context inference by dynamically switching between multiple compute context lengths within a single compiled model.
Traditional Ahead-of-Time (AoT) compilation specializes a model for one fixed context length. As context length increases, throughput drops and time-to-first-token (TTFT) increases, even when the actual prompt and generation lengths are small. CCL addresses this problem by:
Compiling one model once with multiple context-length specializations
Dynamically selecting the appropriate compute context length at runtime
Avoiding unnecessary attention computation over the full maximum context
Key properties of CCL:
One compilation → one QPC
QPC size is unchanged compared to non-CCL models
Runtime switching is automatic and transparent to the user
Enabled with a single flag:
ccl_enabled=True
CCL works across:
vLLM
SpD + vLLM
Disaggregated Serving (decode-only)
Performance benefits increase significantly at context lengths of 8K and above, and CCL remains safe and usable for smaller contexts.
High-Level Architecture
At a high level, CCL introduces dynamic compute shapes into the model graph:
During compilation, the model is specialized for lists of compute context lengths (CCL lists)
During runtime, the system selects the proper CCL value based on position IDs
KV-cache reads, attention masks, and memory accesses are restricted to the active CCL
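The runtime selection rule can be sketched as follows. This is an illustrative sketch only, not the actual vLLM/qaic implementation: given the furthest position ID in the current batch, the smallest compiled specialization that still covers it is chosen, so attention is restricted to that window instead of the full maximum context.

```python
def select_ccl(max_position_id: int, ccl_list: list[int]) -> int:
    """Pick the smallest compiled compute context length that still
    covers the furthest position in the batch (illustrative sketch)."""
    for ccl in sorted(ccl_list):
        if max_position_id < ccl:
            return ccl
    # Positions reach the end of the largest window: use the full context.
    return max(ccl_list)

# Example: a model compiled with 4K/8K/16K/32K specializations.
ccl_list = [4096, 8192, 16384, 32768]
print(select_ccl(100, ccl_list))    # short prompt -> 4096
print(select_ccl(10000, ccl_list))  # mid-range positions -> 16384
```

With this rule, a 100-token request pays only for a 4K attention window even though the model was compiled for a 32K maximum.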
Environment Setup
Follow the standard vLLM installation instructions from Qualcomm documentation:
Using a pre-built Docker inference container
Installing vLLM from source
CCL does not require a separate installation. It is enabled entirely through configuration flags.
CCL Configuration
CCL can be enabled in two ways:
Automatic mode (recommended)
Manual CCL list specification (advanced)
Automatic Mode
In automatic mode, you only need to pass:
ccl_enabled=True
The system automatically generates optimized CCL lists for both prefilling and decoding.
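The documentation does not specify how the automatic lists are derived. Purely as an illustration of what such a list looks like, one plausible scheme doubles from a minimum window up to the model's maximum context length; the actual generation logic in the qaic backend may differ:

```python
def make_ccl_list(max_ctx: int, min_ccl: int = 4096) -> list[int]:
    """Illustrative only: derive a CCL list by doubling from a minimum
    window up to the maximum context. Not the real qaic algorithm."""
    lst = []
    ccl = min_ccl
    while ccl < max_ctx:
        lst.append(ccl)
        ccl *= 2
    lst.append(max_ctx)  # the largest entry always covers the full context
    return lst

print(make_ccl_list(32768))  # [4096, 8192, 16384, 32768]
```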
Manual Mode (Advanced)
You may optionally specify explicit CCL lists:
comp_ctx_lengths_prefill = [ ... ]
comp_ctx_lengths_decode = [ ... ]
Rules and guidelines:
Prefill and decode lists must not overlap
The total number of CCL values across both lists should be ≤ 10
Larger lists increase the number of specializations and may reduce performance
If manual lists are not provided, optimized lists are generated automatically.
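The rules above can be checked with a small helper before launching the server. This is a sketch for validating your own configuration, not part of the vLLM API:

```python
def validate_ccl_lists(prefill: list[int], decode: list[int]) -> None:
    """Enforce the manual-mode guidelines: disjoint prefill/decode lists
    and at most 10 specializations in total (sketch only)."""
    if set(prefill) & set(decode):
        raise ValueError("prefill and decode CCL lists must not overlap")
    if len(prefill) + len(decode) > 10:
        raise ValueError("total number of CCL values should be <= 10")

# A valid configuration: disjoint lists, 6 specializations total.
validate_ccl_lists([256, 2048], [4096, 8192, 16384, 32768])
```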
Compilation with CCL
CCL does not require a separate compilation flow. Compilation happens automatically when the model is first loaded with ccl_enabled=True.
Key points:
The model is compiled once
Multiple CCL specializations are embedded in the same QPC
No recompilation is required for different prompt sizes
Launching vLLM Server with CCL
Example: Online serving with vLLM with CCL enabled
python3 -m vllm.entrypoints.api_server \
--host 127.0.0.1 \
--port 8000 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--max-model-len 256 \
--max-num-seqs 16 \
--max-seq-len-to-capture 128 \
--device qaic \
--block-size 32 \
--quantization mxfp6 \
--kv-cache-dtype mxint8 \
--override-qaic-config "ccl_enabled=True"
python3 examples/openai_chat_completion_client.py
Using CCL with SpD + vLLM
Speculative decoding (SpD) with vLLM uses the same interface as standard vLLM.
Enable CCL by passing:
ccl_enabled=True
Optional prefill and decode CCL lists may be provided, but are not required. Automatic list generation is supported and recommended.
Using CCL with Disaggregated Serving
Important limitations:
CCL is supported only during decoding
Prefilling with CCL in disaggregated mode is not supported
To enable CCL during decoding:
--decode-override-qaic-config "ccl_enabled=True,comp_ctx_lengths_decode=[4096,8192,16384,32768]"
Note
Prefill runs without CCL.
Benchmarking with CCL
vLLM benchmarking with CCL uses the same scripts as standard vLLM.
Example:
python3 benchmarks/benchmark_serving.py \
--backend openai \
--base-url http://127.0.0.1:8000 \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--sharegpt-max-input-len 128 \
--sharegpt-max-model-len 256 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--seed 12345
End-to-End Example
Install the Qualcomm Cloud AI SDK and build vLLM.
Choose a model (for example, TinyLlama or Llama3).
Launch vLLM with:
--override-qaic-config "ccl_enabled=True"
Send requests using OpenAI-compatible APIs.
Benchmark using standard vLLM tools.
Scale prompt size and generation length without recompiling.
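The request shape for the "send requests" step is the standard OpenAI chat-completions schema; CCL does not change the client side at all. A minimal sketch of building such a request body, using the TinyLlama model name from the server example above:

```python
import json

def build_chat_request(prompt: str, model: str, max_tokens: int = 64) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions
    request; send it with any HTTP client to the running vLLM server."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = build_chat_request("Hello!", "TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(body)
```

Because runtime CCL switching is transparent, the same client code works unchanged whether CCL is enabled or not.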
Best Practices
Compile with a large maximum context (for example, 128K) if future needs are unknown
Avoid over-populating CCL lists