Compute Context Length (CCL)

Compute Context Length (CCL) is a compilation and runtime feature that improves performance for large-context inference by dynamically switching between multiple compute context lengths within a single compiled model.

Traditional Ahead-of-Time (AoT) compilation specializes a model for one fixed context length. As context length increases, throughput drops and time-to-first-token (TTFT) increases, even when the actual prompt and generation lengths are small. CCL addresses this problem by:

  • Compiling one model once with multiple context-length specializations

  • Dynamically selecting the appropriate compute context length at runtime

  • Avoiding unnecessary attention computation over the full maximum context

Key properties of CCL:

  • One compilation → one QPC (Qualcomm Program Container)

  • QPC size is unchanged compared to non-CCL models

  • Runtime switching is automatic and transparent to the user

  • Enabled with a single flag: ccl_enabled=True

CCL works across:

  • vLLM

  • SpD (Speculative Decoding) + vLLM

  • Disaggregated Serving (decode-only)

Performance benefits increase significantly for 8K+ context lengths, while remaining safe and usable for smaller contexts.

High-Level Architecture

At a high level, CCL introduces dynamic compute shapes into the model graph:

  • During compilation, the model is specialized for a list of compute context lengths (CCL lists)

  • During runtime, the system selects the proper CCL value based on position IDs

  • KV-cache reads, attention masks, and memory accesses are restricted to the active CCL
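The runtime selection step above can be sketched as a simple bucket lookup: given the current position IDs, pick the smallest compiled compute context length that still covers them. This is a minimal illustration only; the function name `select_ccl` and the lookup logic are assumptions, not the actual runtime implementation.

```python
def select_ccl(position_id: int, ccl_list: list[int]) -> int:
    """Pick the smallest compiled compute context length that covers
    the current position; fall back to the largest specialization."""
    for ccl in sorted(ccl_list):
        if position_id < ccl:
            return ccl
    return max(ccl_list)

# With specializations [4096, 8192, 16384, 32768], a request at
# position 5000 only needs the 8192 specialization, so attention
# reads at most the first 8192 KV-cache slots.
print(select_ccl(5000, [4096, 8192, 16384, 32768]))  # 8192
```

Because the selected CCL bounds KV-cache reads and the attention mask, short requests never pay the cost of the full maximum context.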

Environment Setup

Follow the standard vLLM installation instructions from the Qualcomm documentation, using either:

  • Using a pre-built Docker inference container

  • Installing vLLM from source

CCL does not require a separate installation. It is enabled entirely through configuration flags.

CCL Configuration

CCL can be enabled in two ways:

  1. Automatic mode (recommended)

  2. Manual CCL list specification (advanced)

Automatic Mode

In automatic mode, you only need to pass:

ccl_enabled=True

The system automatically generates optimized CCL lists for both prefilling and decoding.
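For intuition only, an automatic generator might bucket context lengths geometrically, doubling from a base size until the maximum context is covered. The scheme and the `auto_ccl_list` helper below are hypothetical illustrations, not the actual generation algorithm.

```python
def auto_ccl_list(max_ctx: int, base: int = 4096) -> list[int]:
    """Hypothetical CCL list generator: double from `base` until the
    maximum context length is covered (illustration only)."""
    ccls, ccl = [], base
    while ccl < max_ctx:
        ccls.append(ccl)
        ccl *= 2
    ccls.append(max_ctx)
    return ccls

print(auto_ccl_list(32768))  # [4096, 8192, 16384, 32768]
```

A geometric spacing like this keeps the number of specializations small while ensuring every request runs at no more than roughly twice its required context length.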

Manual Mode (Advanced)

You may optionally specify explicit CCL lists:

comp_ctx_lengths_prefill = [ ... ]
comp_ctx_lengths_decode  = [ ... ]

Rules and guidelines:

  • Prefill and decode lists must not overlap

  • The total number of CCL values across both lists should be ≤ 10

  • Larger lists increase the number of specializations and may reduce performance

If manual lists are not provided, optimized lists are generated automatically.
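The manual-mode rules above can be expressed as a small validation check: the prefill and decode lists must be disjoint, and their combined length should not exceed 10. The helper `validate_ccl_lists` is a sketch for illustration, not part of the actual API.

```python
def validate_ccl_lists(prefill: list[int], decode: list[int],
                       max_total: int = 10) -> bool:
    """Check the manual-mode CCL rules: no overlap between the prefill
    and decode lists, and at most `max_total` values in total."""
    if set(prefill) & set(decode):
        raise ValueError("prefill and decode CCL lists must not overlap")
    if len(prefill) + len(decode) > max_total:
        raise ValueError(f"at most {max_total} CCL values allowed in total")
    return True

# A valid configuration: disjoint lists, 5 values in total.
validate_ccl_lists([2048], [4096, 8192, 16384, 32768])
```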

Compilation with CCL

CCL does not require a separate compilation flow. Compilation happens automatically when the model is first loaded with ccl_enabled=True.

Key points:

  • The model is compiled once

  • Multiple CCL specializations are embedded in the same QPC

  • No recompilation is required for different prompt sizes

Launching vLLM Server with CCL

Example: launching the vLLM server online with CCL enabled:

python3 -m vllm.entrypoints.api_server \
  --host 127.0.0.1 \
  --port 8000 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --max-model-len 256 \
  --max-num-seqs 16 \
  --max-seq-len-to-capture 128 \
  --device qaic \
  --block-size 32 \
  --quantization mxfp6 \
  --kv-cache-dtype mxint8 \
  --override-qaic-config "ccl_enabled=True"
python3 examples/openai_chat_completion_client.py

Using CCL with SpD + vLLM

SpD + vLLM uses the same interface as standard vLLM.

Enable CCL by passing:

ccl_enabled=True

Optional prefill and decode CCL lists may be provided, but are not required. Automatic list generation is supported and recommended.
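If explicit lists are desired, they can be passed in the same override string as the enable flag, using the comp_ctx_lengths_prefill and comp_ctx_lengths_decode parameters described earlier. The list values below are illustrative placeholders, not tuned recommendations:

```shell
--override-qaic-config "ccl_enabled=True,comp_ctx_lengths_prefill=[2048],comp_ctx_lengths_decode=[4096,8192,16384,32768]"
```

Note that the example lists are disjoint and contain five values in total, satisfying the manual-mode rules.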

Using CCL with Disaggregated Serving

Important limitations:

  • CCL is supported only during decoding

  • Prefilling with CCL in disaggregated mode is not supported

To enable CCL during decoding:

--decode-override-qaic-config "ccl_enabled=True,comp_ctx_lengths_decode=[4096,8192,16384,32768]"

Note

Prefill runs without CCL.

Benchmarking with CCL

vLLM benchmarking with CCL uses the same scripts as standard vLLM.

Example:

python3 benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://127.0.0.1:8000 \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --sharegpt-max-input-len 128 \
  --sharegpt-max-model-len 256 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --seed 12345

End-to-End Example

  1. Install the Qualcomm Cloud AI SDK and build vLLM.

  2. Choose a model (for example, TinyLlama or Llama3).

  3. Launch vLLM with:

     --override-qaic-config "ccl_enabled=True"

  4. Send requests using OpenAI-compatible APIs.

  5. Benchmark using standard vLLM tools.

  6. Scale prompt size and generation length without recompiling.

Best Practices

  • Compile with a large maximum context (for example, 128K) if future needs are unknown

  • Avoid over-populating CCL lists
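As an illustration of these guidelines, a model compiled for a 128K maximum context might use a small, geometrically spaced decode list rather than one entry per possible length. The values below are hypothetical, not a tuned recommendation:

```shell
--override-qaic-config "ccl_enabled=True,comp_ctx_lengths_decode=[8192,32768,131072]"
```

Three well-spaced specializations keep QPC specialization count low while still avoiding full-context attention for short and medium requests.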