Disaggregated LLM Serving

These images enable prefill/decode disaggregated serving, allowing the prefill and decode stages of LLM inference to scale independently on separate device groups.

Recommended images:

  • cloud_ai_inference_vllm_disagg

  • cloud_ai_inference_vllm_085_disagg

  • cloud_ai_inference_vllm_py312_disagg

Typical usage:

  • Run the qaic-disagg entrypoint

  • Assign separate device groups to the prefill and decode stages

  • Optimize throughput and latency at scale

Example: Disaggregated serving

docker pull ghcr.io/quic/cloud_ai_inference_vllm_disagg:1.21.2.0
docker run --rm -it \
  --shm-size=2gb \
  --network host \
  -e HF_TOKEN=<your_hf_token> \
  --device /dev/accel/ \
  ghcr.io/quic/cloud_ai_inference_vllm_disagg:1.21.2.0 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --prefill-device-group 0..7 \
  --decode-device-group 8..15 \
  --port 8000
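
Once the container is up, the server can be queried over HTTP. A minimal sketch, assuming the image exposes vLLM's standard OpenAI-compatible API on the port chosen above (the /v1/completions endpoint and payload fields follow stock vLLM and have not been verified against this specific image):

```shell
# Send a completion request to the disaggregated server started above.
# Assumes vLLM's OpenAI-compatible endpoint on port 8000 (not confirmed
# for this image; adjust the path if the entrypoint differs).
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'
```

The prefill of the prompt and the decode of generated tokens are handled by the two device groups assigned at launch; the client-facing API is unchanged.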