Disaggregated LLM Serving
These images enable prefill/decode disaggregated serving, allowing the prefill and decode stages of LLM execution to be scaled independently.
Recommended images:
cloud_ai_inference_vllm_disagg
cloud_ai_inference_vllm_085_disagg
cloud_ai_inference_vllm_py312_disagg
Typical usage:
Run the qaic-disagg entrypoint
Assign different device groups to prefill and decode
Optimize throughput and latency at scale
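The prefill and decode device-group flags in the example below use an `A..B` range notation. As an illustration only (this helper is hypothetical, not part of the image), the ranges can be expanded into explicit device IDs, and a quick check confirms the two groups do not overlap:

```python
def expand_device_group(spec: str) -> list[int]:
    """Expand an 'A..B' range spec (inclusive on both ends) into device IDs."""
    start, end = spec.split("..")
    return list(range(int(start), int(end) + 1))

# Prefill and decode should own disjoint sets of accelerators,
# so each stage can scale without contending for devices.
prefill = expand_device_group("0..7")    # devices 0 through 7
decode = expand_device_group("8..15")    # devices 8 through 15
assert not set(prefill) & set(decode), "device groups must not overlap"
```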
Example: Disaggregated serving
docker pull ghcr.io/quic/cloud_ai_inference_vllm_disagg:1.21.2.0
docker run --rm -it \
--shm-size=2gb \
--network host \
-e HF_TOKEN=<your_hf_token> \
--device /dev/accel/ \
ghcr.io/quic/cloud_ai_inference_vllm_disagg:1.21.2.0 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--prefill-device-group 0..7 \
--decode-device-group 8..15 \
--port 8000
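Once the container is running, the server should accept requests on the chosen port. As an assumption based on upstream vLLM (the endpoint path and payload shape follow vLLM's OpenAI-compatible API and may differ in this image), a minimal client-side sketch of building a completion request looks like:

```python
import json

# Hypothetical request body for the served model; field names follow
# upstream vLLM's OpenAI-compatible completions API.
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "Explain disaggregated serving in one sentence.",
    "max_tokens": 64,
}
body = json.dumps(payload)
url = "http://localhost:8000/v1/completions"

# The equivalent request can be sent with:
#   curl -X POST <url> -H 'Content-Type: application/json' -d '<body>'
print(body)
```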