LLM Workflows with vLLM
These images are optimized for large language model inference using vLLM and provide OpenAI-compatible APIs out of the box.
Recommended images:

- `cloud_ai_inference_vllm`
- `cloud_ai_inference_vllm_085`
- `cloud_ai_inference_vllm_py312` (for Python 3.12 / gpt-oss)
Typical usage:

- Launch the vLLM server directly
- Serve models through OpenAI-compatible endpoints
- Integrate with existing LLM clients
Example: vLLM inference server
```bash
docker pull ghcr.io/quic/cloud_ai_inference_vllm:1.21.2.0

docker run --rm -it \
    --shm-size=2gb \
    --network host \
    -e HF_TOKEN=<your_hf_token> \
    -v "$PWD/hf_cache":/workspace/hf_cache \
    --device /dev/accel/ \
    ghcr.io/quic/cloud_ai_inference_vllm:1.21.2.0 \
    --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --port 8000
```
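Once the server is up, any OpenAI-compatible client can talk to it. The sketch below builds a request payload for the standard `/v1/chat/completions` endpoint using only the Python standard library; the model name and port are taken from the docker example above, and `build_chat_request` is a hypothetical helper, not part of vLLM. The actual HTTP call is shown as a comment since it requires the server to be running.

```python
import json


def build_chat_request(prompt: str,
                       model: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                       max_tokens: int = 128) -> str:
    """Build a JSON body for an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)


body = build_chat_request("Say hello in one sentence.")

# To send it against the server started above (listening on port 8000):
#   curl http://localhost:8000/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d "$BODY"
print(body)
```

Because the endpoint follows the OpenAI API schema, the official `openai` Python client also works by pointing its `base_url` at `http://localhost:8000/v1`.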