LLM Workflows with vLLM

These images are optimized for large language model inference using vLLM and provide OpenAI-compatible APIs out of the box.

Recommended images:

  • cloud_ai_inference_vllm

  • cloud_ai_inference_vllm_085

  • cloud_ai_inference_vllm_py312 (for Python 3.12 / gpt-oss)

Typical usage:

  • Launch the vLLM server directly

  • Serve models using OpenAI-compatible endpoints

  • Integrate with existing LLM clients

Example: vLLM inference server

docker pull ghcr.io/quic/cloud_ai_inference_vllm:1.21.2.0
docker run --rm -it \
  --shm-size=2gb \
  --network host \
  -e HF_TOKEN=<your_hf_token> \
  -v "$PWD/hf_cache":/workspace/hf_cache \
  --device /dev/accel/ \
  ghcr.io/quic/cloud_ai_inference_vllm:1.21.2.0 \
  --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --port 8000
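
Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal stdlib-only sketch that builds a chat-completion request against the server started above; the base URL (http://localhost:8000/v1) assumes vLLM's default OpenAI-compatible API root, and the model name must match the --model flag passed to the container.

```python
"""Minimal client for a vLLM OpenAI-compatible endpoint (stdlib only).

Assumes the server from the docker example above is listening on
localhost:8000 with vLLM's default /v1 API root.
"""
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM API root


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "Hello!")
    with urllib.request.urlopen(req) as resp:  # requires a running server
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

The same endpoint also works with the official openai Python package by pointing its base_url at http://localhost:8000/v1; any placeholder API key is accepted unless the server was started with --api-key.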