Model Serving

Modern LLM applications require scalable, high‑performance serving engines that can host large models efficiently and deliver low‑latency inference. The Cloud AI SDK supports multiple serving backends: vLLM, Triton, and Text Generation Inference (TGI). This gives developers flexibility in how models are deployed, optimized, and integrated into production workflows.
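
To make the backend choice concrete, here is a minimal sketch of querying a model behind an OpenAI‑compatible HTTP endpoint, such as the one vLLM exposes by default. The server URL, API key, and model name are placeholders, and the Cloud AI SDK may wrap this step with its own client; this sketch only illustrates the raw serving interface.

```python
# Minimal sketch: querying a vLLM server through its OpenAI-compatible API.
# Assumes a server was started separately, e.g. with:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct
# The URL, key, and model name below are placeholders, not SDK defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM accepts any key unless one is configured
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize model serving in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Because vLLM and TGI can both speak this OpenAI‑compatible protocol, client code like the above can often be pointed at either backend by changing only the `base_url` and `model` values.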

This section provides a unified view of these serving options and helps you select the right runtime for your workload.