Build vLLM (Optional)
This guide demonstrates how to add Qualcomm Cloud AI backend support to the vLLM open-source library, which simplifies the creation of OpenAI-compatible web endpoints and provides features like continuous batching and other optimizations for LLM inference and serving.
Note
Support for multiple vLLM versions is available; select the vLLM version that aligns with your target model and feature requirements. For more details, refer to the Cross Feature Support Matrix.
Running a sample inference

```shell
python examples/offline_inference/qaic.py
```
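For reference, an offline inference script typically follows the pattern sketched below. This is a hedged illustration only, loosely modeled on what `examples/offline_inference/qaic.py` would contain: the model name is an assumption, and backend selection is presumed to come from the Qualcomm Cloud AI build of vLLM rather than any flag shown here.

```python
# Sketch of offline inference with vLLM (illustrative, not the exact
# contents of examples/offline_inference/qaic.py). Model name is an
# assumption; adjust to your setup.

PROMPTS = ["Hello, my name is", "The capital of France is"]

def format_output(prompt: str, completion: str) -> str:
    """Render one prompt/completion pair for display."""
    return f"Prompt: {prompt!r} -> Generated: {completion!r}"

def main() -> None:
    # vLLM is imported lazily so the module can be inspected without
    # the library or an AI 100 accelerator present.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    # The qaic-enabled build of vLLM is assumed to route execution to
    # the Cloud AI backend.
    llm = LLM(model="meta-llama/Llama-3.1-8B")
    for out in llm.generate(PROMPTS, sampling_params):
        print(format_output(out.prompt, out.outputs[0].text))

if __name__ == "__main__":
    main()
```

Running the script requires a vLLM build with Qualcomm Cloud AI support and an available accelerator.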
vLLM Deployment, Supported Features and Capabilities
The vLLM deployment workflow describes the end-to-end process for serving Large Language Models (LLMs) and Vision-Language Models (VLMs) using vLLM on Qualcomm® Cloud AI accelerators.
The vLLM supported features and capabilities section highlights the key functionalities available when running vLLM with Qualcomm Cloud AI backend support. These include optimized model execution, continuous batching, advanced quantization options, large-context support, and integration with performance-enhancing features such as prefix caching and disaggregated serving. Together, these capabilities enable efficient, flexible, and production-ready inference for modern LLM and VLM workloads on AI 100 accelerators.
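Since vLLM exposes an OpenAI-compatible web endpoint, a deployed server can be queried with any standard HTTP client. The sketch below uses only the Python standard library; the server URL and model name are assumptions for illustration, and the `/v1/completions` path is the standard OpenAI-style route that vLLM serves.

```python
# Hypothetical client for a locally running vLLM OpenAI-compatible
# server. URL and model name are assumptions, not from this guide.
import json
from urllib import request

def build_completion_request(model: str, prompt: str, max_tokens: int = 32) -> bytes:
    """Encode an OpenAI-style /v1/completions request body as JSON bytes."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return json.dumps(payload).encode("utf-8")

def main() -> None:
    body = build_completion_request("meta-llama/Llama-3.1-8B", "Hello")
    req = request.Request(
        "http://localhost:8000/v1/completions",  # assumed local server address
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        # The response follows the OpenAI completions schema.
        print(json.load(resp)["choices"][0]["text"])

if __name__ == "__main__":
    main()
```

The same endpoint works with off-the-shelf OpenAI client libraries pointed at the server's base URL, which is what makes the continuous-batching and serving optimizations transparent to client code.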