vLLM Backend¶
The vLLM (0.10.1.1) backend for Triton is a Python-based backend designed to run supported models on the vLLM AsyncEngine.
Refer to vLLM for more information on model.json configuration parameters, environment variables, and benchmarking support.
Launch vLLM Models¶
A sample model repository for the TinyLlama model is generated at
/opt/qti-aic/aic-triton-model-repositories/vllm_model when the Triton
Docker image is built with the triton_model_repo application. You can use it as
is, or change the model by editing the model value that is passed to the
vLLM AsyncEngine.
model.json represents a key-value dictionary that is fed to the vLLM AsyncEngine; modify it as needed.
Sample model.json parameters (the # comments are illustrative; standard JSON does not support comments, so remove them in the actual file):
{
    "model": "model_name",
    "device_group": [0,1,2,3,4],          # Device IDs used for execution
    "max_num_seqs": <decode_bsz>,         # Decode batch size
    "max_model_len": <ctx_len>,           # Maximum context length
    "max_seq_len_to_capture": <seq_len>,  # Maximum sequence length to capture
    "quantization": "mxfp6",              # Weight quantization
    "kv_cache_dtype": "mxint8",           # KV cache compression
    "device": "qaic"
}
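Because standard JSON rejects `#` comments, an annotated file like the sample above must be cleaned before it can be parsed. A minimal Python sketch of such a cleanup step (the `load_model_json` helper is hypothetical, not part of the backend; it assumes `#` never appears inside a JSON string value):

```python
import json
import re

def load_model_json(text: str) -> dict:
    """Parse model.json content after dropping illustrative '#' comments.

    Hypothetical helper: assumes '#' never occurs inside a JSON string.
    """
    cleaned = "\n".join(re.sub(r"#.*$", "", line) for line in text.splitlines())
    return json.loads(cleaned)

sample = '''
{
    "model": "model_name",
    "device_group": [0, 1, 2, 3, 4],  # Device IDs used for execution
    "max_num_seqs": 16,               # Decode batch size
    "max_model_len": 2048,            # Maximum context length
    "quantization": "mxfp6",
    "kv_cache_dtype": "mxint8",
    "device": "qaic"
}
'''

cfg = load_model_json(sample)
print(cfg["device"])        # qaic
print(cfg["device_group"])  # [0, 1, 2, 3, 4]
```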
Sample config.pbtxt
Activate the vLLM virtual environment inside the Triton container before launching the Triton server:
source /opt/vllm-env/bin/activate
Set up Hugging Face credentials
huggingface-cli login --token <HF_TOKEN>
Configure the number of cores according to NSP availability
export VLLM_QAIC_NUM_CORES=16
Launch the Triton server
/opt/tritonserver/bin/tritonserver --model-repository=/opt/qti-aic/aic-triton-model-repositories/vllm_model
To use the /completions or /chat/completions endpoints, launch the OpenAI-compatible Triton server instead of running the binary above.
Prerequisites:
pip install /opt/tritonserver/python/tritonserver-*.whl
cd /opt/tritonserver/python/openai && pip install -r requirements.txt
Launch the OpenAI-compatible Triton server
python3 openai_frontend/main.py --model-repository /opt/qti-aic/aic-triton-model-repositories/vllm_model/ --tokenizer <HF_MODEL_TAG>
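Once the OpenAI-compatible frontend is up, any OpenAI-style client can talk to it. A hedged Python sketch that builds a /v1/chat/completions request body (the model name vllm_model and the port 9000 are assumptions; check your model repository and the frontend's startup log for the actual values):

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

# "vllm_model" is an assumed model name; match it to your repository.
payload = build_chat_request("vllm_model", "My name is")

# POST this body (Content-Type: application/json) to the frontend, e.g.
# http://localhost:9000/v1/chat/completions -- the port is an assumption.
print(json.dumps(payload))
```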
The Triton server may take a few minutes (depending on the model) to download and compile the model.
A sample client script (client.py) is available in the sample model repository (built as part of the Triton image using qaic-docker) at /opt/qti-aic/aic-triton-model-repositories/vllm_model/vllm_model. It can be used to interface with the Triton/vLLM inference server and can be executed from the Triton client environment.
You can also use the generate API to run inference from the Triton client
container:
curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "My name is","parameters":{"stream":false, "temperature": 0, "max_tokens":1000}}'
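The same request can be issued from Python. A sketch that mirrors the curl example above (the helper function is illustrative, and localhost:8000 is taken from the example; adjust host and port to your deployment):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str, max_tokens: int = 1000,
                           stream: bool = False, temperature: float = 0.0):
    """Return (url, body) for Triton's HTTP generate endpoint.

    Illustrative helper mirroring the curl example; host/port assumed.
    """
    url = f"http://localhost:8000/v2/models/{model}/generate"
    body = json.dumps({
        "text_input": prompt,
        "parameters": {
            "stream": stream,
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
    }).encode()
    return url, body

url, body = build_generate_request("vllm_model", "My name is")
print(url)  # http://localhost:8000/v2/models/vllm_model/generate

# To send the request against a running server:
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["text_output"])
```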
Refer to the Triton OpenAI User Guide for examples of using the OpenAI endpoints for inference and of benchmarking with the genai-perf tool.