Python Backend for LLMs

LLM serving through Triton is enabled by the triton-qaic-backend-python backend. It supports execution of QPC binaries for decoder-only (causal) models, with and without KV cache optimizations, in the LlamaForCausalLM and AutoModelForCausalLM categories.

It provides two server-to-client response modes:

  • Batch: all generated tokens are cached and returned as a single response at the end of the decode stage.

  • Decoupled (stream): each generated token is sent to the client as a separate response.
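The difference between the two modes can be sketched in a few lines of Python. This is a conceptual illustration only, not the backend's actual code; generate_tokens stands in for the decode loop producing one token at a time.

```python
# Conceptual sketch of the two response modes (illustration only).

def generate_tokens():
    """Stand-in for the decode loop emitting one token per step."""
    for tok in ["My", " name", " is", " Alice"]:
        yield tok

def batch_response():
    """Batch mode: cache every token, return a single response at the end."""
    return "".join(generate_tokens())

def stream_responses():
    """Decoupled mode: each token is sent to the client as its own response."""
    for tok in generate_tokens():
        yield tok

print(batch_response())          # one response containing the full text
print(list(stream_responses()))  # one response per generated token
```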

Two sample configurations are included: Mistral and Starcoder. Sample client scripts are also provided to test models in both stream-response and batch-response modes.

Container Setup

After starting the Triton container, activate the Qualcomm Efficient Transformers (qeff) Python virtual environment. It comprises compatible packages for compiling and executing a wide range of LLMs.

source /opt/qeff-env/bin/activate

Generate LLM Model Repository

Sample LLM Models

Model Folder          Model Type    Response Type
--------------------  ------------  -------------------
starcoder_15b         Causal        Batch
starcoder_decoupled   Causal        Decoupled (Stream)
mistral_7b            KV cache      Batch
mistral_decoupled     KV cache      Decoupled (Stream)

Pass the QPC to the generate_llm_model_repo.py script, available at /opt/qti-aic/integrations/triton/release-artifacts/llm-models/ within the Triton container.

python generate_llm_model_repo.py --model_name mistral_7b --aic_binary_dir <path/to/qpc> --python_backend_dir /opt/qti-aic/integrations/triton/backends/qaic/qaic_backend_python/

Custom LLM Models

The generate_llm_model_repo.py script uses a template to auto-generate the config for custom models. Configure required parameters such as use_kv_cache, model_name, and the decoupled transaction policy through command-line options to the script. Choosing model_type sets the use_kv_cache parameter; if model_type is not provided, it is determined by loading the QPC object, which may take several minutes for large models. The script creates a model folder in /opt/qti-aic/integrations/triton/release-artifacts/llm-models/llm_model_dir.

Configure compilation parameters such as batch_size, full_batch_size, prefill_seq_len, and ctx_len in a JSON file; this file is copied into the model repository. If no file is provided, the backend compiles the model with the default configuration.

Configuring full_batch_size triggers continuous batching mode. In this mode, the server can handle more prompts than full_batch_size. In regular mode, the set of prompts is truncated or extended to match the batch size.
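The scheduling idea behind continuous batching can be illustrated with a small simulation. This is an assumed, simplified model of the behavior, not the backend's actual scheduler: full_batch_size slots are filled from a queue, and whenever a sequence finishes, its freed slot takes the next waiting prompt.

```python
from collections import deque

# Illustrative sketch of continuous batching (assumed behavior, not the
# backend's actual scheduler).

def continuous_batching(prompts, full_batch_size, decode_steps):
    queue = deque(prompts)
    slots = [None] * full_batch_size          # active sequences
    finished = []
    while queue or any(s is not None for s in slots):
        # Fill any free slot with the next queued prompt.
        for i, seq in enumerate(slots):
            if seq is None and queue:
                slots[i] = {"prompt": queue.popleft(), "steps": 0}
        # One decode iteration across all active sequences.
        for i, seq in enumerate(slots):
            if seq is None:
                continue
            seq["steps"] += 1
            if seq["steps"] >= decode_steps:  # sequence reached its length
                finished.append(seq["prompt"])
                slots[i] = None               # slot freed for the queue
    return finished

# Five prompts served with full_batch_size=3: all complete, none rejected.
print(continuous_batching(["p1", "p2", "p3", "p4", "p5"], 3, 2))
```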

If aic_binary_dir is provided (a compiled binary is available), the backend loads this binary and skips model download and compilation.

# Sample config for compile_config.json
{
    "num_cores": 16,
    "prefill_seq_len": 32,
    "ctx_len": 256,
    "full_batch_size": 3,
    "mxfp6_matmul": true
}
  • Keys supported in the config are listed below.

'onnx_path'
'prefill_seq_len'
'ctx_len'
'batch_size'
'full_batch_size'
'num_cores'
'mxfp6_matmul'
'mxint8_kv_cache'
'aic_enable_depth_first'
'mos'
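A quick sanity check of a compile_config.json against the supported keys can catch typos before the model repository is generated. The helper below is a hypothetical convenience, not part of the release artifacts; SUPPORTED_KEYS mirrors the list above.

```python
import json

# Supported compile_config.json keys, as listed in this document.
SUPPORTED_KEYS = {
    "onnx_path", "prefill_seq_len", "ctx_len", "batch_size",
    "full_batch_size", "num_cores", "mxfp6_matmul", "mxint8_kv_cache",
    "aic_enable_depth_first", "mos",
}

def check_compile_config(text):
    """Parse the config and reject any key not in the supported set."""
    config = json.loads(text)
    unknown = set(config) - SUPPORTED_KEYS
    if unknown:
        raise ValueError(f"unsupported keys: {sorted(unknown)}")
    return config

sample = ('{"num_cores": 16, "prefill_seq_len": 32, "ctx_len": 256, '
          '"full_batch_size": 3, "mxfp6_matmul": true}')
print(check_compile_config(sample)["num_cores"])  # 16
```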
python generate_llm_model_repo.py --model_name <custom_model> \
                                  --aic_binary_dir <path/to/qpc> \
                                  --compile_config_path <path/to/compile_config.json> \
                                  --hf_model_name \
                                  --model_type <causal/kv_cache> \
                                  --decoupled

Launch Triton and Load LLMs

Prerequisite: Users must request access to the necessary models on Hugging Face and log in with a Hugging Face token using 'huggingface-cli login' before launching the server.

Launch Triton server with llm_model_dir inside the Triton container:

/opt/tritonserver/bin/tritonserver --model-repository=<path/to/llm_model_dir>

Start Client Container

docker run -it --rm -v /path/to/unzipped/apps-sdk/integrations/triton/release-artifacts/llm-models/:/llm-models --net=host nvcr.io/nvidia/tritonserver:22.12-py3-sdk bash

Once the server has started, you can run the example Triton clients (client_example_kv.py / client_example_causal.py) to submit inference requests to the loaded models.

The decoupled model transaction policy is supported only over the gRPC protocol. Therefore, the decoupled (stream-response) samples use gRPC clients, whereas the batch-response samples use an HTTP client.

# mistral_decoupled
python /llm-models/tests/stream-response/client_example_kv.py --prompt "My name is"

# mistral_decoupled (QPC compiled for batch_size=2)
python /llm-models/tests/stream-response/client_example_kv.py --prompt "My name is|Maroon bells"

# mistral_7b
python /llm-models/tests/batch-response/client_example_kv.py --prompt "My name is"

# starcoder_decoupled
python /llm-models/tests/stream-response/client_example_causal.py --prompt "Write a python program to print hello world"

# starcoder_15b
python /llm-models/tests/batch-response/client_example_causal.py --prompt "Write a python program to print hello world"

Note: For batch-response tests, the default network timeouts are 10 minutes (600 seconds) in client_example_kv.py and 100 minutes (6000 seconds) in client_example_causal.py.

# The generate API can also be used for inference from the Triton client container
curl -X POST localhost:8000/v2/models/mistral_7b/generate -d '{"prompt": "My name is","id": "42"}'
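The same request can be built from Python with the standard library. The sketch below constructs the generate API request but does not send it, since that requires a running Triton server; to send it, pass the request to urllib.request.urlopen.

```python
import json
from urllib.request import Request

# Build (but do not send) the generate API request shown in the curl
# example above. Sending it requires a running Triton server.
def build_generate_request(model_name, prompt, request_id,
                           host="localhost:8000"):
    url = f"http://{host}/v2/models/{model_name}/generate"
    body = json.dumps({"prompt": prompt, "id": request_id}).encode()
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"})

req = build_generate_request("mistral_7b", "My name is", "42")
print(req.full_url)  # http://localhost:8000/v2/models/mistral_7b/generate
# To actually send: urllib.request.urlopen(req).read()
```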