# On Device Sampling Support with vLLM
On Device Sampling enables sampling operations to be executed directly on the QAIC device rather than the host CPU. This enhancement reduces host-device communication overhead and improves inference throughput and scalability.
## Supported Sampling Strategies
The following sampling techniques are now supported natively on the QAIC device:
| Sampling Strategy | Description |
|---|---|
| Repetition Penalty | Penalize tokens that have appeared in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. Set to 1.0 to avoid penalizing. |
| Presence Penalty | Penalize tokens that are present in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. Set to 0.0 to avoid penalizing. |
| Temperature Scaling | Adjust the sharpness of the logits distribution. Lower values make the model more deterministic, while higher values make the model more random. 0.0 means greedy sampling. |
| Top K | Sample from the `k` most likely tokens. |
| Top P | Sample from the smallest set of tokens whose cumulative probability is greater than or equal to `p`. |
| Min P | The minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0.0 to disable this. |
| Greedy Sampling | Choose the token with the highest probability. |
| Random Sampling | Choose a token at random, weighted by its probability. |
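The strategies above can be sketched in plain Python. This is an illustrative reference only, not the QAIC device implementation, and the helper names (`apply_penalties`, `sample`) are hypothetical:

```python
import math
import random

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def apply_penalties(logits, seen_ids, repetition_penalty=1.0, presence_penalty=0.0):
    """Adjust logits of tokens already seen in the prompt/output."""
    out = list(logits)
    for i in seen_ids:
        # Repetition penalty divides positive logits and multiplies negative ones,
        # so values > 1 always push seen tokens toward lower probability.
        out[i] = out[i] / repetition_penalty if out[i] > 0 else out[i] * repetition_penalty
        # Presence penalty is a flat subtraction for tokens already present.
        out[i] -= presence_penalty
    return out

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, rng=random):
    """Pick one token id after temperature / top-k / top-p / min-p filtering."""
    if temperature == 0.0:  # greedy sampling: argmax of the logits
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / temperature for x in logits])  # temperature scaling
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:           # keep only the k most likely tokens
        order = order[:top_k]
    if top_p < 1.0:         # smallest set whose cumulative probability >= p
        kept, cum = [], 0.0
        for i in order:
            kept.append(i)
            cum += probs[i]
            if cum >= top_p:
                break
        order = kept
    if min_p > 0.0:         # drop tokens below min_p * probability of the top token
        cutoff = min_p * probs[order[0]]
        order = [i for i in order if probs[i] >= cutoff]
    kept_probs = [probs[i] for i in order]
    total = sum(kept_probs)
    return rng.choices(order, weights=[p / total for p in kept_probs], k=1)[0]
```

Running the on-device sampler amounts to fusing these steps into the model graph so the filtered token id, rather than the full logits tensor, crosses the device-host boundary.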
## Implementation Details
- **Sampler Integration:** Sampling logic is enabled by setting `"aic_include_sampler": True` in `override_qaic_config` while instantiating an `LLM` object.
- **Text Generation:** For each input prompt, the Repetition Penalty, Presence Penalty, Temperature Scaling, Top K, Top P, and Min P sampling strategies are enabled or disabled by providing their respective values in a `SamplingParams` object. This is optional; if one or more values are not provided, default values are used.
- **Example Usage:**

```python
sampling_params = SamplingParams(
    repetition_penalty=1.5,
    presence_penalty=0.5,
    temperature=0.01,
    top_k=512,
    top_p=0.9,
    min_p=0.95,
    n=1,
    ignore_eos=False,
    seed=0,
)

llm = LLM(
    ...
    override_qaic_config={
        # On Device Sampling
        "aic_include_sampler": True,
        "aic_return_pdfs": False,
        "max_top_k_ids": 1024,
    },
)

outputs = llm.generate(prompts, sampling_params)
```

**Note:**
- Frequency penalties are not supported by the QAIC backend; the sampler will run without them. To use frequency penalties, please use the PyTorch backend.
- By default, the QPC is compiled for `"max_top_k_ids": 512`. To use a different value, please provide `max_top_k_ids` in `override_qaic_config`. This will recompile the QPC with the updated value.
## Limitations
- Currently, On Device Sampling cannot be enabled together with any of the following features: Prefix Caching, SpD, Multimodality, or LoRAX.
- On Device Sampling works with causal language models and vision language models (VLMs).
## Example Code for On Device Sampling
```shell
python3 examples/offline_inference/qaic_on_device_sampling.py
```