QEfficient Auto Classes
QEFFAutoModelForCausalLM
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCausalLM(model: Module, continuous_batching: bool = False, qaic_config: dict | None = None, **kwargs)[source]
QEfficient class for Causal Language Models from the HuggingFace hub (e.g., GPT-2, Llama).
This class provides a unified interface for loading, exporting, compiling, and generating text with causal language models on Cloud AI 100 hardware. It supports features like continuous batching, speculative decoding (TLM), and on-device sampling.
Example
from QEfficient import QEFFAutoModelForCausalLM
from transformers import AutoTokenizer

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")
model.compile(num_cores=16)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)
High-Level API
- classmethod QEFFAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, continuous_batching: bool = False, qaic_config: dict | None = None, *args, **kwargs)[source]
Load a QEfficient Causal Language Model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize a QEfficient Causal Language Model. The interface is similar to transformers.AutoModelForCausalLM.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
continuous_batching (bool, optional) – Whether this model will be used for continuous batching in the future. If not set to True here, the model cannot be exported/compiled for continuous batching later. Default is False.
qaic_config (dict, optional) –
QAIC config dictionary. Supported keys include:
speculative_model_type (str): Specify Speculative Decoding Target Language Models.
include_sampler (bool): Enable/disable sampling of next tokens.
return_pdfs (bool): Return probability distributions along with sampled next tokens. For a Speculative Decoding Target Language Model, return_pdfs=True always. Otherwise, return_pdfs=True for a Speculative Decoding Draft Language Model and return_pdfs=False for a regular model.
max_top_k_ids (int): Maximum number of top K tokens (<= vocab size) to consider during sampling. The values provided in the top_ks tensor must be less than this maximum limit.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance initialized with the pretrained weights.
- Return type:
QEFFAutoModelForCausalLM
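For example, continuous batching and on-device sampling can be enabled at load time. The following is a minimal sketch; the model card and qaic_config values are illustrative, not defaults:

from QEfficient import QEFFAutoModelForCausalLM

# Illustrative: enable continuous batching now so the model can later be
# compiled with a full_batch_size, and turn on on-device sampling.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "gpt2",                       # HuggingFace model card or local path
    continuous_batching=True,     # must be set here to compile for continuous batching later
    qaic_config={
        "include_sampler": True,  # sample next tokens on device
        "return_pdfs": False,     # do not return probability distributions
        "max_top_k_ids": 512,     # upper bound on the top_ks values used at runtime
    },
)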
- QEFFAutoModelForCausalLM.export(export_dir: str | None = None) str [source]
Export the model to ONNX format using torch.onnx.export.
This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware. It handles KV cache inputs/outputs and sampler-related inputs.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
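A minimal usage sketch, continuing from the from_pretrained example above (the export directory is illustrative):

# Export to the default directory; the returned value is the ONNX graph path.
onnx_path = model.export()

# Or export to an explicit directory (illustrative path).
onnx_path = model.export(export_dir="./qeff_onnx/gpt2")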
- QEFFAutoModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, **compiler_options) str [source]
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.
prefill_seq_len (int, optional) – Length of the prefill prompt. Default is 32.
ctx_len (int, optional) – Maximum context length the compiled model can remember. Default is 128.
batch_size (int, optional) – Batch size. Default is 1.
full_batch_size (int, optional) – Continuous batching batch size. Required if continuous_batching=True was set during from_pretrained.
kv_cache_batch_size (int, optional) – Batch size for KV cache. If not provided, it defaults to full_batch_size (if continuous batching) or batch_size.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation. Default is 16.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
num_speculative_tokens (int, optional) – Number of speculative tokens for Speculative Decoding Target Language Model. Required if the model is configured as a Target Language Model (is_tlm=True).
prefill_only (bool, optional) – If True, compiles only for the prefill stage. If False, compiles only for the decode stage. If None, compiles for both stages. Default is None.
**compiler_options (dict) –
Additional compiler options for QAIC or QNN compilers.
For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as follows:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN compilation.
qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
- Raises:
TypeError – If prefill_only is not a boolean. If full_batch_size is None when continuous_batching is True. If num_speculative_tokens is None when the model is a TLM.
ValueError – If KV caching is requested without continuous batching (full_batch_size). If include_sampler is True and num_speculative_tokens is greater than 0. If num_speculative_tokens is not an integer greater than 1. If prefill_seq_len is less than num_speculative_tokens + 1 for TLM models.
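A compile sketch, continuing from the continuous-batching example above. All values are illustrative; the extra keyword arguments are forwarded to qaic-exec as flags, per the conversion rule listed above:

qpc_path = model.compile(
    prefill_seq_len=128,
    ctx_len=2048,
    full_batch_size=8,            # required because continuous_batching=True was set at load time
    num_devices=1,
    num_cores=16,
    mxfp6_matmul=True,            # MXFP6 weight compression
    mxint8_kv_cache=True,         # MXINT8 KV-cache compression
    aic_enable_depth_first=True,  # forwarded as -aic-enable-depth-first
    mos=1,                        # forwarded as -mos=1
)
print(qpc_path)  # path to the generated QPC package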
- QEFFAutoModelForCausalLM.generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], device_id: List[int] | None = None, runtime_ai100: bool = True, **kwargs)[source]
Generate output by executing the compiled QPC on Cloud AI 100 hardware.
This method runs sequential execution based on the compiled model’s batch size and the number of prompts. If the number of prompts is not divisible by the batch size, the last incomplete batch will be dropped.
- Parameters:
tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – Tokenizer for the model.
prompts (list of str) – List of prompts to generate output for.
device_id (list of int, optional) – Device IDs for running the QPC. Defaults to [0] if not specified.
runtime_ai100 (bool, optional) – Whether to use AI 100 runtime. Default is True.
**kwargs – Additional keyword arguments. Currently supports generation_len (int, optional): the maximum number of tokens to generate.
- Returns:
Output from the AI 100 runtime, containing generated IDs and performance metrics.
- Return type:
CloudAI100ExecInfoNew
- Raises:
TypeError – If the QPC path is not set (i.e., compile was not run).
NotImplementedError – If runtime_ai100 is False.
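A short generation sketch, continuing from the examples above (device IDs and generation length are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
exec_info = model.generate(
    tokenizer=tokenizer,
    prompts=["Hello, my name is", "The capital of France is"],
    device_id=[0],       # defaults to [0] when omitted
    generation_len=64,   # maximum number of tokens to generate
)
print(exec_info.generated_ids)  # generated token IDs plus performance metrics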
QEFFAutoModel
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModel(model: Module, pooling=None, **kwargs)[source]
QEfficient class for general transformer models from the HuggingFace hub (e.g., BERT, Sentence Transformers).
This class provides a unified interface for loading, exporting, compiling, and running various encoder-only transformer models on Cloud AI 100 hardware. It supports pooling for embedding extraction.
Example
from QEfficient import QEFFAutoModel
from transformers import AutoTokenizer

model = QEFFAutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", pooling="mean")
model.compile(num_cores=16)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("My name is", return_tensors="pt")
output = model.generate(inputs)
print(output)  # Output will be a dictionary containing extracted features.
High-Level API
- classmethod QEFFAutoModel.from_pretrained(pretrained_model_name_or_path, pooling=None, *args, **kwargs)[source]
Load a QEfficient transformer model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize a QEfficient transformer model. The interface is similar to transformers.AutoModel.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
pooling (str or Callable, optional) – The pooling method to use. Options include: "mean" (mean pooling), "max" (max pooling), "cls" (CLS token pooling), "avg" (average pooling), a Callable (custom pooling function), or None (no pooling applied). Default is None.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance initialized with the pretrained weights.
- Return type:
QEFFAutoModel
- QEFFAutoModel.export(export_dir: str | None = None) str [source]
Export the model to ONNX format using torch.onnx.export.
This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
- QEFFAutoModel.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, seq_len: int | List[int] = 32, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, **compiler_options) str [source]
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.
seq_len (int or list of int, optional) – The length(s) of the prompt(s) to compile for. Can be a single integer or a list of integers to create multiple specializations. Default is 32.
batch_size (int, optional) – Batch size. Default is 1.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation. Default is 16.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
**compiler_options (dict) –
Additional compiler options for QAIC or QNN compilers. These are passed directly to the underlying compilation command.
For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as follows:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN compilation.
qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
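For example, a list of sequence lengths creates one specialization per entry. Continuing from the class example above (values illustrative):

# Compile specializations for several prompt lengths into a single QPC.
qpc_path = model.compile(
    seq_len=[32, 64, 128],   # one specialization per listed length
    batch_size=1,
    num_cores=16,
    mxfp6_matmul=False,
)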
- QEFFAutoModel.generate(inputs: Tensor, device_ids: List[int] | None = None, runtime_ai100: bool = True) Tensor | ndarray [source]
Generate output by executing the compiled QPC on Cloud AI 100 hardware or using PyTorch runtime.
This method runs sequential execution based on the compiled model’s batch size and the provided inputs. If the number of inputs is not divisible by the batch size, the last incomplete batch will be dropped.
- Parameters:
inputs (torch.Tensor or np.ndarray) – Input data for the model. For AI 100 runtime, this typically includes input_ids and attention_mask.
device_ids (list of int, optional) – Device IDs for running the QPC. Defaults to [0] if not specified and runtime_ai100 is True.
runtime_ai100 (bool, optional) – Whether to use the AI 100 runtime for inference. If False, the PyTorch runtime will be used. Default is True.
- Returns:
Output from the AI 100 or PyTorch runtime. The type depends on the runtime and model.
- Return type:
torch.Tensor or np.ndarray
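A minimal sketch of both runtimes, continuing from the class example above:

inputs = tokenizer("My name is", return_tensors="pt")

# Cloud AI 100 runtime (default); device_ids defaults to [0] when omitted.
features = model.generate(inputs, device_ids=[0])

# PyTorch runtime, useful for quick functional checks without hardware.
features_pt = model.generate(inputs, runtime_ai100=False)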
QEffAutoPeftModelForCausalLM
- class QEfficient.peft.auto.QEffAutoPeftModelForCausalLM(model: Module)[source]
QEfficient class for loading and running Causal Language Models with PEFT adapters (currently only LoRA is supported).
This class enables efficient inference and deployment of PEFT-adapted models on Cloud AI 100 hardware. Once exported and compiled for an adapter, the same base model can be reused with other compatible adapters.
Example
from transformers import AutoTokenizer, TextStreamer
from QEfficient import QEffAutoPeftModelForCausalLM

base_model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
streamer = TextStreamer(tokenizer)
m = QEffAutoPeftModelForCausalLM.from_pretrained("predibase/magicoder", "magicoder")
m.export()
m.compile(prefill_seq_len=32, ctx_len=1024)

# Magicoder adapter
m.set_adapter("magicoder")
inputs = tokenizer("def fibonacci", return_tensors="pt")
m.generate(**inputs, streamer=streamer, max_new_tokens=1024)

# Math problems
m.load_adapter("predibase/gsm8k", "gsm8k")
m.set_adapter("gsm8k")
inputs = tokenizer("James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?", return_tensors="pt")
m.generate(**inputs, streamer=streamer, max_new_tokens=1024)
High-Level API
- classmethod QEffAutoPeftModelForCausalLM.from_pretrained(pretrained_name_or_path: str, *args, **kwargs)[source]
Load a QEffAutoPeftModelForCausalLM from a pretrained model and adapter.
- Parameters:
pretrained_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
finite_adapters (bool, optional) – Set to True to enable finite adapter mode, which uses the QEffAutoLoraModelForCausalLM class.
adapter_name (str, optional) – Name used to identify the loaded adapter.
*args – Additional positional arguments for peft.AutoPeftModelForCausalLM.
**kwargs – Additional keyword arguments for peft.AutoPeftModelForCausalLM.
- Returns:
An instance initialized with the pretrained weights and adapter.
- Return type:
QEffAutoPeftModelForCausalLM
- Raises:
NotImplementedError – If continuous batching is requested (not supported).
TypeError – If adapter name is missing in finite adapter mode.
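For example, finite adapter mode requires an adapter name at load time. A sketch, using the same adapter card as the class example above:

from QEfficient import QEffAutoPeftModelForCausalLM

# Finite adapter mode is backed by QEffAutoLoraModelForCausalLM; adapter_name is
# mandatory in this mode.
m = QEffAutoPeftModelForCausalLM.from_pretrained(
    "predibase/magicoder",
    adapter_name="magicoder",
    finite_adapters=True,
)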
- QEffAutoPeftModelForCausalLM.export(export_dir: str | None = None) str [source]
Export the model with the active adapter to ONNX format.
This method prepares example inputs and dynamic axes based on the model and adapter configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
- QEffAutoPeftModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, batch_size: int = 1, prefill_seq_len: int, ctx_len: int, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, **compiler_options) str [source]
Compile the exported ONNX model for Cloud AI 100 hardware.
This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the QAIC compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model.
compile_dir (str, optional) – Directory to save the generated QPC package.
batch_size (int, optional) – Batch size for compilation. Default is 1.
prefill_seq_len (int) – Length of the prefill prompt.
ctx_len (int) – Maximum context length the compiled model can remember.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation. Default is 16.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
**compiler_options –
Additional compiler options for QAIC.
For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as follows:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN compilation.
qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
- QEffAutoPeftModelForCausalLM.generate(inputs: Tensor | ndarray | None = None, device_ids: List[int] | None = None, generation_config: GenerationConfig | None = None, stopping_criteria: StoppingCriteria | None = None, streamer: BaseStreamer | None = None, **kwargs) ndarray [source]
Generate tokens from the compiled binary using the active adapter.
This method takes parameters similar to HuggingFace’s model.generate() method.
- Parameters:
inputs (torch.Tensor or np.ndarray, optional) – Input IDs for generation.
device_ids (List[int], optional) – Device IDs for running inference.
generation_config (GenerationConfig, optional) – Generation configuration to merge with model-specific config.
stopping_criteria (StoppingCriteria, optional) – Custom stopping criteria for generation.
streamer (BaseStreamer, optional) – Streamer to receive generated tokens.
**kwargs – Additional parameters for generation_config or to be passed to the model.
- Returns:
Generated token IDs.
- Return type:
np.ndarray
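A short sketch combining the documented arguments, continuing from the class example above (m, tokenizer, and streamer already created; the GenerationConfig values are illustrative):

from transformers import GenerationConfig

gen_cfg = GenerationConfig(max_new_tokens=256, do_sample=False)
inputs = tokenizer("def quicksort(arr):", return_tensors="pt")

# generation_config is merged with the model's own generation config.
output_ids = m.generate(**inputs, generation_config=gen_cfg, streamer=streamer)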
QEffAutoLoraModelForCausalLM
- class QEfficient.peft.lora.auto.QEffAutoLoraModelForCausalLM(model: Module, continuous_batching: bool = False, **kwargs)[source]
QEfficient class for loading models with multiple LoRA adapters for causal language modeling.
This class enables mixed batch inference with different adapters on Cloud AI 100 hardware. Currently, only Mistral and Llama models are supported. Once exported and compiled, the QPC can perform mixed batch inference using the prompt_to_adapter_mapping argument.
Example
from QEfficient.peft.lora import QEffAutoLoraModelForCausalLM
from transformers import AutoTokenizer

m = QEffAutoLoraModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", num_hidden_layers=1)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
m.load_adapter("predibase/gsm8k", "gsm8k")
m.load_adapter("predibase/magicoder", "magicoder")
m.compile()

prompts = ["code prompt", "math prompt", "generic"]
m.generate(prompts=prompts, tokenizer=tokenizer, prompt_to_adapter_mapping=["magicoder", "gsm8k", "base"])
High-Level API
- classmethod QEffAutoLoraModelForCausalLM.from_pretrained(pretrained_model_name_or_path, continuous_batching: bool = False, qaic_config: dict | None = None, *args, **kwargs)
Load a QEfficient Causal Language Model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize a QEfficient Causal Language Model. The interface is similar to transformers.AutoModelForCausalLM.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
continuous_batching (bool, optional) – Whether this model will be used for continuous batching in the future. If not set to True here, the model cannot be exported/compiled for continuous batching later. Default is False.
qaic_config (dict, optional) –
QAIC config dictionary. Supported keys include:
speculative_model_type (str): Specify Speculative Decoding Target Language Models.
include_sampler (bool): Enable/disable sampling of next tokens.
return_pdfs (bool): Return probability distributions along with sampled next tokens. For a Speculative Decoding Target Language Model, return_pdfs=True always. Otherwise, return_pdfs=True for a Speculative Decoding Draft Language Model and return_pdfs=False for a regular model.
max_top_k_ids (int): Maximum number of top K tokens (<= vocab size) to consider during sampling. The values provided in the top_ks tensor must be less than this maximum limit.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance initialized with the pretrained weights.
- Return type:
QEFFAutoModelForCausalLM
- QEffAutoLoraModelForCausalLM.export(export_dir: str | None = None) str [source]
Export the model with all loaded adapters to ONNX format using torch.onnx.export.
The exported ONNX graph will support mixed batch inference with multiple adapters.
- Parameters:
export_dir (str, optional) – Directory to save the exported ONNX graph. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph.
- Return type:
str
- Raises:
ValueError – If no adapters are loaded.
- QEffAutoLoraModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, **compiler_options) str
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.
prefill_seq_len (int, optional) – Length of the prefill prompt. Default is 32.
ctx_len (int, optional) – Maximum context length the compiled model can remember. Default is 128.
batch_size (int, optional) – Batch size. Default is 1.
full_batch_size (int, optional) – Continuous batching batch size. Required if continuous_batching=True was set during from_pretrained.
kv_cache_batch_size (int, optional) – Batch size for KV cache. If not provided, it defaults to full_batch_size (if continuous batching) or batch_size.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation. Default is 16.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
num_speculative_tokens (int, optional) – Number of speculative tokens for Speculative Decoding Target Language Model. Required if the model is configured as a Target Language Model (is_tlm=True).
prefill_only (bool, optional) – If True, compiles only for the prefill stage. If False, compiles only for the decode stage. If None, compiles for both stages. Default is None.
**compiler_options (dict) –
Additional compiler options for QAIC or QNN compilers.
For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as follows:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN compilation.
qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
- Raises:
TypeError – If prefill_only is not a boolean. If full_batch_size is None when continuous_batching is True. If num_speculative_tokens is None when the model is a TLM.
ValueError – If KV caching is requested without continuous batching (full_batch_size). If include_sampler is True and num_speculative_tokens is greater than 0. If num_speculative_tokens is not an integer greater than 1. If prefill_seq_len is less than num_speculative_tokens + 1 for TLM models.
- QEffAutoLoraModelForCausalLM.generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], prompt_to_adapter_mapping: List[str] | None = None, device_id: List[int] | None = None, runtime: str | None = 'AI_100', **kwargs)[source]
Generate output for a batch of prompts using the compiled QPC on Cloud AI 100 hardware.
This method supports mixed batch inference, where each prompt can use a different adapter as specified by prompt_to_adapter_mapping. If the number of prompts is not divisible by the compiled batch size, the last incomplete batch will be dropped.
- Parameters:
tokenizer (PreTrainedTokenizerFast or PreTrainedTokenizer) – Tokenizer used for inference.
prompts (List[str]) – List of prompts to generate outputs for.
prompt_to_adapter_mapping (List[str], optional) – List of adapter names to use for each prompt. Use "base" for the base model (no adapter).
device_id (List[int], optional) – Device IDs to use for execution. If None, auto-device-picker is used.
runtime (str, optional) – Runtime to use. Only “AI_100” is currently supported. Default is “AI_100”.
**kwargs – Additional generation parameters.
- Returns:
Model outputs for each prompt.
- Raises:
ValueError – If runtime is not “AI_100”.
TypeError – If the model has not been compiled.
RuntimeError – If the number of prompts does not match the number of adapter mappings.
QEFFAutoModelForImageTextToText
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForImageTextToText(model: Module, kv_offload: bool | None = True, **kwargs)[source]
QEfficient class for multimodal (image-text-to-text) models from the HuggingFace hub.
This class supports both single and dual QPC (Quantized Package Compilation) approaches for efficient deployment on Cloud AI 100 hardware. It is recommended to use the from_pretrained method for initialization.
Example
import requests
from PIL import Image
from transformers import AutoProcessor, TextStreamer
from QEfficient import QEFFAutoModelForImageTextToText

HF_TOKEN = ""  # Your HuggingFace token if needed
model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
query = "Describe this image."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"

# STEP 1: Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=HF_TOKEN)
model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_name,
    token=HF_TOKEN,
    attn_implementation="eager",
    kv_offload=False,  # kv_offload=False for single QPC
)

# STEP 2: Export & Compile
model.compile(
    prefill_seq_len=32,
    ctx_len=512,
    img_size=560,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=False,
)

# STEP 3: Prepare inputs
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": query},
        ],
    }
]
input_text = [processor.apply_chat_template(messages, add_generation_prompt=True)]
inputs = processor(
    text=input_text,
    images=image,
    return_tensors="pt",
    add_special_tokens=False,
    padding="max_length",  # Consider padding strategy if max_length is crucial
    max_length=32,
)

# STEP 4: Run inference
streamer = TextStreamer(processor.tokenizer)
model.generate(inputs=inputs, streamer=streamer, generation_len=512)
High-Level API
- classmethod QEFFAutoModelForImageTextToText.from_pretrained(pretrained_model_name_or_path: str, kv_offload: bool | None = None, **kwargs)[source]
Load a QEfficient image-text-to-text model from a pretrained HuggingFace model or local path.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
kv_offload (bool, optional) – If True, uses the dual QPC approach (vision encoder KV offloaded). If False, uses the single QPC approach (entire model in one QPC). If None, the default behavior of the internal classes is used (typically dual QPC).
**kwargs –
Additional arguments passed to HuggingFace’s from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility. continuous_batching is not supported for image-text-to-text models.
- Returns:
An instance initialized with the pretrained weights, wrapped for QEfficient.
- Return type:
QEFFAutoModelForImageTextToText
- Raises:
NotImplementedError – If continuous_batching is provided as True.
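For example, the dual and single QPC approaches are selected via kv_offload alone. A sketch, using the same model card as the class example:

from QEfficient import QEFFAutoModelForImageTextToText

# Dual QPC approach: vision encoder and language model in separate packages.
model_dual = QEFFAutoModelForImageTextToText.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    kv_offload=True,
)

# Single QPC approach: the entire model in one package.
model_single = QEFFAutoModelForImageTextToText.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    kv_offload=False,
)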
QEFFAutoModelForSpeechSeq2Seq
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq(*args, **kwargs)[source]
QEfficient class for sequence-to-sequence speech-to-text models (e.g., Whisper, Encoder-Decoder speech models).
This class enables efficient export, compilation, and inference of speech models on Cloud AI 100 hardware. It is recommended to use the from_pretrained method for initialization.
Example
from datasets import load_dataset
from transformers import AutoProcessor
from QEfficient import QEFFAutoModelForSpeechSeq2Seq

base_model_name = "openai/whisper-tiny"

## STEP 1 -- Load an audio sample from a standard English dataset (specific files can be
## loaded instead if longer audio needs to be tested) and the initial processor.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
data = ds[0]["audio"]["array"]
data = data.reshape(-1)  # reshape so the shape corresponds to data with batch size 1
sample_rate = ds[0]["audio"]["sampling_rate"]
processor = AutoProcessor.from_pretrained(base_model_name)

## STEP 2 -- Initialize the base model
qeff_model = QEFFAutoModelForSpeechSeq2Seq.from_pretrained(base_model_name)

## STEP 3 -- Export and compile the model
qeff_model.compile()

## STEP 4 -- Generate output for the loaded input and processor
exec_info = qeff_model.generate(
    inputs=processor(data, sampling_rate=sample_rate, return_tensors="pt"),
    generation_len=25,
)

## STEP 5 (optional) -- Use the processor to decode the output
print(processor.batch_decode(exec_info.generated_ids)[0])
High-Level API
- classmethod QEFFAutoModelForSpeechSeq2Seq.from_pretrained(pretrained_model_name_or_path: str, *args, **kwargs)
Load a QEfficient transformer model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize any QEfficient transformer model. The interface is similar to transformers.AutoModel.from_pretrained.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance of the specific QEFFAutoModel subclass, initialized with the pretrained weights.
- Return type:
QEFFTransformersBase
- QEFFAutoModelForSpeechSeq2Seq.export(export_dir: str | None = None) str [source]
Export the model to ONNX format using torch.onnx.export.
This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
- QEFFAutoModelForSpeechSeq2Seq.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int | None = 1, encoder_ctx_len: int | None = None, ctx_len: int = 150, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, **compiler_options) str [source]
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package.
prefill_seq_len (int, optional) – Prefill sequence length. This parameter typically has little effect on SpeechSeq2Seq decoder compilation, since the first decoder input has seq_len=1. Default is 1.
encoder_ctx_len (int, optional) – Maximum context length for the encoder part of the model. If None, it’s inferred from the model configuration or defaults (e.g., 1500 for Whisper).
ctx_len (int, optional) – Maximum decoder context length. This defines the maximum output sequence length the compiled model can handle. Default is 150.
batch_size (int, optional) – Batch size. Default is 1.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation. Default is 16.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
full_batch_size (int, optional) – Not yet supported for this model.
kv_cache_batch_size (int, optional) – Not yet supported for this model.
num_speculative_tokens (int, optional) – Not yet supported for this model.
**compiler_options (dict) –
Additional compiler options for QAIC.
For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as follows:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
- Returns:
Path to the compiled QPC package.
- Return type:
str
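A compile sketch for the Whisper model loaded in the class example above (values illustrative; encoder_ctx_len is left to the model default):

qpc_path = qeff_model.compile(
    ctx_len=150,        # maximum decoder output length
    batch_size=1,
    num_devices=1,
    num_cores=16,
    mxfp6_matmul=False,
)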
- QEFFAutoModelForSpeechSeq2Seq.generate(inputs: Tensor, generation_len: int, streamer: TextStreamer | None = None, device_ids: List[int] | None = None) Tensor | ndarray [source]
Generate output until the <|endoftext|> token or generation_len is reached, by executing the compiled QPC on Cloud AI 100 hardware.
This method performs sequential execution based on the compiled model’s batch size and the provided audio tensors. It manages the iterative decoding process and KV cache.
- Parameters:
inputs (Dict[str, np.ndarray]) – Model inputs for inference, typically a dictionary containing: - input_features (np.ndarray): Preprocessed audio features. - decoder_input_ids (np.ndarray): Initial decoder input IDs (e.g., start token). - decoder_position_ids (np.ndarray): Initial decoder position IDs. These should be prepared to match the compiled model’s expectations.
generation_len (int) – Maximum number of tokens to generate. The generation stops if this limit is reached or the model generates an end-of-sequence token.
streamer (TextStreamer, optional) – Streamer to receive generated tokens in real-time. Default is None.
device_ids (List[int], optional) – Device IDs for running the QPC. Defaults to [0] if not specified.
- Returns:
Output from the AI 100 runtime, including generated IDs and performance metrics.
- Return type:
CloudAI100ExecInfoNew
- Raises:
TypeError – If the QPC path is not set (i.e., compile was not run).