QEfficient Auto Classes
QEFFAutoModelForCausalLM
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCausalLM(model: Module, continuous_batching: bool = False, qaic_config: dict | None = None, max_seq_len_cached: int | None = None, **kwargs)[source]
QEfficient class for Causal Language Models from the HuggingFace hub (e.g., GPT-2, Llama).
This class provides a unified interface for loading, exporting, compiling, and generating text with causal language models on Cloud AI 100 hardware. It supports features like continuous batching, speculative decoding (TLM), and on-device sampling.
Example
from QEfficient import QEFFAutoModelForCausalLM
from transformers import AutoTokenizer

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")
model.compile(num_cores=16)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)
High-Level API
- classmethod QEFFAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, continuous_batching: bool = False, qaic_config: dict | None = None, max_seq_len_cached: int | None = None, *args, **kwargs)[source]
Load a QEfficient Causal Language Model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize a QEfficient Causal Language Model. The interface is similar to transformers.AutoModelForCausalLM.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
continuous_batching (bool, optional) – Whether this model will be used for continuous batching in the future. If not set to True here, the model cannot be exported/compiled for continuous batching later. Default is False.
qaic_config (dict, optional) –
QAIC config dictionary. Supported keys include:
speculative_model_type (str): Specify Speculative Decoding Target Language Models.
include_sampler (bool): Enable/Disable sampling of next tokens.
return_pdfs (bool): Return probability distributions along with sampled next tokens. For a Speculative Decoding Target Language Model, return_pdfs=True always. Otherwise, return_pdfs=True for a Speculative Decoding Draft Language Model and return_pdfs=False for a regular model.
max_top_k_ids (int): Maximum number of top-K tokens (<= vocab size) to consider during sampling. The values provided in the top_ks tensor must be less than this maximum limit.
include_guided_decoding (bool): If True, enables guided token-level filtering during decoding. Only works when include_sampler=True.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance initialized with the pretrained weights.
- Return type:
QEFFAutoModelForCausalLM
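As an illustrative sketch (not from the upstream examples; the model card and config values are arbitrary), continuous batching and on-device sampling can be requested at load time through the documented qaic_config keys:

from QEfficient import QEFFAutoModelForCausalLM

# Illustrative sketch: enable continuous batching and on-device sampling at load time.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "gpt2",                        # any causal LM card; value is illustrative
    continuous_batching=True,      # must be set here to compile for continuous batching later
    qaic_config={
        "include_sampler": True,   # sample next tokens on device
        "return_pdfs": False,      # regular (non-speculative) model
        "max_top_k_ids": 512,      # upper bound for values passed in the top_ks tensor
    },
)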
- QEFFAutoModelForCausalLM.export(export_dir: str | None = None, prefill_only: bool | None = False, prefill_seq_len: int | None = None, **kwargs) str[source]
Export the model to ONNX format using torch.onnx.export.
This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware. It handles KV cache inputs/outputs and sampler-related inputs.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
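A minimal export call, assuming a model loaded as above, might look like the following sketch (the directory name is illustrative):

# Export to the default directory; the returned path points at the generated ONNX graph.
onnx_path = model.export()

# Or pin the export location explicitly (directory name is illustrative).
onnx_path = model.export(export_dir="./onnx_exports/gpt2")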
- QEFFAutoModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, comp_ctx_lengths_prefill: List[int] | None = None, comp_ctx_lengths_decode: List[int] | None = None, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, use_onnx_subfunctions: bool = False, offload_pt_weights: bool | None = True, enable_chunking: bool | None = False, retain_full_kv: bool | None = None, **compiler_options) str[source]
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-compile compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.
prefill_seq_len (int, optional) – Length of the prefill prompt. Default is 32.
ctx_len (int, optional) – Maximum context length the compiled model can remember. Default is 128.
batch_size (int, optional) – Batch size. Default is 1.
full_batch_size (int, optional) – Continuous batching batch size. Required if continuous_batching=True was set during from_pretrained.
kv_cache_batch_size (int, optional) – Batch size for KV cache. If not provided, it defaults to full_batch_size (if continuous batching) or batch_size.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
num_speculative_tokens (int, optional) – Number of speculative tokens for Speculative Decoding Target Language Model. Required if the model is configured as a Target Language Model (is_tlm=True).
prefill_only (bool, optional) – If True, compiles only for the prefill stage. If False, compiles only for the decode stage. If None, compiles for both stages. Default is None.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
**compiler_options (dict) –
Additional compiler options for QAIC or QNN compilers.
For QAIC Compiler: Extra arguments for qaic-compile can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as below:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN Compilation.
qnn_config (str): Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
- Raises:
TypeError – If prefill_only is not a boolean. If full_batch_size is None when continuous_batching is True. If num_speculative_tokens is None when the model is a TLM.
ValueError – If KV caching is requested without continuous batching (full_batch_size). If include_sampler is True and num_speculative_tokens is greater than 0. If num_speculative_tokens is not an integer greater than 1. If prefill_seq_len is less than num_speculative_tokens + 1 for TLM models.
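As a sketch of a typical invocation (values are illustrative and assume a model loaded with continuous_batching=True, as in the earlier sketch), extra keyword arguments are forwarded to qaic-compile as flags:

# Illustrative sketch: compile for continuous batching across 4 devices.
qpc_path = model.compile(
    prefill_seq_len=128,
    ctx_len=2048,
    full_batch_size=8,            # required because continuous_batching=True
    num_devices=4,
    num_cores=16,
    mxfp6_matmul=True,            # MXFP6 weight compression
    mxint8_kv_cache=True,         # MXINT8 KV-cache compression
    aic_enable_depth_first=True,  # forwarded to qaic-compile as a flag
)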
- QEFFAutoModelForCausalLM.generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], device_id: List[int] | None = None, runtime_ai100: bool = True, **kwargs)[source]
Generate output by executing the compiled QPC on Cloud AI 100 hardware.
This method runs sequential execution based on the compiled model’s batch size and the number of prompts. If the number of prompts is not divisible by the batch size, the last incomplete batch will be dropped.
- Parameters:
tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – Tokenizer for the model.
prompts (list of str) – List of prompts to generate output for.
device_id (list of int, optional) – Device IDs for running the QPC. Defaults to [0] if not specified.
runtime_ai100 (bool, optional) – Whether to use AI 100 runtime. Default is True.
**kwargs – Additional keyword arguments. Currently supports: - generation_len (int, optional): The maximum number of tokens to generate.
- Returns:
Output from the AI 100 runtime, containing generated IDs and performance metrics.
- Return type:
CloudAI100ExecInfoNew
- Raises:
TypeError – If the QPC path is not set (i.e., compile was not run).
NotImplementedError – If runtime_ai100 is False.
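Continuing the class-level example (model and tokenizer already created), a hedged sketch of a generate call that caps the output length via generation_len:

# Illustrative sketch: generation_len is passed through **kwargs and caps the generated tokens.
exec_info = model.generate(
    tokenizer=tokenizer,
    prompts=["Hi there!!", "The capital of France is"],
    device_id=[0],
    generation_len=64,
)
print(exec_info.generated_ids)  # CloudAI100ExecInfoNew also carries performance metrics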
QEFFAutoModel
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModel(model: Module, pooling=None, **kwargs)[source]
QEfficient class for general transformer models from the HuggingFace hub (e.g., BERT, Sentence Transformers).
This class provides a unified interface for loading, exporting, compiling, and running various encoder-only transformer models on Cloud AI 100 hardware. It supports pooling for embedding extraction.
Example
from QEfficient import QEFFAutoModel
from transformers import AutoTokenizer

model = QEFFAutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", pooling="mean")
model.compile(num_cores=16)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("My name is", return_tensors="pt")
output = model.generate(inputs)
print(output)  # Output will be a dictionary containing extracted features.
High-Level API
- classmethod QEFFAutoModel.from_pretrained(pretrained_model_name_or_path, pooling=None, *args, **kwargs)[source]
Load a QEfficient transformer model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize a QEfficient transformer model. The interface is similar to transformers.AutoModel.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
pooling (str or Callable, optional) – The pooling method to use. Options include: "mean" (mean pooling), "max" (max pooling), "cls" (CLS token pooling), "avg" (average pooling), a Callable (custom pooling function), or None (no pooling applied). Default is None.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance initialized with the pretrained weights.
- Return type:
QEFFAutoModel
- QEFFAutoModel.export(export_dir: str | None = None, **kwargs) str[source]
Export the model to ONNX format using torch.onnx.export.
This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
- QEFFAutoModel.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, seq_len: int | List[int] = 32, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, use_onnx_subfunctions: bool = False, **compiler_options) str[source]
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-compile compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.
seq_len (int or list of int, optional) – The length(s) of the prompt(s) to compile for. Can be a single integer or a list of integers to create multiple specializations. Default is 32.
batch_size (int, optional) – Batch size. Default is 1.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
**compiler_options (dict) –
Additional compiler options for QAIC or QNN compilers. These are passed directly to the underlying compilation command.
For QAIC Compiler: Extra arguments for qaic-compile can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as below:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN Compilation.
qnn_config (str): Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
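For instance, a single QPC can be specialized for several sequence lengths by passing a list to seq_len (values are illustrative):

# Illustrative sketch: compile the embedding model for two sequence-length specializations.
qpc_path = model.compile(
    seq_len=[32, 128],   # one specialization per listed length
    batch_size=1,
    num_cores=16,
    mxfp6_matmul=False,
)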
- QEFFAutoModel.generate(inputs: Tensor, device_ids: List[int] | None = None, runtime_ai100: bool = True) Tensor | ndarray[source]
Generate output by executing the compiled QPC on Cloud AI 100 hardware or using PyTorch runtime.
This method runs sequential execution based on the compiled model’s batch size and the number of inputs. If the number of inputs is not divisible by the batch size, the last incomplete batch will be dropped.
- Parameters:
inputs (torch.Tensor or np.ndarray) – Input data for the model. For AI 100 runtime, this typically includes input_ids and attention_mask.
device_ids (list of int, optional) – Device IDs for running the QPC. Defaults to [0] if not specified and runtime_ai100 is True.
runtime_ai100 (bool, optional) – Whether to use the AI 100 runtime for inference. If False, the PyTorch runtime will be used. Default is True.
- Returns:
Output from the AI 100 or PyTorch runtime. The type depends on the runtime and model.
- Return type:
torch.Tensor or np.ndarray
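As a sketch (reusing the tokenizer and model from the class-level example), the same inputs can be run through the PyTorch runtime instead of the AI 100 runtime, for example to check parity:

# Illustrative sketch: compare AI 100 output against the PyTorch runtime.
inputs = tokenizer("My name is", return_tensors="pt")
ai100_out = model.generate(inputs)                       # default: runtime_ai100=True
torch_out = model.generate(inputs, runtime_ai100=False)  # PyTorch runtime fallback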
QEffAutoPeftModelForCausalLM
- class QEfficient.peft.auto.QEffAutoPeftModelForCausalLM(model: Module)[source]
QEfficient class for loading and running Causal Language Models with PEFT adapters (currently only LoRA is supported).
This class enables efficient inference and deployment of PEFT-adapted models on Cloud AI 100 hardware. Once exported and compiled for an adapter, the same base model can be reused with other compatible adapters.
Example
from transformers import AutoTokenizer, TextStreamer

from QEfficient import QEffAutoPeftModelForCausalLM

base_model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
streamer = TextStreamer(tokenizer)

m = QEffAutoPeftModelForCausalLM.from_pretrained("predibase/magicoder", "magicoder")
m.export()
m.compile(prefill_seq_len=32, ctx_len=1024)

# Magicoder adapter
m.set_adapter("magicoder")
inputs = tokenizer("def fibonacci", return_tensors="pt")
m.generate(**inputs, streamer=streamer, max_new_tokens=1024)

# Math problems
m.load_adapter("predibase/gsm8k", "gsm8k")
m.set_adapter("gsm8k")
inputs = tokenizer("James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?", return_tensors="pt")
m.generate(**inputs, streamer=streamer, max_new_tokens=1024)
High-Level API
- classmethod QEffAutoPeftModelForCausalLM.from_pretrained(pretrained_name_or_path: str, *args, **kwargs)[source]
Load a QEffAutoPeftModelForCausalLM from a pretrained model and adapter.
- Parameters:
pretrained_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
finite_adapters (bool, optional) – Set True to enable finite adapter mode with QEffAutoLoraModelForCausalLM class.
adapter_name (str, optional) – Name used to identify the loaded adapter.
*args – Additional positional arguments for peft.AutoPeftModelForCausalLM.
**kwargs – Additional keyword arguments for peft.AutoPeftModelForCausalLM.
- Returns:
An instance initialized with the pretrained weights and adapter.
- Return type:
QEffAutoPeftModelForCausalLM
- Raises:
NotImplementedError – If continuous batching is requested (not supported).
TypeError – If adapter name is missing in finite adapter mode.
- QEffAutoPeftModelForCausalLM.export(export_dir: str | None = None, **kwargs) str[source]
Export the model with the active adapter to ONNX format.
This method prepares example inputs and dynamic axes based on the model and adapter configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
- QEffAutoPeftModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, batch_size: int = 1, prefill_seq_len: int, ctx_len: int, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, use_onnx_subfunctions: bool = False, **compiler_options) str[source]
Compile the exported ONNX model for Cloud AI 100 hardware.
This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the QAIC compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model.
compile_dir (str, optional) – Directory to save the generated QPC package.
batch_size (int, optional) – Batch size for compilation. Default is 1.
prefill_seq_len (int) – Length of the prefill prompt.
ctx_len (int) – Maximum context length the compiled model can remember.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation. Default is 16.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
**compiler_options –
Additional compiler options for QAIC.
For QAIC Compiler: Extra arguments for qaic-compile can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as below:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN Compilation.
qnn_config (str): Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
- QEffAutoPeftModelForCausalLM.generate(inputs: Tensor | ndarray | None = None, device_ids: List[int] | None = None, generation_config: GenerationConfig | None = None, stopping_criteria: StoppingCriteria | None = None, streamer: BaseStreamer | None = None, **kwargs) ndarray[source]
Generate tokens from the compiled binary using the active adapter.
This method takes parameters similar to HuggingFace’s model.generate() method.
- Parameters:
inputs (torch.Tensor or np.ndarray, optional) – Input IDs for generation.
device_ids (List[int], optional) – Device IDs for running inference.
generation_config (GenerationConfig, optional) – Generation configuration to merge with model-specific config.
stopping_criteria (StoppingCriteria, optional) – Custom stopping criteria for generation.
streamer (BaseStreamer, optional) – Streamer to receive generated tokens.
**kwargs – Additional parameters for generation_config or to be passed to the model.
- Returns:
Generated token IDs.
- Return type:
np.ndarray
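A hedged sketch of passing a custom GenerationConfig (continuing the class-level example; values are illustrative), which is merged with the adapter's model-specific configuration:

from transformers import GenerationConfig

# Illustrative sketch: cap generation and disable sampling via a GenerationConfig.
gen_cfg = GenerationConfig(max_new_tokens=256, do_sample=False)
output_ids = m.generate(**inputs, generation_config=gen_cfg)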
QEffAutoLoraModelForCausalLM
- class QEfficient.peft.lora.auto.QEffAutoLoraModelForCausalLM(model: Module, continuous_batching: bool = False, **kwargs)[source]
QEfficient class for loading models with multiple LoRA adapters for causal language modeling.
This class enables mixed batch inference with different adapters on Cloud AI 100 hardware. Currently, only Mistral and Llama models are supported. Once exported and compiled, the QPC can perform mixed batch inference using the prompt_to_adapter_mapping argument.
Example
from transformers import AutoTokenizer

from QEfficient.peft.lora import QEffAutoLoraModelForCausalLM

m = QEffAutoLoraModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", num_hidden_layers=1)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
m.load_adapter("predibase/gsm8k", "gsm8k")
m.load_adapter("predibase/magicoder", "magicoder")
m.compile()

prompts = ["code prompt", "math prompt", "generic"]
m.generate(prompts=prompts, tokenizer=tokenizer, prompt_to_adapter_mapping=["magicoder", "gsm8k", "base"])
High-Level API
- classmethod QEffAutoLoraModelForCausalLM.from_pretrained(pretrained_model_name_or_path, continuous_batching: bool = False, qaic_config: dict | None = None, max_seq_len_cached: int | None = None, *args, **kwargs)
Load a QEfficient Causal Language Model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize a QEfficient Causal Language Model. The interface is similar to transformers.AutoModelForCausalLM.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
continuous_batching (bool, optional) – Whether this model will be used for continuous batching in the future. If not set to True here, the model cannot be exported/compiled for continuous batching later. Default is False.
qaic_config (dict, optional) –
QAIC config dictionary. Supported keys include:
speculative_model_type (str): Specify Speculative Decoding Target Language Models.
include_sampler (bool): Enable/Disable sampling of next tokens.
return_pdfs (bool): Return probability distributions along with sampled next tokens. For a Speculative Decoding Target Language Model, return_pdfs=True always. Otherwise, return_pdfs=True for a Speculative Decoding Draft Language Model and return_pdfs=False for a regular model.
max_top_k_ids (int): Maximum number of top-K tokens (<= vocab size) to consider during sampling. The values provided in the top_ks tensor must be less than this maximum limit.
include_guided_decoding (bool): If True, enables guided token-level filtering during decoding. Only works when include_sampler=True.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance initialized with the pretrained weights.
- Return type:
QEFFAutoModelForCausalLM
- QEffAutoLoraModelForCausalLM.export(export_dir: str | None = None, **kwargs) str[source]
Export the model with all loaded adapters to ONNX format using torch.onnx.export.
The exported ONNX graph will support mixed batch inference with multiple adapters.
- Parameters:
export_dir (str, optional) – Directory to save the exported ONNX graph. If not provided, the default export directory is used.
- Returns:
Path to the generated ONNX graph.
- Return type:
str
- Raises:
ValueError – If no adapters are loaded.
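A sketch (reusing the instance m from the class-level example): at least one adapter must be loaded before exporting, otherwise a ValueError is raised:

# Illustrative sketch: load an adapter, then export.
m.load_adapter("predibase/gsm8k", "gsm8k")
onnx_path = m.export()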
- QEffAutoLoraModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, comp_ctx_lengths_prefill: List[int] | None = None, comp_ctx_lengths_decode: List[int] | None = None, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, use_onnx_subfunctions: bool = False, offload_pt_weights: bool | None = True, enable_chunking: bool | None = False, retain_full_kv: bool | None = None, **compiler_options) str
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-compile compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.
prefill_seq_len (int, optional) – Length of the prefill prompt. Default is 32.
ctx_len (int, optional) – Maximum context length the compiled model can remember. Default is 128.
batch_size (int, optional) – Batch size. Default is 1.
full_batch_size (int, optional) – Continuous batching batch size. Required if continuous_batching=True was set during from_pretrained.
kv_cache_batch_size (int, optional) – Batch size for KV cache. If not provided, it defaults to full_batch_size (if continuous batching) or batch_size.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
num_speculative_tokens (int, optional) – Number of speculative tokens for Speculative Decoding Target Language Model. Required if the model is configured as a Target Language Model (is_tlm=True).
prefill_only (bool, optional) – If True, compiles only for the prefill stage. If False, compiles only for the decode stage. If None, compiles for both stages. Default is None.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
**compiler_options (dict) –
Additional compiler options for QAIC or QNN compilers.
For QAIC Compiler: Extra arguments for qaic-compile can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as below:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
For QNN Compiler: The following arguments can be passed:
enable_qnn (bool): Enables QNN Compilation.
qnn_config (str): Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
Path to the compiled QPC package.
- Return type:
str
- Raises:
TypeError – If prefill_only is not a boolean. If full_batch_size is None when continuous_batching is True. If num_speculative_tokens is None when the model is a TLM.
ValueError – If KV caching is requested without continuous batching (full_batch_size). If include_sampler is True and num_speculative_tokens is greater than 0. If num_speculative_tokens is not an integer greater than 1. If prefill_seq_len is less than num_speculative_tokens + 1 for TLM models.
- QEffAutoLoraModelForCausalLM.generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], prompt_to_adapter_mapping: List[str] | None = None, device_id: List[int] | None = None, runtime: str | None = 'AI_100', **kwargs)[source]
Generate output for a batch of prompts using the compiled QPC on Cloud AI 100 hardware.
This method supports mixed batch inference, where each prompt can use a different adapter as specified by prompt_to_adapter_mapping. If the number of prompts is not divisible by the compiled batch size, the last incomplete batch will be dropped.
- Parameters:
tokenizer (PreTrainedTokenizerFast or PreTrainedTokenizer) – Tokenizer used for inference.
prompts (List[str]) – List of prompts to generate outputs for.
prompt_to_adapter_mapping (List[str]) – List of adapter names to use for each prompt. Use “base” for the base model (no adapter).
device_id (List[int], optional) – Device IDs to use for execution. If None, auto-device-picker is used.
runtime (str, optional) – Runtime to use. Only “AI_100” is currently supported. Default is “AI_100”.
**kwargs – Additional generation parameters.
- Returns:
Model outputs for each prompt.
- Raises:
ValueError – If runtime is not “AI_100”.
TypeError – If the model has not been compiled.
RuntimeError – If the number of prompts does not match the number of adapter mappings.
QEFFAutoModelForImageTextToText
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForImageTextToText(model: Module, kv_offload: bool | None = True, continuous_batching: bool = False, qaic_config: dict | None = None, **kwargs)[source]
QEfficient class for multimodal (image-text-to-text) models from the HuggingFace hub.
This class supports both single and dual QPC (Quantized Package Compilation) approaches for efficient deployment on Cloud AI 100 hardware. It is recommended to use the from_pretrained method for initialization.
Example
import requests
from PIL import Image
from transformers import AutoProcessor, TextStreamer

from QEfficient import QEFFAutoModelForImageTextToText

HF_TOKEN = ""  # Your HuggingFace token if needed
model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
query = "Describe this image."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"

# STEP 1: Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=HF_TOKEN)
model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_name,
    token=HF_TOKEN,
    attn_implementation="eager",
    kv_offload=False,  # kv_offload=False for single QPC
)

# STEP 2: Export & Compile
model.compile(
    prefill_seq_len=32,
    ctx_len=512,
    img_size=560,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=False,
)

# STEP 3: Prepare inputs
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": query},
        ],
    }
]
input_text = [processor.apply_chat_template(messages, add_generation_prompt=True)]
inputs = processor(
    text=input_text,
    images=image,
    return_tensors="pt",
    add_special_tokens=False,
    padding="max_length",  # Consider padding strategy if max_length is crucial
    max_length=32,
)

# STEP 4: Run inference
streamer = TextStreamer(processor.tokenizer)
model.generate(inputs=inputs, streamer=streamer, generation_len=512)
High-Level API
- classmethod QEFFAutoModelForImageTextToText.from_pretrained(pretrained_model_name_or_path: str, kv_offload: bool | None = None, continuous_batching: bool = False, qaic_config: dict | None = None, **kwargs)[source]
Load a QEfficient image-text-to-text model from a pretrained HuggingFace model or local path.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
kv_offload (bool, optional) – If True, uses the dual QPC approach (vision encoder KV offloaded). If False, uses the single QPC approach (entire model in one QPC). If None, the default behavior of the internal classes is used (typically dual QPC).
qaic_config (dict, optional) – A dictionary for QAIC-specific configurations.
**kwargs –
Additional arguments passed to HuggingFace’s from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility. continuous_batching is not supported for image-text-to-text models.
- Returns:
An instance initialized with the pretrained weights, wrapped for QEfficient.
- Return type:
QEFFAutoModelForImageTextToText
- Raises:
NotImplementedError – If continuous_batching is provided as True.
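As an illustrative sketch, the same model card can be loaded with the dual QPC approach by setting kv_offload=True:

from QEfficient import QEFFAutoModelForImageTextToText

# Illustrative sketch: dual QPC approach (vision encoder compiled into its own QPC).
model = QEFFAutoModelForImageTextToText.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    kv_offload=True,   # set False (as in the class-level example) for a single QPC
)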
QEFFAutoModelForSpeechSeq2Seq
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq(*args, **kwargs)[source]
QEfficient class for sequence-to-sequence speech-to-text models (e.g., Whisper, Encoder-Decoder speech models).
This class enables efficient export, compilation, and inference of speech models on Cloud AI 100 hardware. It is recommended to use the from_pretrained method for initialization.
Example
from datasets import load_dataset
from transformers import AutoProcessor

from QEfficient import QEFFAutoModelForSpeechSeq2Seq

base_model_name = "openai/whisper-tiny"

## STEP 1 -- load audio sample, using a standard English dataset; specific files can be loaded if longer audio needs to be tested; also load initial processor
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
data = ds[0]["audio"]["array"]
# reshape so shape corresponds to data with batch size 1
data = data.reshape(-1)
sample_rate = ds[0]["audio"]["sampling_rate"]
processor = AutoProcessor.from_pretrained(base_model_name)

## STEP 2 -- init base model
qeff_model = QEFFAutoModelForSpeechSeq2Seq.from_pretrained(base_model_name)

## STEP 3 -- export and compile model
qeff_model.compile()

## STEP 4 -- generate output for loaded input and processor
exec_info = qeff_model.generate(inputs=processor(data, sampling_rate=sample_rate, return_tensors="pt"), generation_len=25)

## STEP 5 (optional) -- use processor to decode output
print(processor.batch_decode(exec_info.generated_ids)[0])
High-Level API
- classmethod QEFFAutoModelForSpeechSeq2Seq.from_pretrained(pretrained_model_name_or_path: str, *args, **kwargs)
Load a QEfficient transformer model from a pretrained HuggingFace model or local path.
This is the recommended way to initialize any QEfficient transformer model. The interface is similar to transformers.AutoModel.from_pretrained.
- Parameters:
pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.
*args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.
**kwargs –
Keyword arguments passed directly to cls._hf_auto_class.from_pretrained.
Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.
- Returns:
An instance of the specific QEFFAutoModel subclass, initialized with the pretrained weights.
- Return type:
QEFFTransformersBase
- QEFFAutoModelForSpeechSeq2Seq.export(export_dir: str | None = None, **kwargs) str[source]
Export the model to ONNX format using torch.onnx.export.
This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.
- Parameters:
export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
- Returns:
Path to the generated ONNX graph file.
- Return type:
str
- QEFFAutoModelForSpeechSeq2Seq.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int | None = 1, encoder_ctx_len: int | None = None, ctx_len: int = 150, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, use_onnx_subfunctions: bool = False, **compiler_options) str[source]
Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.
This method generates a qpc package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-compile compiler can be passed as keyword arguments.
- Parameters:
onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.
compile_dir (str, optional) – Directory to save the generated QPC package.
prefill_seq_len (int, optional) – Prefill sequence length. This parameter is typically not critically used for SpeechSeq2Seq models’ decoder compilation as the first decoder input is seq_len=1. Default is 1.
encoder_ctx_len (int, optional) – Maximum context length for the encoder part of the model. If None, it’s inferred from the model configuration or defaults (e.g., 1500 for Whisper).
ctx_len (int, optional) – Maximum decoder context length. This defines the maximum output sequence length the compiled model can handle. Default is 150.
batch_size (int, optional) – Batch size. Default is 1.
num_devices (int, optional) – Number of devices to compile for. Default is 1.
num_cores (int, optional) – Number of cores to use for compilation.
mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.
mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.
full_batch_size (int, optional) – Not yet supported for this model.
kv_cache_batch_size (int, optional) – Not yet supported for this model.
num_speculative_tokens (int, optional) – Not yet supported for this model.
use_onnx_subfunctions (bool, optional) – Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Default is False.
**compiler_options (dict) –
Additional compiler options for QAIC.
For QAIC Compiler: Extra arguments for qaic-compile can be passed. Some common options include:
mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.
allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.
Params are converted to flags as below:
aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16
- Returns:
Path to the compiled QPC package.
- Return type:
str
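A hedged sketch of a compile call for the Whisper example above, making the encoder and decoder context lengths explicit (values are illustrative; 1500 matches the Whisper default noted above):

# Illustrative sketch: explicit encoder/decoder context lengths for a Whisper-style model.
qpc_path = qeff_model.compile(
    encoder_ctx_len=1500,  # encoder context length
    ctx_len=150,           # maximum decoder/output sequence length
    num_cores=16,
    num_devices=1,
)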
- QEFFAutoModelForSpeechSeq2Seq.generate(inputs: Tensor, generation_len: int, streamer: TextStreamer | None = None, device_ids: List[int] | None = None) Tensor | ndarray[source]
Generate output until the <|endoftext|> token or generation_len is reached, by executing the compiled QPC on Cloud AI 100 hardware.
This method performs sequential execution based on the compiled model’s batch size and the provided audio tensors. It manages the iterative decoding process and KV cache.
- Parameters:
inputs (Dict[str, np.ndarray]) – Model inputs for inference, typically a dictionary containing: - input_features (np.ndarray): Preprocessed audio features. - decoder_input_ids (np.ndarray): Initial decoder input IDs (e.g., start token). - decoder_position_ids (np.ndarray): Initial decoder position IDs. These should be prepared to match the compiled model’s expectations.
generation_len (int) – Maximum number of tokens to generate. The generation stops if this limit is reached or the model generates an end-of-sequence token.
streamer (TextStreamer, optional) – Streamer to receive generated tokens in real-time. Default is None.
device_ids (List[int], optional) – Device IDs for running the QPC. Defaults to [0] if not specified.
- Returns:
Output from the AI 100 runtime, including generated IDs and performance metrics.
- Return type:
CloudAI100ExecInfoNew
- Raises:
TypeError – If the QPC path is not set (i.e., compile was not run).
QEFFAutoModelForCTC
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCTC(model: Module, **kwargs)[source]
The QEFFAutoModelForCTC class is designed for transformer models with a Connectionist Temporal Classification (CTC) speech-to-text head, including Wav2Vec2 and other encoder-only speech models optimized for alignment-free transcription. Although it is possible to initialize the class directly, we highly recommend using the from_pretrained method for initialization.
Example
import torchaudio
from transformers import AutoProcessor

from QEfficient import QEFFAutoModelForCTC

model_name = "facebook/wav2vec2-base-960h"  # illustrative CTC model card

# Initialize the model using from_pretrained, similar to transformers.AutoModelForCTC.
model = QEFFAutoModelForCTC.from_pretrained(model_name)

# Now you can directly compile the model for Cloud AI 100
model.compile(num_cores=16)  # Considering you have a Cloud AI 100 SKU

# Prepare input
processor = AutoProcessor.from_pretrained(model_name)
input_audio, sample_rate = [...]  # audio data loaded via some external audio package, such as librosa or soundfile

# Resample the input_audio if necessary
if input_audio.shape[0] > 1:
    input_audio = input_audio.mean(dim=0)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    input_audio = resampler(input_audio)

# You can now execute the model
out = model.generate(processor, inputs=input_audio)
High-Level API
- classmethod QEFFAutoModelForCTC.from_pretrained(pretrained_model_name_or_path, pooling=None, *args, **kwargs)[source]
This method serves as the easiest entry point into using QEfficient. The interface is designed to be similar to transformers.AutoModelForCTC. Once the model is initialized, you can use other methods such as export, compile, and generate on the same object.
- Parameters:
pretrained_model_name_or_path (str) – The name or path of the pre-trained model.
- QEFFAutoModelForCTC.export(export_dir: str | None = None, **kwargs) str[source]
Exports the model to ONNX format using torch.onnx.export.
Optional Args:
- export_dir (str, optional):
The directory path to store the ONNX graph.
- use_onnx_subfunctions (bool, optional):
Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Defaults to False.
- Returns:
str: Path of the generated ONNX graph.
- QEFFAutoModelForCTC.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, seq_len: int | List[int] = 480000, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, use_onnx_subfunctions: bool = False, **compiler_options) str[source]
This method compiles the exported ONNX model using the Cloud AI 100 Platform SDK compiler binary found at /opt/qti-aic/exec/qaic-compile and generates a qpc package. If the model has not been exported yet, this method will handle the export process. You can pass any other arguments that qaic-compile takes as extra kwargs.
Optional Args:
- onnx_path (str, optional):
Path to pre-exported onnx model.
- compile_dir (str, optional):
Path for saving the qpc generated.
- seq_len (Union[int, List[int]]):
The length of the input should be less than seq_len. Defaults to 480000.
- batch_size (int, optional):
Batch size.
Defaults to 1.
- num_devices (int):
Number of devices the model needs to be compiled for. Defaults to 1.
- num_cores (int):
Number of cores used to compile the model.
- mxfp6_matmul (bool, optional):
Whether to use mxfp6 compression for weights. Defaults to False.
- use_onnx_subfunctions (bool, optional):
Whether to enable ONNX subfunctions during export. Exporting the PyTorch model to ONNX with modules as subfunctions helps reduce export/compile time. Defaults to False.
- compiler_options (dict, optional):
Additional compiler options.
- For QAIC Compiler: Extra arguments for qaic-compile can be passed.
- aic_enable_depth_first (bool, optional):
Enables DFS with default memory size.
Defaults to False.
- allow_mxint8_mdp_io (bool, optional):
Allows MXINT8 compression of MDP IO traffic.
Defaults to False.
Params are converted to flags as below:
aic_hw_version=ai100 -> -aic-hw-version=ai100
aic_hw_version=ai200 -> -aic-hw-version=ai200
- For QNN Compiler: The following arguments can be passed.
- enable_qnn (bool):
Enables QNN Compilation.
- qnn_config (str):
Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.
- Returns:
str: Path of the compiled qpc package.
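As a sketch, seq_len for CTC models is expressed in audio samples; the default of 480000 corresponds to 30 seconds of 16 kHz audio (values below are illustrative and reuse the model from the class-level example):

# Illustrative sketch: compile for up to 30 s of 16 kHz audio (480000 samples, the default).
qpc_path = model.compile(seq_len=480000, num_cores=16)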
- QEFFAutoModelForCTC.generate(processor, inputs: Tensor, device_ids: List[int] | None = None, runtime_ai100: bool = True) Tensor | ndarray[source]
This method generates output by executing the PyTorch runtime or the compiled qpc on Cloud AI 100 hardware cards.
Mandatory Args:
- inputs (Union[torch.Tensor, np.ndarray]):
Inputs to run the execution.
- processor (AutoProcessor):
The Processor to use for encoding the waveform.
Optional Args:
- device_id (List[int]):
IDs of devices for running the qpc; pass [0] for a normal model, or [0, 1, 2, 3] for a tensor-sliced model.
- runtime_ai100 (bool, optional):
AI_100 and PyTorch runtimes are supported as of now. Defaults to True for the AI_100 runtime.
- Returns:
dict: Output from the AI_100 or PyTorch runtime.