QEfficient Auto Classes

QEFFAutoModelForCausalLM

class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCausalLM(model: Module, continuous_batching: bool = False, qaic_config: dict | None = None, **kwargs)[source]

QEfficient class for Causal Language Models from the HuggingFace hub (e.g., GPT-2, Llama).

This class provides a unified interface for loading, exporting, compiling, and generating text with causal language models on Cloud AI 100 hardware. It supports features like continuous batching, speculative decoding (TLM), and on-device sampling.

Example

from QEfficient import QEFFAutoModelForCausalLM
from transformers import AutoTokenizer

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")
model.compile(num_cores=16)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)

High-Level API

classmethod QEFFAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, continuous_batching: bool = False, qaic_config: dict | None = None, *args, **kwargs)[source]

Load a QEfficient Causal Language Model from a pretrained HuggingFace model or local path.

This is the recommended way to initialize a QEfficient Causal Language Model. The interface is similar to transformers.AutoModelForCausalLM.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.

Parameters:
  • pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.

  • continuous_batching (bool, optional) – Whether this model will be used for continuous batching in the future. If not set to True here, the model cannot be exported/compiled for continuous batching later. Default is False.

  • qaic_config (dict, optional) –

    QAIC config dictionary; a usage sketch follows this method's entry. Supported keys include:

    • speculative_model_type (str): Specify Speculative Decoding Target Language Models.

    • include_sampler (bool): Enable/Disable sampling of next tokens.

    • return_pdfs (bool): Return probability distributions along with sampled next tokens. Always True for a Speculative Decoding Target Language Model; otherwise, True for a Speculative Decoding Draft Language Model and False for a regular model.

    • max_top_k_ids (int): Maximum number of top K tokens (<= vocab size) to consider during sampling. The values provided in top_ks tensor must be less than this maximum limit.

  • *args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.

  • **kwargs

    Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.

    Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.

Returns:

An instance initialized with the pretrained weights.

Return type:

QEFFAutoModelForCausalLM
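
For instance, a minimal usage sketch, assuming a hypothetical model card and placeholder sampling limits, showing continuous batching together with on-device sampling enabled through qaic_config:

from QEfficient import QEFFAutoModelForCausalLM

# Illustrative model card and config values; adjust for your deployment.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",       # any causal LM from the HF hub or a local path
    continuous_batching=True,        # must be set here to compile with full_batch_size later
    qaic_config={
        "include_sampler": True,     # enable on-device sampling of next tokens
        "max_top_k_ids": 512,        # upper bound for values passed in the top_ks tensor
    },
)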

QEFFAutoModelForCausalLM.export(export_dir: str | None = None) str[source]

Export the model to ONNX format using torch.onnx.export.

This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware. It handles KV cache inputs/outputs and sampler-related inputs.

Parameters:

export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.

Returns:

Path to the generated ONNX graph file.

Return type:

str
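
A short usage sketch, reusing the model instance from the class example above (the directory path is illustrative):

# Export to a custom directory; the returned string is the path to the ONNX graph.
onnx_path = model.export(export_dir="./qeff_onnx")
print(onnx_path)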

QEFFAutoModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, **compiler_options) str[source]

Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.

This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.

Parameters:
  • onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.

  • compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.

  • prefill_seq_len (int, optional) – Length of the prefill prompt. Default is 32.

  • ctx_len (int, optional) – Maximum context length the compiled model can remember. Default is 128.

  • batch_size (int, optional) – Batch size. Default is 1.

  • full_batch_size (int, optional) – Continuous batching batch size. Required if continuous_batching=True was set during from_pretrained.

  • kv_cache_batch_size (int, optional) – Batch size for KV cache. If not provided, it defaults to full_batch_size (if continuous batching) or batch_size.

  • num_devices (int, optional) – Number of devices to compile for. Default is 1.

  • num_cores (int, optional) – Number of cores to use for compilation.

  • mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.

  • mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.

  • num_speculative_tokens (int, optional) – Number of speculative tokens for Speculative Decoding Target Language Model. Required if the model is configured as a Target Language Model (is_tlm=True).

  • prefill_only (bool, optional) – If True, compiles only for the prefill stage. If False, compiles only for the decode stage. If None, compiles for both stages. Default is None.

  • **compiler_options (dict) –

    Additional compiler options for QAIC or QNN compilers.

    For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:

    • mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.

    • aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.

    • allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.

    Parameters are converted to compiler flags as shown below:

    • aic_num_cores=16 -> -aic-num-cores=16

    • convert_to_fp16=True -> -convert-to-fp16

    For QNN Compiler: The following arguments can be passed:

    • enable_qnn (bool): Enables QNN compilation.

    • qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.

Returns:

Path to the compiled QPC package.

Return type:

str

Raises:
  • TypeError – If prefill_only is not a boolean. If full_batch_size is None when continuous_batching is True. If num_speculative_tokens is None when the model is a TLM.

  • ValueError – If KV caching is requested without continuous batching (full_batch_size). If include_sampler is True and num_speculative_tokens is greater than 0. If num_speculative_tokens is not an integer greater than 1. If prefill_seq_len is less than num_speculative_tokens + 1 for TLM models.
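
As an illustration, a hedged sketch of compiling the model loaded above for continuous batching (all values are placeholders; full_batch_size is only accepted when continuous_batching=True was passed to from_pretrained):

# Compile across 4 devices with continuous batching and compressed weights/KV cache.
qpc_path = model.compile(
    prefill_seq_len=128,
    ctx_len=2048,
    full_batch_size=8,             # continuous batching batch size
    num_devices=4,
    num_cores=16,
    mxfp6_matmul=True,
    mxint8_kv_cache=True,
    aic_enable_depth_first=True,   # extra qaic-exec option, emitted as -aic-enable-depth-first
)
print(qpc_path)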

QEFFAutoModelForCausalLM.generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], device_id: List[int] | None = None, runtime_ai100: bool = True, **kwargs)[source]

Generate output by executing the compiled QPC on Cloud AI 100 hardware.

This method runs sequential execution based on the compiled model’s batch size and the number of prompts. If the number of prompts is not divisible by the batch size, the last incomplete batch will be dropped.

Parameters:
  • tokenizer (PreTrainedTokenizer or PreTrainedTokenizerFast) – Tokenizer for the model.

  • prompts (list of str) – List of prompts to generate output for.

  • device_id (list of int, optional) – Device IDs for running the QPC. Defaults to [0] if not specified.

  • runtime_ai100 (bool, optional) – Whether to use AI 100 runtime. Default is True.

  • **kwargs – Additional keyword arguments. Currently supports:

    • generation_len (int, optional): The maximum number of tokens to generate.

Returns:

Output from the AI 100 runtime, containing generated IDs and performance metrics.

Return type:

CloudAI100ExecInfoNew

Raises:
  • TypeError – If the QPC path is not set (i.e., compile was not run).

  • NotImplementedError – If runtime_ai100 is False.
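
For example, assuming the model has been compiled and a matching tokenizer is available (the model card is the illustrative one used above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # match your model card
exec_info = model.generate(
    tokenizer=tokenizer,
    prompts=["Hello there", "Explain continuous batching in one sentence."],
    device_id=[0],          # run on the first Cloud AI 100 device
    generation_len=64,      # cap the number of generated tokens
)
print(exec_info.generated_ids)  # CloudAI100ExecInfoNew also carries performance metrics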


QEFFAutoModel

class QEfficient.transformers.models.modeling_auto.QEFFAutoModel(model: Module, pooling=None, **kwargs)[source]

QEfficient class for general transformer models from the HuggingFace hub (e.g., BERT, Sentence Transformers).

This class provides a unified interface for loading, exporting, compiling, and running various encoder-only transformer models on Cloud AI 100 hardware. It supports pooling for embedding extraction.

Example

from QEfficient import QEFFAutoModel
from transformers import AutoTokenizer

model = QEFFAutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", pooling="mean")
model.compile(num_cores=16)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("My name is", return_tensors="pt")
output = model.generate(inputs)
print(output) # Output will be a dictionary containing extracted features.

High-Level API

classmethod QEFFAutoModel.from_pretrained(pretrained_model_name_or_path, pooling=None, *args, **kwargs)[source]

Load a QEfficient transformer model from a pretrained HuggingFace model or local path.

This is the recommended way to initialize a QEfficient transformer model. The interface is similar to transformers.AutoModel.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.

Parameters:
  • pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.

  • pooling (str or Callable, optional) –

    The pooling method to use. Default is None (no pooling applied). A custom callable is sketched below. Options include:

    • "mean": Mean pooling

    • "max": Max pooling

    • "cls": CLS token pooling

    • "avg": Average pooling

    • Callable: A custom pooling function

    • None: No pooling applied

  • *args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.

  • **kwargs

    Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.

    Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.

Returns:

An instance initialized with the pretrained weights.

Return type:

QEFFAutoModel
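
A hedged sketch of passing a custom pooling callable; the callable signature shown here (hidden states plus attention mask) is an assumption for illustration, and a string such as "mean" can be passed instead:

import torch
from QEfficient import QEFFAutoModel

def mean_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Hypothetical signature: average token embeddings while ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_states)
    return (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Either a string ("mean", "max", "cls", "avg") or a callable may be supplied as pooling.
model = QEFFAutoModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    pooling=mean_pool,
)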

QEFFAutoModel.export(export_dir: str | None = None) str[source]

Export the model to ONNX format using torch.onnx.export.

This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.

Parameters:

export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.

Returns:

Path to the generated ONNX graph file.

Return type:

str

QEFFAutoModel.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, seq_len: int | List[int] = 32, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, **compiler_options) str[source]

Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.

This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.

Parameters:
  • onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.

  • compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.

  • seq_len (int or list of int, optional) – The length(s) of the prompt(s) to compile for. Can be a single integer or a list of integers to create multiple specializations (see the sketch below). Default is 32.

  • batch_size (int, optional) – Batch size. Default is 1.

  • num_devices (int, optional) – Number of devices to compile for. Default is 1.

  • num_cores (int, optional) – Number of cores to use for compilation.

  • mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.

  • **compiler_options (dict) –

    Additional compiler options for QAIC or QNN compilers. These are passed directly to the underlying compilation command.

    For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:

    • mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.

    • aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.

    • allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.

    Parameters are converted to compiler flags as shown below:

    • aic_num_cores=16 -> -aic-num-cores=16

    • convert_to_fp16=True -> -convert-to-fp16

    For QNN Compiler: The following arguments can be passed:

    • enable_qnn (bool): Enables QNN compilation.

    • qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.

Returns:

Path to the compiled QPC package.

Return type:

str
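
For example, compiling several sequence-length specializations into one QPC (values are illustrative):

# One specialization is generated per entry in seq_len.
qpc_path = model.compile(
    seq_len=[32, 64, 128],
    batch_size=1,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=False,
)
print(qpc_path)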

QEFFAutoModel.generate(inputs: Tensor, device_ids: List[int] | None = None, runtime_ai100: bool = True) Tensor | ndarray[source]

Generate output by executing the compiled QPC on Cloud AI 100 hardware or using PyTorch runtime.

This method runs sequential execution based on the compiled model’s batch size and the provided inputs. If the number of input samples is not divisible by the batch size, the last incomplete batch will be dropped.

Parameters:
  • inputs (torch.Tensor or np.ndarray) – Input data for the model. For AI 100 runtime, this typically includes input_ids and attention_mask.

  • device_ids (list of int, optional) – Device IDs for running the QPC. Defaults to [0] if not specified and runtime_ai100 is True.

  • runtime_ai100 (bool, optional) – Whether to use the AI 100 runtime for inference. If False, the PyTorch runtime will be used. Default is True.

Returns:

Output from the AI 100 or PyTorch runtime. The type depends on the runtime and model.

Return type:

torch.Tensor or np.ndarray
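
A short sketch contrasting the two runtimes; when runtime_ai100=False the PyTorch runtime is used, which can be handy for a quick functional check before compiling:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("My name is", return_tensors="pt")

# PyTorch runtime: no Cloud AI 100 device or compiled QPC required.
torch_out = model.generate(inputs, runtime_ai100=False)

# AI 100 runtime (default): executes the compiled QPC on device 0.
aic_out = model.generate(inputs, device_ids=[0])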


QEffAutoPeftModelForCausalLM

class QEfficient.peft.auto.QEffAutoPeftModelForCausalLM(model: Module)[source]

QEfficient class for loading and running Causal Language Models with PEFT adapters (currently only LoRA is supported).

This class enables efficient inference and deployment of PEFT-adapted models on Cloud AI 100 hardware. Once exported and compiled for an adapter, the same base model can be reused with other compatible adapters.

Example

from transformers import AutoTokenizer, TextStreamer
from QEfficient import QEffAutoPeftModelForCausalLM

base_model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
streamer = TextStreamer(tokenizer)

m = QEffAutoPeftModelForCausalLM.from_pretrained("predibase/magicoder", "magicoder")
m.export()
m.compile(prefill_seq_len=32, ctx_len=1024)

# Magicoder adapter
m.set_adapter("magicoder")
inputs = tokenizer("def fibonacci", return_tensors="pt")
m.generate(**inputs, streamer=streamer, max_new_tokens=1024)

# Math problems
m.load_adapter("predibase/gsm8k", "gsm8k")
m.set_adapter("gsm8k")
inputs = tokenizer("James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week?",return_tensors="pt")
m.generate(**inputs, streamer=streamer, max_new_tokens=1024)

High-Level API

classmethod QEffAutoPeftModelForCausalLM.from_pretrained(pretrained_name_or_path: str, *args, **kwargs)[source]

Load a QEffAutoPeftModelForCausalLM from a pretrained model and adapter.

Parameters:
  • pretrained_name_or_path (str) – Model card name from HuggingFace or local path to model directory.

  • finite_adapters (bool, optional) – Set to True to enable finite adapter mode, which uses the QEffAutoLoraModelForCausalLM class (see the sketch below).

  • adapter_name (str, optional) – Name used to identify the loaded adapter.

  • *args – Additional positional arguments for peft.AutoPeftModelForCausalLM.

  • **kwargs – Additional keyword arguments for peft.AutoPeftModelForCausalLM.

Returns:

An instance initialized with the pretrained weights and adapter.

Return type:

QEffAutoPeftModelForCausalLM

Raises:
  • NotImplementedError – If continuous batching is requested (not supported).

  • TypeError – If adapter name is missing in finite adapter mode.

QEffAutoPeftModelForCausalLM.export(export_dir: str | None = None) str[source]

Export the model with the active adapter to ONNX format.

This method prepares example inputs and dynamic axes based on the model and adapter configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.

Parameters:

export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.

Returns:

Path to the generated ONNX graph file.

Return type:

str

QEffAutoPeftModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, batch_size: int = 1, prefill_seq_len: int, ctx_len: int, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, **compiler_options) str[source]

Compile the exported ONNX model for Cloud AI 100 hardware.

This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the QAIC compiler can be passed as keyword arguments.

Parameters:
  • onnx_path (str, optional) – Path to a pre-exported ONNX model.

  • compile_dir (str, optional) – Directory to save the generated QPC package.

  • batch_size (int, optional) – Batch size for compilation. Default is 1.

  • prefill_seq_len (int) – Length of the prefill prompt.

  • ctx_len (int) – Maximum context length the compiled model can remember.

  • num_devices (int, optional) – Number of devices to compile for. Default is 1.

  • num_cores (int, optional) – Number of cores to use for compilation. Default is 16.

  • mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.

  • mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.

  • **compiler_options

    Additional compiler options for QAIC.

    For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:

    • mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.

    • aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.

    • allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.

    Parameters are converted to compiler flags as shown below:

    • aic_num_cores=16 -> -aic-num-cores=16

    • convert_to_fp16=True -> -convert-to-fp16

    For QNN Compiler: The following arguments can be passed:

    • enable_qnn (bool): Enables QNN compilation.

    • qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.

Returns:

Path to the compiled QPC package.

Return type:

str

QEffAutoPeftModelForCausalLM.generate(inputs: Tensor | ndarray | None = None, device_ids: List[int] | None = None, generation_config: GenerationConfig | None = None, stopping_criteria: StoppingCriteria | None = None, streamer: BaseStreamer | None = None, **kwargs) ndarray[source]

Generate tokens from the compiled binary using the active adapter.

This method takes similar parameters as HuggingFace’s model.generate() method.

Parameters:
  • inputs (torch.Tensor or np.ndarray, optional) – Input IDs for generation.

  • device_ids (List[int], optional) – Device IDs for running inference.

  • generation_config (GenerationConfig, optional) – Generation configuration to merge with model-specific config.

  • stopping_criteria (StoppingCriteria, optional) – Custom stopping criteria for generation.

  • streamer (BaseStreamer, optional) – Streamer to receive generated tokens.

  • **kwargs – Additional parameters for generation_config or to be passed to the model.

Returns:

Generated token IDs.

Return type:

np.ndarray
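
For instance, a sketch of steering generation with a standard HuggingFace GenerationConfig (prompt and limits are illustrative, and m is the compiled model from the example above):

from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
inputs = tokenizer("def quicksort(arr):", return_tensors="pt")

gen_config = GenerationConfig(max_new_tokens=256, do_sample=False)
output_ids = m.generate(**inputs, generation_config=gen_config)
print(tokenizer.batch_decode(output_ids))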


QEffAutoLoraModelForCausalLM

class QEfficient.peft.lora.auto.QEffAutoLoraModelForCausalLM(model: Module, continuous_batching: bool = False, **kwargs)[source]

QEfficient class for loading models with multiple LoRA adapters for causal language modeling.

This class enables mixed batch inference with different adapters on Cloud AI 100 hardware. Currently, only Mistral and Llama models are supported. Once exported and compiled, the QPC can perform mixed batch inference using the prompt_to_adapter_mapping argument.

Example

from QEfficient.peft.lora import QEffAutoLoraModelForCausalLM
from transformers import AutoTokenizer

m = QEffAutoLoraModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", num_hidden_layers=1)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
m.load_adapter("predibase/gsm8k", "gsm8k")
m.load_adapter("predibase/magicoder", "magicoder")
m.compile()

prompts = ["code prompt", "math prompt", "generic"]
m.generate(prompts=prompts, tokenizer=tokenizer, prompt_to_adapter_mapping=["magicoder", "gsm8k", "base"])

High-Level API

classmethod QEffAutoLoraModelForCausalLM.from_pretrained(pretrained_model_name_or_path, continuous_batching: bool = False, qaic_config: dict | None = None, *args, **kwargs)

Load a QEfficient Causal Language Model from a pretrained HuggingFace model or local path.

This is the recommended way to initialize a QEfficient Causal Language Model. The interface is similar to transformers.AutoModelForCausalLM.from_pretrained. Once initialized, you can use methods such as export, compile, and generate.

Parameters:
  • pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.

  • continuous_batching (bool, optional) – Whether this model will be used for continuous batching in the future. If not set to True here, the model cannot be exported/compiled for continuous batching later. Default is False.

  • qaic_config (dict, optional) –

    QAIC config dictionary. Supported keys include:

    • speculative_model_type (str): Specify Speculative Decoding Target Language Models.

    • include_sampler (bool): Enable/Disable sampling of next tokens.

    • return_pdfs (bool): Return probability distributions along with sampled next tokens. Always True for a Speculative Decoding Target Language Model; otherwise, True for a Speculative Decoding Draft Language Model and False for a regular model.

    • max_top_k_ids (int): Maximum number of top K tokens (<= vocab size) to consider during sampling. The values provided in top_ks tensor must be less than this maximum limit.

  • *args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.

  • **kwargs

    Additional keyword arguments passed directly to cls._hf_auto_class.from_pretrained.

    Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.

Returns:

An instance initialized with the pretrained weights.

Return type:

QEFFAutoModelForCausalLM

QEffAutoLoraModelForCausalLM.export(export_dir: str | None = None) str[source]

Export the model with all loaded adapters to ONNX format using torch.onnx.export.

The exported ONNX graph will support mixed batch inference with multiple adapters.

Parameters:

export_dir (str, optional) – Directory to save the exported ONNX graph. If not provided, the default export directory is used.

Returns:

Path to the generated ONNX graph.

Return type:

str

Raises:

ValueError – If no adapters are loaded.

QEffAutoLoraModelForCausalLM.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, **compiler_options) str

Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.

This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.

Parameters:
  • onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.

  • compile_dir (str, optional) – Directory to save the generated QPC package. If not provided, a default directory is used.

  • prefill_seq_len (int, optional) – Length of the prefill prompt. Default is 32.

  • ctx_len (int, optional) – Maximum context length the compiled model can remember. Default is 128.

  • batch_size (int, optional) – Batch size. Default is 1.

  • full_batch_size (int, optional) – Continuous batching batch size. Required if continuous_batching=True was set during from_pretrained.

  • kv_cache_batch_size (int, optional) – Batch size for KV cache. If not provided, it defaults to full_batch_size (if continuous batching) or batch_size.

  • num_devices (int, optional) – Number of devices to compile for. Default is 1.

  • num_cores (int, optional) – Number of cores to use for compilation.

  • mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.

  • mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.

  • num_speculative_tokens (int, optional) – Number of speculative tokens for Speculative Decoding Target Language Model. Required if the model is configured as a Target Language Model (is_tlm=True).

  • prefill_only (bool, optional) – If True, compiles only for the prefill stage. If False, compiles only for the decode stage. If None, compiles for both stages. Default is None.

  • **compiler_options (dict) –

    Additional compiler options for QAIC or QNN compilers.

    For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:

    • mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.

    • aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.

    • allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.

    Parameters are converted to compiler flags as shown below:

    • aic_num_cores=16 -> -aic-num-cores=16

    • convert_to_fp16=True -> -convert-to-fp16

    For QNN Compiler: The following arguments can be passed:

    • enable_qnn (bool): Enables QNN compilation.

    • qnn_config (str): Path to a QNN config parameters file. Any extra parameters for QNN compilation can be passed via this file.

Returns:

Path to the compiled QPC package.

Return type:

str

Raises:
  • TypeError – If prefill_only is not a boolean. If full_batch_size is None when continuous_batching is True. If num_speculative_tokens is None when the model is a TLM.

  • ValueError – If KV caching is requested without continuous batching (full_batch_size). If include_sampler is True and num_speculative_tokens is greater than 0. If num_speculative_tokens is not an integer greater than 1. If prefill_seq_len is less than num_speculative_tokens + 1 for TLM models.

QEffAutoLoraModelForCausalLM.generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], prompt_to_adapter_mapping: List[str] | None = None, device_id: List[int] | None = None, runtime: str | None = 'AI_100', **kwargs)[source]

Generate output for a batch of prompts using the compiled QPC on Cloud AI 100 hardware.

This method supports mixed batch inference, where each prompt can use a different adapter as specified by prompt_to_adapter_mapping. If the number of prompts is not divisible by the compiled batch size, the last incomplete batch will be dropped.

Parameters:
  • tokenizer (PreTrainedTokenizerFast or PreTrainedTokenizer) – Tokenizer used for inference.

  • prompts (List[str]) – List of prompts to generate outputs for.

  • prompt_to_adapter_mapping (List[str], optional) – List of adapter names to use for each prompt. Use “base” for the base model (no adapter).

  • device_id (List[int], optional) – Device IDs to use for execution. If None, auto-device-picker is used.

  • runtime (str, optional) – Runtime to use. Only “AI_100” is currently supported. Default is “AI_100”.

  • **kwargs – Additional generation parameters.

Returns:

Model outputs for each prompt.

Raises:
  • ValueError – If runtime is not “AI_100”.

  • TypeError – If the model has not been compiled.

  • RuntimeError – If the number of prompts does not match the number of adapter mappings.


QEFFAutoModelForImageTextToText

class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForImageTextToText(model: Module, kv_offload: bool | None = True, **kwargs)[source]

QEfficient class for multimodal (image-text-to-text) models from the HuggingFace hub.

This class supports both single and dual QPC approaches for efficient deployment on Cloud AI 100 hardware. It is recommended to use the from_pretrained method for initialization.

Example

import requests
from PIL import Image
from transformers import AutoProcessor, TextStreamer
from QEfficient import QEFFAutoModelForImageTextToText

HF_TOKEN = "" # Your HuggingFace token if needed
model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
query = "Describe this image."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"

# STEP 1: Load processor and model
processor = AutoProcessor.from_pretrained(model_name, token=HF_TOKEN)
model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_name, token=HF_TOKEN, attn_implementation="eager", kv_offload=False # kv_offload=False for single QPC
)

# STEP 2: Export & Compile
model.compile(
    prefill_seq_len=32,
    ctx_len=512,
    img_size=560,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=False,
)

# STEP 3: Prepare inputs
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": query},
        ],
    }
]
input_text = [processor.apply_chat_template(messages, add_generation_prompt=True)]
inputs = processor(
    text=input_text,
    images=image,
    return_tensors="pt",
    add_special_tokens=False,
    padding="max_length", # Consider padding strategy if max_length is crucial
    max_length=32,
)

# STEP 4: Run inference
streamer = TextStreamer(processor.tokenizer)
model.generate(inputs=inputs, streamer=streamer, generation_len=512)

High-Level API

classmethod QEFFAutoModelForImageTextToText.from_pretrained(pretrained_model_name_or_path: str, kv_offload: bool | None = None, **kwargs)[source]

Load a QEfficient image-text-to-text model from a pretrained HuggingFace model or local path.

Parameters:
  • pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.

  • kv_offload (bool, optional) – If True, uses the dual QPC approach (vision encoder KV offloaded). If False, uses the single QPC approach (entire model in one QPC). If None, the default behavior of the internal classes is used (typically dual QPC).

  • **kwargs

    Additional arguments passed to HuggingFace’s from_pretrained.

    Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility. continuous_batching is not supported for image-text-to-text models.

Returns:

An instance initialized with the pretrained weights, wrapped for QEfficient.

Return type:

QEFFAutoModelForImageTextToText

Raises:

NotImplementedError – If continuous_batching is provided as True.
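
In contrast to the single QPC example above (kv_offload=False), a hedged sketch of loading and compiling with the dual QPC approach:

from QEfficient import QEFFAutoModelForImageTextToText

# kv_offload=True selects the dual QPC approach described above.
model = QEFFAutoModelForImageTextToText.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    kv_offload=True,
)
model.compile(prefill_seq_len=32, ctx_len=512, img_size=560, num_cores=16, num_devices=1)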


QEFFAutoModelForSpeechSeq2Seq

class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq(*args, **kwargs)[source]

QEfficient class for sequence-to-sequence speech-to-text models (e.g., Whisper, Encoder-Decoder speech models).

This class enables efficient export, compilation, and inference of speech models on Cloud AI 100 hardware. It is recommended to use the from_pretrained method for initialization.

Example

from datasets import load_dataset
from transformers import AutoProcessor
from QEfficient import QEFFAutoModelForSpeechSeq2Seq

base_model_name = "openai/whisper-tiny"
## STEP 1 -- load an audio sample from a standard English dataset (specific files can be loaded if longer audio needs testing) and load the processor
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
data = ds[0]["audio"]["array"]
# reshape so the shape corresponds to data with batch size 1
data = data.reshape(-1)
sample_rate = ds[0]["audio"]["sampling_rate"]
processor = AutoProcessor.from_pretrained(base_model_name)

## STEP 2 -- init base model
qeff_model = QEFFAutoModelForSpeechSeq2Seq.from_pretrained(base_model_name)

## STEP 3 -- export and compile model
qeff_model.compile()

## STEP 4 -- generate output for loaded input and processor
exec_info = qeff_model.generate(inputs=processor(data, sampling_rate=sample_rate, return_tensors="pt"), generation_len=25)

## STEP 5 (optional) -- use processor to decode output
print(processor.batch_decode(exec_info.generated_ids)[0])

High-Level API

classmethod QEFFAutoModelForSpeechSeq2Seq.from_pretrained(pretrained_model_name_or_path: str, *args, **kwargs)

Load a QEfficient transformer model from a pretrained HuggingFace model or local path.

This is the recommended way to initialize any QEfficient transformer model. The interface is similar to transformers.AutoModel.from_pretrained.

Parameters:
  • pretrained_model_name_or_path (str) – Model card name from HuggingFace or local path to model directory.

  • *args – Positional arguments passed directly to cls._hf_auto_class.from_pretrained.

  • **kwargs

    Keyword arguments passed directly to cls._hf_auto_class.from_pretrained.

    Note: attn_implementation and low_cpu_mem_usage are automatically set to “eager” and False respectively to ensure compatibility.

Returns:

An instance of the specific QEFFAutoModel subclass, initialized with the pretrained weights.

Return type:

QEFFTransformersBase

QEFFAutoModelForSpeechSeq2Seq.export(export_dir: str | None = None) str[source]

Export the model to ONNX format using torch.onnx.export.

This method prepares example inputs and dynamic axes based on the model configuration, then exports the model to an ONNX graph suitable for compilation and deployment on Cloud AI 100 hardware.

Parameters:

export_dir (str, optional) – Directory path where the exported ONNX graph will be saved. If not provided, the default export directory is used.

Returns:

Path to the generated ONNX graph file.

Return type:

str

QEFFAutoModelForSpeechSeq2Seq.compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int | None = 1, encoder_ctx_len: int | None = None, ctx_len: int = 150, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, **compiler_options) str[source]

Compile the exported ONNX model using the Cloud AI 100 Platform SDK compiler.

This method generates a QPC package. If the model has not been exported yet, this method will handle the export process. Additional arguments for the qaic-exec compiler can be passed as keyword arguments.

Parameters:
  • onnx_path (str, optional) – Path to a pre-exported ONNX model. If not provided, the model will be exported first.

  • compile_dir (str, optional) – Directory to save the generated QPC package.

  • prefill_seq_len (int, optional) – Prefill sequence length. This parameter is typically not critical for SpeechSeq2Seq decoder compilation, as the decoder consumes one token at a time (seq_len=1). Default is 1.

  • encoder_ctx_len (int, optional) – Maximum context length for the encoder part of the model. If None, it’s inferred from the model configuration or defaults (e.g., 1500 for Whisper).

  • ctx_len (int, optional) – Maximum decoder context length. This defines the maximum output sequence length the compiled model can handle. Default is 150.

  • batch_size (int, optional) – Batch size. Default is 1.

  • num_devices (int, optional) – Number of devices to compile for. Default is 1.

  • num_cores (int, optional) – Number of cores to use for compilation.

  • mxfp6_matmul (bool, optional) – Use MXFP6 compression for weights. Default is False.

  • mxint8_kv_cache (bool, optional) – Use MXINT8 compression for KV cache. Default is False.

  • full_batch_size (int, optional) – Not yet supported for this model.

  • kv_cache_batch_size (int, optional) – Not yet supported for this model.

  • num_speculative_tokens (int, optional) – Not yet supported for this model.

  • **compiler_options (dict) –

    Additional compiler options for QAIC.

    For QAIC Compiler: Extra arguments for qaic-exec can be passed. Some common options include:

    • mos (int, optional): Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.

    • aic_enable_depth_first (bool, optional): Enables DFS with default memory size. Defaults to False.

    • allow_mxint8_mdp_io (bool, optional): Allows MXINT8 compression of MDP IO traffic. Defaults to False.

    Parameters are converted to compiler flags as shown below:

    • aic_num_cores=16 -> -aic-num-cores=16

    • convert_to_fp16=True -> -convert-to-fp16

Returns:

Path to the compiled QPC package.

Return type:

str
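
For example, compiling with explicit context lengths (values are illustrative; when encoder_ctx_len is omitted it is inferred from the model configuration, e.g. 1500 for Whisper):

# Allow up to 448 generated tokens and pin the encoder context explicitly.
qpc_path = qeff_model.compile(
    ctx_len=448,            # maximum decoder (output) length
    encoder_ctx_len=1500,   # Whisper encoder context length
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=False,
)
print(qpc_path)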

QEFFAutoModelForSpeechSeq2Seq.generate(inputs: Tensor, generation_len: int, streamer: TextStreamer | None = None, device_ids: List[int] | None = None) Tensor | ndarray[source]

Generate output until <|endoftext|> token or generation_len is reached, by executing the compiled QPC on Cloud AI 100 hardware.

This method performs sequential execution based on the compiled model’s batch size and the provided audio tensors. It manages the iterative decoding process and KV cache.

Parameters:
  • inputs (Dict[str, np.ndarray]) –

    Model inputs for inference, typically a dictionary containing:

    • input_features (np.ndarray): Preprocessed audio features.

    • decoder_input_ids (np.ndarray): Initial decoder input IDs (e.g., start token).

    • decoder_position_ids (np.ndarray): Initial decoder position IDs.

    These should be prepared to match the compiled model’s expectations.

  • generation_len (int) – Maximum number of tokens to generate. The generation stops if this limit is reached or the model generates an end-of-sequence token.

  • streamer (TextStreamer, optional) – Streamer to receive generated tokens in real-time. Default is None.

  • device_ids (List[int], optional) – Device IDs for running the QPC. Defaults to [0] if not specified.

Returns:

Output from the AI 100 runtime, including generated IDs and performance metrics.

Return type:

CloudAI100ExecInfoNew

Raises:

TypeError – If the QPC path is not set (i.e., compile was not run).