Python API

This page give you an overview about the all the APIs that you might need to integrate the QEfficient into your python applications.

High Level API

`QEFFAutoModelForCausalLM`

class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCausalLM(model: Module, continuous_batching: bool = False, qaic_config: dict | None = None, **kwargs)[source]

The QEFF class is designed for manipulating any causal language model from the HuggingFace hub. Although it is possible to initialize the class directly, we highly recommend using the from_pretrained method for initialization.

Mandatory Args:

model (nn.Module):: PyTorch model
continuous_batching (bool):: Weather this model will be used for continuous batching in future. If this is not set True here, the model can not be exported/compiled for continuous batching later.
qaic_config (Optional[dict]):: Qaic config with supported keys of speculative_model_type to specify speculative decoding TLM models.

from QEfficient import QEFFAutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "gpt2"
model = QEFFAutoModelForCausalLM.from_pretrained(model_name, num_hidden_layers=2)
model.compile(prefill_seq_len=128, ctx_len=256, num_cores=16, num_devices=1)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)

export(export_dir: str | None = None) → str[source]

Exports the model to ONNX format using torch.onnx.export.

Optional Args:

export_dir (str, optional):: The directory path to store ONNX-graph.

Returns:

str:: Path of the generated ONNX graph.

compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int = 32, ctx_len: int = 128, batch_size: int = 1, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, prefill_only: bool | None = None, **compiler_options) → str[source]

This method compiles the exported ONNX model using the Cloud AI 100 Platform SDK compiler binary found at /opt/qti-aic/exec/qaic-exec and generates a qpc package. If the model has not been exported yet, this method will handle the export process. You can pass any other arguments that the qaic-exec takes as extra kwargs.

Optional Args:

onnx_path (str, optional):

Path to pre-exported onnx model.

compile_dir (str, optional):

Path for saving the qpc generated.

num_cores (int):

Number of cores used to compile the model.

num_devices (int):

Number of devices the model needs to be compiled for. Defaults to 1.

batch_size (int, optional):

Batch size. Defaults to 1.

prefill_seq_len (int, optional):

The length of the Prefill prompt should be less that prefill_seq_len. Defaults to 32.

ctx_len (int, optional):

Maximum ctx that the compiled model can remember. Defaults to 128.

full_batch_size (int, optional):

Continuous batching batch size.

mxfp6_matmul (bool, optional):

Whether to use mxfp6 compression for weights. Defaults to False.

mxint8_kv_cache (bool, optional):

Whether to use mxint8 compression for KV cache. Defaults to False.

num_speculative_tokens (int, optional):

Number of speculative tokens to take as input for Speculative Decoding Target Language Model.

prefill_only (bool):

if True compile for prefill only and if False compile for decode only. Defaults to None, which compiles for both prefill and ``decode.

compiler_options (dict, optional):

Additional compiler options. Defaults to None. For QAIC Compiler: Extra arguments for qaic-exec can be passed.

mos (int, optional):

Effort level to reduce on-chip memory. Defaults to -1, meaning no effort. Defaults to -1.

aic_enable_depth_first (bool, optional):

Enables DFS with default memory size. Defaults to False.

allow_mxint8_mdp_io (bool, optional):

Allows MXINT8 compression of MDP IO traffic. Defaults to False.

Params are converted to flags as below: - aic_num_cores=16 -> -aic-num-cores=16 - convert_to_fp16=True -> -convert-to-fp16

For QNN Compiler: Following arguments can be passed.

enable_qnn (bool):: Enables QNN Compilation.
qnn_config (str):: Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.

Returns:

str:: Path of the compiled qpc package.

generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], device_id: List[int] | None = None, runtime_ai100: bool = True, **kwargs)[source]

This method generates output until eos or generation_len by executing the compiled qpc on Cloud AI 100 Hardware cards. This is a sequential execution based on the batch_size of the compiled model and the number of prompts passed. If the number of prompts cannot be divided by the batch_size, the last unfulfilled batch will be dropped.

Mandatory Args:

tokenizer (Union[PreTrainedTokenizerFast, PreTrainedTokenizer]):: Pass tokenizer of the model.
prompts (List[str]):: List of prompts to run the execution.

optional Args:

device_id (List[int]):: Ids of devices for running the qpc pass as [0] in case of normal model / [0, 1, 2, 3] in case of tensor slicing model
runtime_ai100 (bool, optional):: AI_100 and PyTorch runtime is supported as of now. Defaults to True for AI_100 runtime.

`QEFFAutoModel`

class QEfficient.transformers.models.modeling_auto.QEFFAutoModel(model: Module, **kwargs)[source]

The QEFFAutoModel class is designed for manipulating any transformer model from the HuggingFace hub. Although it is possible to initialize the class directly, we highly recommend using the from_pretrained method for initialization.

Mandatory Args:

model (nn.Module):: PyTorch model

from QEfficient import QEFFAutoModel
from transformers import AutoTokenizer

# Initialize the model using from_pretrained similar to transformers.AutoModel.
model = QEFFAutoModel.from_pretrained("model_name")

# Now you can directly compile the model for Cloud AI 100
model.compile(num_cores=16)  # Considering you have a Cloud AI 100 SKU

#prepare input
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("My name is", return_tensors="pt")

# You can now execute the model
model.generate(inputs)

export(export_dir: str | None = None) → str[source]

Exports the model to ONNX format using torch.onnx.export.

Optional Args:

export_dir (str, optional):: The directory path to store ONNX-graph.

Returns:

str:: Path of the generated ONNX graph.

compile(onnx_path: str | None = None, compile_dir: str | None = None, *, seq_len: int = 32, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, **compiler_options) → str[source]

This method compiles the exported ONNX model using the Cloud AI 100 Platform SDK compiler binary found at /opt/qti-aic/exec/qaic-exec and generates a qpc package. If the model has not been exported yet, this method will handle the export process. You can pass any other arguments that the qaic-exec takes as extra kwargs.

Optional Args:

onnx_path (str, optional):

Path to pre-exported onnx model.

compile_dir (str, optional):

Path for saving the qpc generated.

seq_len (int, optional):

The length of the prompt should be less that seq_len. Defaults to 32.

batch_size (int, optional):

Batch size. Defaults to 1.

num_devices (int):

Number of devices the model needs to be compiled for. Defaults to 1.

num_cores (int):

Number of cores used to compile the model.

mxfp6_matmul (bool, optional):

Whether to use mxfp6 compression for weights. Defaults to False.

compiler_options (dict, optional):

Additional compiler options. For QAIC Compiler: Extra arguments for qaic-exec can be passed.

aic_enable_depth_first (bool, optional):

Enables DFS with default memory size. Defaults to False.

allow_mxint8_mdp_io (bool, optional):

Allows MXINT8 compression of MDP IO traffic. Defaults to False.

For QNN Compiler: Following arguments can be passed.

enable_qnn (bool):: Enables QNN Compilation.
qnn_config (str):: Path of QNN Config parameters file. Any extra parameters for QNN compilation can be passed via this file.

Returns:

str:: Path of the compiled qpc package.

generate(inputs: Tensor, device_ids: List[int] | None = None, runtime_ai100: bool = True) → Tensor | ndarray[source]

This method generates output by executing PyTorch runtime or the compiled qpc on Cloud AI 100 Hardware cards. Mandatory Args:

inputs (Union[torch.Tensor, np.ndarray]):

inputs to run the execution.

optional Args:

device_id (List[int]):: Ids of devices for running the qpc pass as [0] in case of normal model / [0, 1, 2, 3] in case of tensor slicing model
runtime_ai100 (bool, optional):: AI_100 and PyTorch runtime is supported as of now. Defaults to True for AI_100 runtime.

Returns:

dict:: Output from the AI_100 or PyTorch runtime.

cloud_ai_100_feature_generate(inputs: Tensor, device_ids: List[int] = [0]) → ndarray[source]

Generates features with list of prompts using AI 100 runtime.

Mandatory Args:

inputs (Union[torch.Tensor, np.ndarray]):: inputs to run the execution.

Optional Args:

device_ids (List[int], optional): A list of device IDs to use for the session. Defaults to [0].

Returns:

np.ndarray: A list of dictionaries containing the generated output features.

pytorch_feature_generate(model, inputs: Tensor | ndarray) → List[Tensor][source]

Generates features from a list of text prompts using a PyTorch model.

Mandatory Args:

model:: The transformed PyTorch model used for generating features.
inputs (Union[torch.Tensor, np.ndarray]):: inputs to run the execution.

Returns:

torch.Tensor: A list of output features generated by the model for each prompt.

`QEffAutoPeftModelForCausalLM`

class QEfficient.peft.auto.QEffAutoPeftModelForCausalLM(model: Module)[source]

QEff class for loading models with PEFT adapters (Only LoRA is supported currently). Once exported and compiled for an adapter, the same can be utilized for another adapter with same base model and adapter config.

Args:

model (nn.Module):: PyTorch model

from QEfficient import QEffAutoPeftModelForCausalLM

m = QEffAutoPeftModelForCausalLM.from_pretrained("predibase/magicoder", "magicoder")
m.export()
m.compile(prefill_seq_len=32, ctx_len=1024)

inputs = ...  # A coding prompt
outputs = m.generate(**inputs)

inputs = ...  # A math prompt
m.load_adapter("predibase/gsm8k", "gsm8k")
m.set_adapter("gsm8k")
outputs = m.generate(**inputs)

load_adapter(model_id: str, adapter_name: str)[source]

Loads a new adapter from huggingface hub or local path

Args:

model_id (str):: Adapter model ID from huggingface hub or local path
adapter_name (str):: Adapter name to be used to set this adapter as current

property active_adapter: str: Currently active adapter to be used for inference

set_adapter(adapter_name: str)[source]: Sets active adapter from one of the loaded adapters

classmethod from_pretrained(pretrained_name_or_path: str, *args, **kwargs)[source]

Args:

pretrained_name_or_path (str):: Model card name from huggingface or local path to model directory.
finite_adapters (bool):: set True to enable finite adapter mode with QEffAutoLoraModelForCausalLM class. Please refer to QEffAutoLoraModelForCausalLM for API specification.
adapter_name (str):: Name used to identify loaded adapter.
args, kwargs:: Additional arguments to pass to peft.AutoPeftModelForCausalLM.

export(export_dir: str | None = None) → str[source]

Exports the model to ONNX format using torch.onnx.export.

Args:

export_dir (str):: Specify the export directory. The export_dir will be suffixed with a hash corresponding to current model.

Returns:

Path:: Path of the generated ONNX file.

compile(onnx_path: str | None = None, compile_dir: str | None = None, *, batch_size: int = 1, prefill_seq_len: int, ctx_len: int, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, **compiler_options) → str[source]

Compile the exported onnx to run on AI100. If the model has not been exported yet, this method will handle the export process.

Args:

onnx_path (str):: Onnx file to compile
compile_dir (str):: Directory path to compile the qpc. A suffix is added to the directory path to avoid reusing same qpc for different parameters.
num_devices (int):: Number of devices to compile for. Defaults to 1.
num_cores (int):: Number of cores to utilize in each device Defaults to 16.
mxfp6_matmul (bool):: Use MXFP6 to compress weights for MatMul nodes to run faster on device. Defaults to False.
mxint8_kv_cache (bool):: Use MXINT8 to compress KV-cache on device to access and update KV-cache faster. Defaults to False.
compiler_options:: Pass any compiler option as input.

Following flag can be passed in compiler_options to enable QNN Compilation path.

enable_qnn (bool):: Enables QNN Compilation. Defaults to False. if not passed.
qnn_config (str):: Path of QNN Config parameters file. Defaults to None. if not passed

for QAIC compilation path, any flag that is supported by qaic-exec can be passed. Params are converted to flags as below:

aic_num_cores=16 -> -aic-num-cores=16
convert_to_fp16=True -> -convert-to-fp16

QEFFAutoModelForCausalLM Args:

full_batch_size (int):: Full batch size to allocate cache lines.
batch_size (int):: Batch size to compile for. Defaults to 1.
prefill_seq_len (int):: Prefill sequence length to compile for. Prompt will be chunked according to this length.
ctx_len (int):: Context length to allocate space for KV-cache tensors.

Returns:

str:: Path of the compiled qpc package.

Generate tokens from compiled binary. This method takes same parameters as HuggingFace transformers model.generate() method.

Args:

inputs:: input_ids
generation_config:: Merge this generation_config with model-specific for the current generation.
stopping_criteria:: Pass custom stopping_criteria to stop at a specific point in generation.
streamer:: Streamer to put the generated tokens into.
kwargs:: Additional parameters for generation_config or to be passed to the model while generating.

`QEffAutoLoraModelForCausalLM`

class QEfficient.peft.lora.auto.QEffAutoLoraModelForCausalLM(model: Module, continuous_batching: bool = False, **kwargs)[source]

QEff class for loading models with multiple LoRA adapters. Currently only Mistral and Llama model are supported. Once exported and compiled, the qpc can perform mixed batch inference with provided prompt_to_adapter_mapping.

Args:

model (nn.Module):: PyTorch model
continuous_batching (bool):: Weather this model will be used for continuous batching in future. If this is not set True here, the model can not be exported/compiled for continuous batching later.

from QEfficient.peft.lora import QEffAutoLoraModelForCausalLM

m = QEffAutoPeftModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
m.load_adapter("predibase/gsm8k", "gsm8k")
m.load_adapter("predibase/magicoder", "magicoder")
m.compile(num_cores=16, device_group=[0])

prompts=["code prompt", "math prompt", "generic"]
m.generate(prompts, device_group=[0], prompt_to_adapter_mapping=["magicoder","gsm8k_id","base"])

download_adapter(adapter_model_id: str, adapter_name: str, adapter_weight: dict | None = None, adapter_config: PeftConfig | None = None)[source]

Loads a new adapter from huggingface hub or local path into CPU cache

Mandatory Args:

adapter_model_id (str):: Adapter model ID from huggingface hub or local path
adapter_name (str):: Adapter name to be used to downloaded this adapter

Optional Args:

adapter_weight (dict):: Adapter weight tensors in dictionary format
adapter_config (PeftConfig):: Adapter config in the format of PeftConfig

load_adapter(adapter_model_id: str, adapter_name: str, adapter_weight: dict | None = None, adapter_config: PeftConfig | None = None)[source]

Load adapter into CPU cache and set it as active

Mandatory Args:

adapter_model_id (str):: Adapter model ID from huggingface hub or local path
adapter_name (str):: Adapter name to be used to load this adapter

Optional Args:

adapter_weight (dict):: Adapter weight tensors in dictionary format
adapter_config (PeftConfig):: Adapter config in the format of PeftConfig

unload_adapter(adapter_name: str)[source]

Deactivate adpater and remove it from CPU cache

Mandatory Args:

adapter_name (str):: Adapter name to be unloaded

export(export_dir: str | None = None) → str[source]

Exports the model to ONNX format using torch.onnx.export. We currently don’t support exporting non-transformed models. Please refer to the convert_to_cloud_bertstyle function in the Low-Level API for a legacy function that supports this.”

Optional Args:

does not any arguments.

Returns:

str:: Path of the generated ONNX graph.

generate(tokenizer: PreTrainedTokenizerFast | PreTrainedTokenizer, prompts: List[str], prompt_to_adapter_mapping: List[str] | None = None, device_id: List[int] | None = None, runtime: str | None = 'AI_100', **kwargs)[source]

This method generates output until eos or generation_len by executing the compiled qpc on Cloud AI 100 Hardware cards. This is a sequential execution based on the batch_size of the compiled model and the number of prompts passed. If the number of prompts cannot be divided by the batch_size, the last unfulfilled batch will be dropped.

Mandatory Args:

tokenizer (PreTrainedTokenizerFast or PreTrainedTokenizer):: The tokenizer used in the inference
prompts (List[str]):: List of prompts to run the execution.
prompt_to_adapter_mapping (List[str]):: The sequence of the adapter names will be matched with sequence of prompts and corresponding adapters will be used for the prompts.”base” for base model (no adapter).

optional Args:

device_id (List[int]):: Device IDs to be used for execution. If len(device_id) > 1, it enables multiple card setup. If None, auto-device-picker will be used. Defaults to None.
runtime (str, optional):: Only AI_100 runtime is supported as of now; ONNXRT and PyTorch coming soon. Defaults to “AI_100”.

`QEFFAutoModelForImageTextToText`

class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForImageTextToText(model: Module, kv_offload: bool | None = True, **kwargs)[source]

The QEFFAutoModelForImageTextToText class is used to work with multimodal language models from the HuggingFace hub. While you can initialize the class directly, it’s best to use the from_pretrained method for this purpose. This class supports both single and dual QPC approaches. Attributes:

_hf_auto_class (class): The Hugging Face AutoModel class for ImageTextToText models.

Mandatory Args:

pretrained_model_name_or_path (str):: Model card name from HuggingFace or local path to model directory.

Optional Args:

kv_offload (bool):: Flag to toggle between single and dual QPC approaches. If set to False, the Single QPC approach will be used; otherwise, the dual QPC approach will be applied. Defaults to True.

import requests
from PIL import Image
from transformers import AutoProcessor, TextStreamer

from QEfficient import QEFFAutoModelForImageTextToText

# Add HuggingFace Token to access the model
HF_TOKEN = ""
model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
query = "Describe this image."
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"

## STEP - 1 Load the Processor and Model, and kv_offload=True/False for dual and single qpc
processor = AutoProcessor.from_pretrained(model_name, token=HF_TOKEN)
model = QEFFAutoModelForImageTextToText.from_pretrained(model_name, token=HF_TOKEN, attn_implementation="eager", kv_offload=False)

## STEP - 2 Export & Compile the Model
model.compile(
    prefill_seq_len=32,
    ctx_len=512,
    img_size=560,
    num_cores=16,
    num_devices=1,
    mxfp6_matmul=False,
)

## STEP - 3 Load and process the inputs for Inference
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": query},
        ],
    }
]
input_text = [processor.apply_chat_template(messages, add_generation_prompt=True)]
inputs = processor(
    text=input_text,
    images=image,
    return_tensors="pt",
    add_special_tokens=False,
    padding="max_length",
    max_length=32,
)

## STEP - 4 Run Inference on the compiled model
streamer = TextStreamer(processor.tokenizer)
model.generate(inputs=inputs, streamer=streamer, generation_len=512)

`QEFFAutoModelForSpeechSeq2Seq`

class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq(*args, **kwargs)[source]

The QEFFAutoModelForSpeechSeq2Seq class is designed for transformers models with a sequence-to-sequence speech-to-text modeling head, including Whisper and other Encoder-Decoder speech models. Although it is possible to initialize the class directly, we highly recommend using the from_pretrained method for initialization.

Mandatory Args:

model (nn.Module):: PyTorch model

from QEfficient import QEFFAutoModelForSpeechSeq2Seq
from processors import AutoProcessor

# Initialize the model using from_pretrained similar to transformers.AutoModelForSpeechSeq2Seq.
model = QEFFAutoModelForSpeechSeq2Seq.from_pretrained("model_name")

# Now you can directly compile the model for Cloud AI 100
model.compile(num_cores=16, device_group=[0])  # Considering you have a Cloud AI 100 SKU

#prepare inputs
processor = AutoProcessor.from_pretrained(model_name)
input_audio, sample_rate = [...] # audio data loaded in via some external audio package, such as librosa or soundfile
input_features = (
    processor(data, sampling_rate=sample_rate, return_tensors="pt").input_features.numpy().astype(np.float32)
)
decoder_input_ids = (
    torch.ones((batch_size, 1), dtype=torch.int64) * model.model.config.decoder_start_token_id
).numpy()
decoder_position_ids = torch.arange(1, dtype=torch.int64).view(1, 1).repeat(batch_size, 1).numpy()
inputs = dict(
    input_features=input_features,
    decoder_input_ids=decoder_input_ids,
    decoder_position_ids=decoder_position_ids,
)

# You can now execute the model
model.generate(inputs, generation_len=150)

export(export_dir: str | None = None) → str[source]

Exports the model to ONNX format using torch.onnx.export.

Optional Args: :export_dir (str, optional): The directory path to store ONNX-graph.

Returns:

str:: Path of the generated ONNX graph.

compile(onnx_path: str | None = None, compile_dir: str | None = None, *, prefill_seq_len: int | None = 1, encoder_ctx_len: int | None = None, ctx_len: int = 150, full_batch_size: int | None = None, kv_cache_batch_size: int | None = None, batch_size: int = 1, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, num_speculative_tokens: int | None = None, **compiler_options) → str[source]

This method compiles the exported ONNX model using the Cloud AI 100 Platform SDK compiler binary found at /opt/qti-aic/exec/qaic-exec and generates a qpc package. If the model has not been exported yet, this method will handle the export process. You can pass any other arguments that the qaic-exec takes as extra kwargs.

Optional Args:

onnx_path (str, optional):: Path to pre-exported onnx model.
compile_dir (str, optional):: Path for saving the qpc generated.
encoder_ctx_len (int, optional):: The maximum length of context for encoder, based on the AutoProcessor output. Defaults to checking config, if None in config then 1500
ctx_len (int, optional):: The maximum length of context to keep for decoding. Defaults to 150.
batch_size (int, optional):: Batch size. Defaults to 1.
num_devices (int):: Number of devices the model needs to be compiled for. Defaults to 1.
num_cores (int):: Number of cores used to compile the model.
mxfp6_matmul (bool, optional):: Whether to use mxfp6 compression for weights. Defaults to False.
aic_enable_depth_first (bool, optional):: Enables DFS with default memory size. Defaults to False.

Other args are not yet implemented for AutoModelForSpeechSeq2Seq

Returns:

str:: Path of the compiled qpc package.

generate(inputs: Tensor, generation_len: int, streamer: TextStreamer | None = None, device_ids: List[int] | None = None) → Tensor | ndarray[source]

This method generates output until endoftranscript or generation_len by executing the compiled qpc on Cloud AI 100 Hardware cards. This is a sequential execution based on the batch_size of the compiled model and the number of audio tensor passed.

Mandatory Args:

processor:: autoprocessor to process inputs and decode logits
inputs (torch.Tensor):: inputs to run the execution.
generation_len (int):: length upto which to generate
device_id (List[int]):: Ids of devices for running the qpc pass as [0] in case of normal model / [0, 1, 2, 3] in case of tensor slicing model

Returns:

dict:: Output from the AI_100 or PyTorch runtime.

`export`

QEfficient.exporter.export_hf_to_cloud_ai_100.qualcomm_efficient_converter(model_name: str, model_kv: QEFFBaseModel | None = None, local_model_dir: str | None = None, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, cache_dir: str | None = None, onnx_dir_path: str | None = None, hf_token: str | None = None, seq_length: int = 32, kv: bool = True, form_factor: str = 'cloud', full_batch_size: int | None = None) → Tuple[str, str][source]

This method is an alias for QEfficient.export.

Usage 1: This method can be used by passing model_name and local_model_dir or cache_dir if required for loading from local dir. This will download the model from HuggingFace and export it to ONNX graph and returns generated files path check below.

Usage 2: You can pass model_name and model_kv as an object of QEfficient.QEFFAutoModelForCausalLM, In this case will directly export the model_kv.model to ONNX

We will be deprecating this function and it will be replaced by QEFFAutoModelForCausalLM.export.

Mandatory Args:

model_name (str):: The name of the model to be used.

Optional Args:

model_kv (torch.nn.Module):: Transformed KV torch model to be used. Defaults to None.
local_model_dir (str):: Path of local model. Defaults to None.
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):: Model tokenizer. Defaults to None.
cache_dir (str):: Path of the cache directory. Defaults to None.
onnx_dir_path (str):: Path to store ONNX file. Defaults to None.
hf_token (str):: HuggingFace token to access gated models. Defaults is None.
seq_len (int):: The length of the sequence. Defaults is 128.
kv (bool):: If false, it will export to Bert style. Defaults is True.
form_factor (str):: Form factor of the hardware, currently only cloud is accepted. Defaults to cloud.

Returns:

Tuple[str, str]:: Path to Base ONNX dir and path to generated ONNX model

import QEfficient
base_path, onnx_model_path = QEfficient.export(model_name="gpt2")

Deprecated since version This: function will be deprecated in version 1.19, please use QEFFAutoModelForCausalLM.export instead

`compile`

QEfficient.compile.compile_helper.compile(onnx_path: str, qpc_path: str, num_cores: int, device_group: List[int] | None = None, aic_enable_depth_first: bool = False, mos: int = -1, batch_size: int = 1, prompt_len: int = 32, ctx_len: int = 128, mxfp6: bool = True, mxint8: bool = False, custom_io_file_path: str | None = None, full_batch_size: int | None = None, allow_mxint8_mdp_io: bool | None = False, enable_qnn: bool | None = False, qnn_config: str | None = None, **kwargs) → str[source]

Compiles the given ONNX model using Cloud AI 100 platform SDK compiler and saves the compiled qpc package at qpc_path. Generates tensor-slicing configuration if multiple devices are passed in device_group.

This function will be deprecated soon and will be replaced by QEFFAutoModelForCausalLM.compile.

Mandatory Args:

onnx_path (str):: Generated ONNX Model Path.
qpc_path (str):: Path for saving compiled qpc binaries.
num_cores (int):: Number of cores to compile the model on.

Optional Args:

device_group (List[int]):: Used for finding the number of devices to compile for. Defaults to None.
aic_enable_depth_first (bool):: Enables DFS with default memory size. Defaults to False.
mos (int):: Effort level to reduce the on-chip memory. Defaults to -1.
batch_size (int):: Batch size to compile the model for. Defaults to 1.
full_batch_size (int):: Set full batch size to enable continuous batching mode. Default to None
prompt_len (int):: Prompt length for the model to compile. Defaults to 32
ctx_len (int):: Maximum context length to compile the model. Defaults to 128
mxfp6 (bool):: Enable compilation for MXFP6 precision. Defaults to True.
mxint8 (bool):: Compress Present/Past KV to MXINT8 using CustomIO config. Defaults to False.
custom_io_file_path (str):: Path to customIO file (formatted as a string). Defaults to None.
allow_mxint8_mdp_io (bool):: Allows MXINT8 compression of MDP IO traffic Defaults to False.
enable_qnn (bool):: Enables QNN Compilation. Defaults to False.
qnn_config (str):: Path of QNN Config parameters file. Defaults to None.

Returns:

str:: Path to compiled qpc package.

import QEfficient
base_path, onnx_model_path = QEfficient.export(model_name="gpt2")
qpc_path = QEfficient.compile(onnx_path=onnx_model_path, qpc_path=os.path.join(base_path, "qpc"), num_cores=14, device_group=[0])

Deprecated since version This: function will be deprecated in version 1.19, please use QEFFAutoModelForCausalLM.compile instead

`Execute`

class QEfficient.generation.text_generation_inference.CloudAI100ExecInfo(batch_size: int, generated_texts: List[str] | List[List[str]], generated_ids: List[ndarray] | ndarray, perf_metrics: PerfMetrics)[source]

Bases: object

Holds all the information about Cloud AI 100 execution

Args:

batch_size (int):: Batch size of the QPC compilation.
generated_texts (Union[List[List[str]], List[str]]):: Generated text(s).
generated_ids (Union[List[np.ndarray], np.ndarray]):: Generated IDs.
perf_metrics (PerfMetrics):: Performance metrics.

class QEfficient.generation.text_generation_inference.CloudAI100ExecInfoNew(batch_size: int, generated_ids: List[numpy.ndarray] | numpy.ndarray, perf_metrics: QEfficient.generation.text_generation_inference.PerfMetrics)[source]: Bases: object

class QEfficient.generation.text_generation_inference.PerfMetrics(prefill_time: float, decode_perf: float, total_perf: float, total_time: float)[source]

Bases: object

Holds all performance metrics

Args:

prefill_time (float):: Time for prefilling.
decode_perf (float):: Decoding performance.
total_perf (float):: Total performance.
total_time (float):: Total time.

QEfficient.generation.text_generation_inference.calculate_latency(total_decoded_tokens, loop_start, start, end, decode_pause_time=0)[source]

Method will calculate the latency metrics using the time loops and based on the total decoded token count.

Args:

total_decoded_tokens (int):: Number of tokens generated in decode stage.
loop_start (float):: Start time of decode loop.
start (float):: Start time.
end (float):: End time.
decode_pause_time (float):: Total decode pause time in continuous batching decode stage.

Returns: :tuple: prefill time, decode performance, total performance, total time

QEfficient.generation.text_generation_inference.cloud_ai_100_exec_kv(tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, qpc_path: str, prompt: str | None = None, prompts_txt_file_path: str | None = None, device_id: List[int] | None = None, generation_len: int | None = None, enable_debug_logs: bool = False, stream: bool = True, write_io_dir: str | None = None, automation=False, prompt_to_lora_id_mapping: List[int] | None = None, is_tlm: bool = False)[source]

This method generates output until eos or generation_len by executing the compiled qpc on Cloud AI 100 Hardware cards. This is a sequential execution based on the batch_size of the compiled model and the number of prompts passed. If the number of prompts cannot be divided by the batch_size, the last unfulfilled batch will be dropped.

Mandatory Args:

tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):: Model tokenizer.
qpc_path (str):: Path to the saved generated binary file after compilation.

Optional Args:

prompt (str):: Sample prompt for the model text generation. Defaults to None.
prompts_txt_file_path (str):: Path of the prompt text file. Defaults to None.
generation_len (int):: Maximum context length for the model during compilation. Defaults to None.
device_id (List[int]):: Device IDs to be used for execution. If len(device_id) > 1, it enables multiple card setup. If None, auto-device-picker will be used. Defaults to None.
enable_debug_logs (bool):: If True, it enables debugging logs. Defaults to False.
stream (bool):: If True, enable streamer, which returns tokens one by one as the model generates them. Defaults to True.
Write_io_dir (str):: Path to write the input and output files. Defaults to None.
automation (bool):: If true, it prints input, output, and performance stats. Defaults to False.
prompt_to_lora_id_mapping (List[int]):: Mapping to associate prompts with their respective LoRA adapter.

Returns:

CloudAI100ExecInfo:: Object holding execution output and performance details.

import transformers
import QEfficient
base_path, onnx_model_path = QEfficient.export(model_name="gpt2")
qpc_path = QEfficient.compile(onnx_path=onnx_model_path, qpc_path=os.path.join(base_path, "qpc"), num_cores=14, device_group=[0])
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
exec_info = QEfficient.cloud_ai_100_exec_kv(tokenizer=tokenizer, qpc_path=qpc_path, prompt="Hi there!!", device_id=[0])

QEfficient.generation.text_generation_inference.fix_prompt_to_lora_id_mapping(prompt_to_lora_id_mapping: List[int], batch_size: int, full_batch_size: int | None = None)[source]

Adjusts the list of prompt_to_lora_id_mapping to match the required batch size.

Mandatory Args:: prompt_to_lora_id_mapping (Optional[List[int]]): Mapping to associate prompts with their respective LoRA adapter. batch_size (int): The batch size to process at a time.
Optional Args:: full_batch_size (Optional[int]): The full batch size if different from batch_size.
Returns:: List[int]: Adjusted list of prompt_to_lora_id_mapping.

QEfficient.generation.text_generation_inference.fix_prompts(prompt: List[str], batch_size: int, full_batch_size: int | None = None)[source]

Adjusts the list of prompts to match the required batch size.

Mandatory Args:: prompt (List[str]): List of input prompts. batch_size (int): The batch size to process at a time.
Optional Args:: full_batch_size (Optional[int]): The full batch size if different from batch_size.
Returns:: List[str]: Adjusted list of prompts.

QEfficient.generation.text_generation_inference.get_compilation_dims(qpc_path: str) → Tuple[int, int, int | None][source]

Function to fetch compilation dimensions from specializations.json. Uses qpc path to compute path to specializations.json.

Args:: qpc_path (str): Path to directory comprising generated binary file after compilation.

Returns: :tuple: compilation batch size, compilation context length, compilation full batch size

Low Level API

`convert_to_cloud_kvstyle`

QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_kvstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str[source]

API to convert model with kv retention and export to ONNX. KV Style Approach-

This architecture is particularly suitable for auto-regressive tasks.

where sequence generation involves processing one token at a time.

And contextual information from earlier tokens is crucial for predicting the next token.

The inclusion of a kV cache enhances the efficiency of the decoding process, making it more computationally efficient.

Mandatory Args:

model_name (str):: Hugging Face Model Card name, Example: gpt2.
qeff_model (QEFFAutoModelForCausalLM):: Transformed KV torch model to be used.
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):: Model tokenizer.
onnx_dir_path (str):: Path to save exported ONNX file.
seq_len (int):: The length of the sequence.

Returns:

str:: Path of exported ONNX file.

`convert_to_cloud_bertstyle`

QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_bertstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str[source]

API to convert model to Bertstyle approach. Bertstyle Approach:

No Prefill/Decode separably compiled.

No KV retention logic.

KV is every time computed for all the tokens until EOS/max_length.

Mandatory Args:

model_name (str):: Hugging Face Model Card name, Example: gpt2.
qeff_model (QEFFAutoModelForCausalLM):: Transformed KV torch model to be used.
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):: Model tokenizer.
onnx_dir_path (str):: Path to save exported ONNX file.
seq_len (int):: The length of the sequence.

Returns:

str:: Path of exported ONNX file.

`utils`

QEfficient.utils.device_utils.get_available_device_id()[source]

API to check available device id.

Return:

int:: Available device id.

class QEfficient.utils.generate_inputs.InputHandler(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size)[source]

Bases: object

prepare_ort_inputs()[source]

Function responsible for creating Prefill stage numpy inputs for ONNX model to be run on ONNXRT.

Return:

Dict:: input_ids, position_ids, past_key_values

prepare_pytorch_inputs()[source]

Function responsible for creating Prefill stage tensor inputs for PyTorch model.

Return:

Dict:: input_ids, position_ids, past_key_values

update_ort_inputs(inputs, ort_outputs)[source]

Function responsible for updating Prefill stage inputs to create inputs for decode stage inputs for ONNX model to be run on ONNXRT.

Mandatory Args:

inputs (Dict):: NumPy inputs of Onnx model from previous iteration
ort_outputs (Dict):: Numpy outputs of Onnx model from previous iteration

Return:

Dict:: Updated input_ids, position_ids and past_key_values

update_ort_outputs(ort_outputs)[source]

Function responsible for updating ONNXRT session outputs.

Mandatory Args:

ort_outputs (Dict):: Numpy outputs of Onnx model from current iteration

Return:

updated_outputs (Dict): Updated past_key_values, logits

update_pytorch_inputs(inputs, pt_outputs)[source]

Function responsible for updating Prefill stage inputs to create decode stage inputs for PyTorch model.

Mandatory Args:

inputs (Dict):: Pytorch inputs from previous iteration
pt_outputs (Dict):: Pytorch outputs from previous iteration

Return:

Dict:: Updated input_ids, position_ids and past_key_values

class QEfficient.utils.generate_inputs.InputHandlerInternVL(batch_size, config, image, processor, prompt, prompt_len, ctx_len, max_gen_len, n_layer)[source]

Bases: InputHandlerVLM

prepare_pytorch_inputs()[source]

Function responsible for creating Prefill stage tensor inputs for PyTorch model.

Return:

Dict:: input_ids, position_ids, past_key_values

prepare_vlm_ort_inputs()[source]

class QEfficient.utils.generate_inputs.InputHandlerVLM(batch_size, config, image, conversation, processor, prompt, prompt_len, ctx_len, max_gen_len, n_layer)[source]

Bases: object

prepare_pytorch_inputs()[source]

Function responsible for creating Prefill stage tensor inputs for PyTorch model.

Return:

Dict:: input_ids, position_ids, past_key_values

prepare_vlm_ort_inputs()[source]

update_vlm_ort_inputs(inputs, ort_outputs)[source]

Function responsible for updating Prefill stage inputs to create inputs for decode stage inputs for ONNX model to be run on ONNXRT.

Mandatory Args:

inputs (Dict):: NumPy inputs of Onnx model from previous iteration
ort_outputs (Dict):: Numpy outputs of Onnx model from previous iteration

Return:

Dict:: Updated input_ids, position_ids, pixel_values and past_key_values

update_vlm_ort_outputs(ort_outputs)[source]

Function responsible for updating ONNXRT session outputs.

Mandatory Args:

ort_outputs (Dict):: Numpy outputs of Onnx model from current iteration

Return:

updated_outputs (Dict): Updated past_key_values, logits, pixel_values

class QEfficient.utils.run_utils.ApiRunner(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size=None)[source]

Bases: object

ApiRunner class is responsible for running:

HuggingFace PyTorch model
Transformed KV Pytorch Model
ONNX model on ONNXRT
ONNX model on Cloud AI 100

run_hf_model_on_pytorch(model_hf)[source]

Function responsible for running HuggingFace PyTorch model and return the output tokens

Mandatory Args:

model_hf (torch.nn.module):: Original PyTorch model

Return:

numpy.ndarray:: Generated output tokens

run_hf_model_on_pytorch_CB(model_hf)[source]

Function responsible for running HuggingFace PyTorch model and return the output tokens

Mandatory Args:

model_hf (torch.nn.module):: Original PyTorch model

Return:

numpy.ndarray:: Generated output tokens

run_kv_model_on_cloud_ai_100(qpc_path, device_group=None)[source]

Function responsible for running ONNX model on Cloud AI 100 and return the output tokens

Mandatory Args:

qpc_path (str):: path to qpc generated after compilation
device_group (List[int]):: Device Ids to be used for compilation. if len(device_group) > 1. Multiple Card setup is enabled.

Return:

numpy.ndarray:: Generated output tokens

run_kv_model_on_ort(model_path, is_tlm=False)[source]

Function responsible for running ONNX model on onnxruntime and return the output tokens

Mandatory Args:

model_path (str):: Path to the Onnx model.

Return:

numpy.ndarray:: Generated output tokens

run_kv_model_on_pytorch(model)[source]

Function responsible for running KV PyTorch model and return the output tokens

Mandatory Args: :model (torch.nn.module): Transformed PyTorch model

Return:

numpy.ndarray:: Generated output tokens

run_ort_session(inputs, session) → dict[source]

Function responsible for running onnxrt session with given inputs and passing retained state outputs to be used for next iteration inputs

Mandatory Args:

inputs (Dict):
session (onnxruntime.capi.onnxruntime_inference_collection.InferenceSession):

Return:

Dict:: Numpy outputs of Onnx model

class QEfficient.utils.run_utils.ApiRunnerInternVL(batch_size, processor, config, image, prompt, prompt_len, ctx_len, max_gen_len, n_layer)[source]

Bases: ApiRunnerVlm

ApiRunner for InternVL Vision models:

HuggingFace PyTorch model
Transformed KV Pytorch Model
ONNX model on ONNXRT
ONNX model on Cloud AI 100

run_vlm_hf_model_on_pytorch(model, inputs, generation_config)[source]

class QEfficient.utils.run_utils.ApiRunnerVlm(batch_size, processor, config, image, conversation, prompt, prompt_len, ctx_len, max_gen_len, n_layer)[source]

Bases: object

ApiRunnerVlm class is responsible for running Vision models:

HuggingFace PyTorch model
Transformed KV Pytorch Model
ONNX model on ONNXRT
ONNX model on Cloud AI 100

run_ort_session(inputs, session) → dict[source]

run_vlm_hf_model_on_pytorch(model, inputs)[source]

run_vlm_kv_model_on_ort(model_path)[source]

run_vlm_kv_model_on_pytorch(model)[source]

setup_ort_session(model_path)[source]

Python API

High Level API

QEFFAutoModelForCausalLM

QEFFAutoModel

QEffAutoPeftModelForCausalLM

QEffAutoLoraModelForCausalLM

QEFFAutoModelForImageTextToText

QEFFAutoModelForSpeechSeq2Seq

export

compile

Execute

Low Level API

convert_to_cloud_kvstyle

convert_to_cloud_bertstyle

utils

ApiRunner class is responsible for running:

ApiRunner for InternVL Vision models:

ApiRunnerVlm class is responsible for running Vision models:

`QEFFAutoModelForCausalLM`

`QEFFAutoModel`

`QEffAutoPeftModelForCausalLM`

`QEffAutoLoraModelForCausalLM`

`QEFFAutoModelForImageTextToText`

`QEFFAutoModelForSpeechSeq2Seq`

`export`

`compile`

`Execute`

`convert_to_cloud_kvstyle`

`convert_to_cloud_bertstyle`

`utils`