Low Level API

convert_to_cloud_kvstyle

QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_kvstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str [source]

API to convert a model with KV retention and export it to ONNX. KV-style approach:

  1. This architecture is particularly suitable for auto-regressive tasks, where sequence generation processes one token at a time and contextual information from earlier tokens is crucial for predicting the next token.

  2. The inclusion of a KV cache makes the decoding process more computationally efficient.

Mandatory Args:
  model_name (str): Hugging Face Model Card name, e.g. gpt2.
  qeff_model (QEFFAutoModelForCausalLM): Transformed KV torch model to be used.
  tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]): Model tokenizer.
  onnx_dir_path (str): Path to save the exported ONNX file.
  seq_len (int): Length of the sequence.

Returns:
  str: Path of the exported ONNX file.
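
A minimal usage sketch follows. Only the convert_to_cloud_kvstyle signature above is taken from this page; the loading helpers (QEFFAutoModelForCausalLM.from_pretrained, transformers.AutoTokenizer) and all argument values are assumptions for illustration and may differ across versions:

    from transformers import AutoTokenizer

    from QEfficient import QEFFAutoModelForCausalLM
    from QEfficient.exporter.export_hf_to_cloud_ai_100 import convert_to_cloud_kvstyle

    model_name = "gpt2"

    # Assumed loading path: a KV-transformed QEfficient model and its HF tokenizer.
    qeff_model = QEFFAutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    onnx_path = convert_to_cloud_kvstyle(
        model_name=model_name,
        qeff_model=qeff_model,
        tokenizer=tokenizer,
        onnx_dir_path="./onnx_kvstyle",  # directory for the exported artifacts
        seq_len=128,                     # sequence length used for the export
    )
    print(onnx_path)  # path of the exported ONNX file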

convert_to_cloud_bertstyle

QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_bertstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str [source]

API to convert a model to the Bert-style approach. Bert-style approach:

  1. Prefill and decode are not compiled separately.

  2. No KV retention logic.

  3. KV is recomputed for all tokens on every step until EOS/max_length.

Mandatory Args:
  model_name (str): Hugging Face Model Card name, e.g. gpt2.
  qeff_model (QEFFAutoModelForCausalLM): Transformed torch model to be used.
  tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]): Model tokenizer.
  onnx_dir_path (str): Path to save the exported ONNX file.
  seq_len (int): Length of the sequence.

Returns:
  str: Path of the exported ONNX file.
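
The call mirrors convert_to_cloud_kvstyle; a short sketch under the same assumptions (loading helpers and argument values are illustrative, not prescribed by this page):

    from transformers import AutoTokenizer

    from QEfficient import QEFFAutoModelForCausalLM
    from QEfficient.exporter.export_hf_to_cloud_ai_100 import convert_to_cloud_bertstyle

    model_name = "gpt2"
    qeff_model = QEFFAutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Bert-style export: no KV retention, so the full sequence is recomputed each step.
    onnx_path = convert_to_cloud_bertstyle(
        model_name=model_name,
        qeff_model=qeff_model,
        tokenizer=tokenizer,
        onnx_dir_path="./onnx_bertstyle",
        seq_len=128,
    )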

utils

QEfficient.utils.device_utils.get_available_device_id()[source]

API to check for an available device id.

Return:
  int: Available device id.
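
A small guard built on this call; whether it signals "no device" with None is an assumption, so treat the check as a sketch:

    from QEfficient.utils.device_utils import get_available_device_id

    device_id = get_available_device_id()
    if device_id is not None:
        print(f"Using Cloud AI 100 device {device_id}")
    else:
        # Assumed failure convention; adapt to what your installed version returns.
        print("No Cloud AI 100 device available")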

class QEfficient.utils.generate_inputs.InputHandler(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size)[source]

Bases: object
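
Constructing an InputHandler needs only a tokenizer and a model config; the values below (and the meaning of full_batch_size=None) are illustrative assumptions:

    from transformers import AutoConfig, AutoTokenizer

    from QEfficient.utils.generate_inputs import InputHandler

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    input_handler = InputHandler(
        batch_size=1,
        tokenizer=tokenizer,
        config=config,
        prompt=["My name is"],
        prompt_len=32,         # prefill length (illustrative)
        ctx_len=128,           # total context length (illustrative)
        full_batch_size=None,  # assumed: None disables continuous batching
    )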

prepare_ort_inputs()[source]

Function responsible for creating prefill-stage NumPy inputs for the ONNX model to be run on ONNXRT.

Return:
  Dict: input_ids, position_ids, past_key_values

prepare_pytorch_inputs()[source]

Function responsible for creating prefill-stage tensor inputs for the PyTorch model.

Return:
  Dict: input_ids, position_ids, past_key_values

update_ort_inputs(inputs, ort_outputs)[source]

Function responsible for updating prefill-stage inputs to create decode-stage inputs for the ONNX model to be run on ONNXRT.

Mandatory Args:
  inputs (Dict): NumPy inputs of the ONNX model from the previous iteration.
  ort_outputs (Dict): NumPy outputs of the ONNX model from the previous iteration.

Return:
  Dict: Updated input_ids, position_ids and past_key_values

update_ort_outputs(ort_outputs)[source]

Function responsible for updating ONNXRT session outputs.

Mandatory Args:
  ort_outputs (Dict): NumPy outputs of the ONNX model from the current iteration.

Return:
  Dict: Updated past_key_values and logits

update_pytorch_inputs(inputs, pt_outputs)[source]

Function responsible for updating prefill-stage inputs to create decode-stage inputs for the PyTorch model.

Mandatory Args:
  inputs (Dict): PyTorch inputs from the previous iteration.
  pt_outputs (Dict): PyTorch outputs from the previous iteration.

Return:
  Dict: Updated input_ids, position_ids and past_key_values
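
The ORT-side methods are designed to chain into a prefill-then-decode loop. A sketch of that loop, assuming the input_handler built above, an exported KV-style model at model.onnx, and greedy token selection; the exact hand-off between update_ort_outputs and update_ort_inputs is inferred from the descriptions on this page:

    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model.onnx")  # assumed export path
    output_names = [out.name for out in session.get_outputs()]

    inputs = input_handler.prepare_ort_inputs()  # prefill-stage NumPy inputs
    generated = []
    for _ in range(32):  # illustrative decode budget
        raw = dict(zip(output_names, session.run(output_names, inputs)))
        outputs = input_handler.update_ort_outputs(raw)  # past_key_values, logits
        # Assumes logits shaped (batch, seq, vocab); take the last position greedily.
        generated.append(int(np.argmax(outputs["logits"][0, -1])))
        inputs = input_handler.update_ort_inputs(inputs, outputs)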

class QEfficient.utils.run_utils.ApiRunner(batch_size, tokenizer, config, prompt, prompt_len, ctx_len)[source]

Bases: object

ApiRunner class is responsible for running:

  1. HuggingFace PyTorch model

  2. Transformed KV PyTorch model

  3. ONNX model on ONNXRT

  4. ONNX model on Cloud AI 100
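
Construction mirrors InputHandler minus full_batch_size; the values below are illustrative assumptions:

    from transformers import AutoConfig, AutoTokenizer

    from QEfficient.utils.run_utils import ApiRunner

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    runner = ApiRunner(
        batch_size=1,
        tokenizer=tokenizer,
        config=config,
        prompt=["My name is"],
        prompt_len=32,  # illustrative prefill length
        ctx_len=128,    # illustrative context length
    )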

run_hf_model_on_pytorch(model_hf)[source]

Function responsible for running the HuggingFace PyTorch model and returning the output tokens.

Mandatory Args:
  model_hf (torch.nn.module): Original PyTorch model.

Return:
  numpy.ndarray: Generated output tokens

run_kv_model_on_cloud_ai_100(qpc_path, device_group=None)[source]

Function responsible for running the ONNX model on Cloud AI 100 and returning the output tokens.

Mandatory Args:
  qpc_path (str): Path to the QPC generated after compilation.
  device_group (List[int]): Device ids to be used. If len(device_group) > 1, a multi-card setup is enabled.

Return:
  numpy.ndarray: Generated output tokens

run_kv_model_on_ort(model_path)[source]

Function responsible for running the ONNX model on ONNXRT and returning the output tokens.

Mandatory Args:
  model_path (str): Path to the ONNX model.

Return:
  numpy.ndarray: Generated output tokens

run_kv_model_on_pytorch(model)[source]

Function responsible for running the KV PyTorch model and returning the output tokens.

Mandatory Args:
  model (torch.nn.module): Transformed PyTorch model.

Return:
  numpy.ndarray: Generated output tokens

run_ort_session(inputs, session) → dict [source]

Function responsible for running an ONNXRT session with the given inputs and passing the retained-state outputs on as inputs for the next iteration.

Mandatory Args:
  inputs (Dict): NumPy inputs for the current iteration.
  session (onnxruntime.capi.onnxruntime_inference_collection.InferenceSession): ONNXRT inference session.

Return:
  Dict: NumPy outputs of the ONNX model
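
Put together, the runner supports cross-backend parity checks. A sketch, assuming the runner built above and an exported model at model.onnx (both illustrative); only the two method signatures are taken from this page:

    import numpy as np
    from transformers import AutoModelForCausalLM

    # Reference tokens from the original HuggingFace PyTorch model.
    model_hf = AutoModelForCausalLM.from_pretrained("gpt2")
    pt_tokens = runner.run_hf_model_on_pytorch(model_hf)

    # Tokens from the exported ONNX model on ONNXRT.
    ort_tokens = runner.run_kv_model_on_ort("model.onnx")

    # Greedy generation should match token-for-token if the export is faithful.
    assert np.array_equal(pt_tokens, ort_tokens), "backend outputs diverged"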