Low-Level API

convert_to_cloud_kvstyle

QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_kvstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str

API to convert a model with KV retention and export it to ONNX. KV-style approach:

  1. This architecture is particularly suitable for auto-regressive tasks, where sequence generation processes one token at a time and contextual information from earlier tokens is crucial for predicting the next token.

  2. The inclusion of a KV cache makes the decoding process more computationally efficient, since keys and values for earlier tokens are retained rather than recomputed at every step.

Mandatory Args:
model_name (str):

Hugging Face model card name, e.g. gpt2.

qeff_model (QEFFAutoModelForCausalLM):

Transformed KV torch model to be used.

tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):

Model tokenizer.

onnx_dir_path (str):

Path to save exported ONNX file.

seq_len (int):

The length of the sequence.

Returns:
str:

Path of exported ONNX file.
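
A minimal usage sketch; loading via the high-level QEFFAutoModelForCausalLM.from_pretrained and the output directory are illustrative assumptions:

    from transformers import AutoTokenizer
    from QEfficient import QEFFAutoModelForCausalLM
    from QEfficient.exporter.export_hf_to_cloud_ai_100 import convert_to_cloud_kvstyle

    model_name = "gpt2"
    # Transformed KV model plus its matching tokenizer.
    qeff_model = QEFFAutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Export to ONNX; the return value is the path of the exported file.
    onnx_path = convert_to_cloud_kvstyle(
        model_name=model_name,
        qeff_model=qeff_model,
        tokenizer=tokenizer,
        onnx_dir_path="./onnx",  # illustrative output directory
        seq_len=128,
    )
    print(onnx_path)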

convert_to_cloud_bertstyle

QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_bertstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str

API to convert a model to the Bertstyle approach. Bertstyle approach:

  1. Prefill and decode are not compiled separately.

  2. No KV retention logic.

  3. KV is recomputed for all tokens at every step, until EOS/max_length is reached.

Mandatory Args:
model_name (str):

Hugging Face model card name, e.g. gpt2.

qeff_model (QEFFAutoModelForCausalLM):

Transformed KV torch model to be used.

tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):

Model tokenizer.

onnx_dir_path (str):

Path to save exported ONNX file.

seq_len (int):

The length of the sequence.

Returns:
str:

Path of exported ONNX file.
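
The call pattern mirrors convert_to_cloud_kvstyle; a minimal sketch under the same assumptions as the example above:

    from QEfficient.exporter.export_hf_to_cloud_ai_100 import convert_to_cloud_bertstyle

    # qeff_model and tokenizer as loaded in the KV-style example above.
    onnx_path = convert_to_cloud_bertstyle(
        model_name="gpt2",
        qeff_model=qeff_model,
        tokenizer=tokenizer,
        onnx_dir_path="./onnx_bertstyle",  # illustrative output directory
        seq_len=128,
    )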

utils

QEfficient.utils.device_utils.get_available_device_id()

API to check for an available device ID.

Return:
int:

Available device ID.
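
A short sketch; treating None as "no device free" is an assumption made for illustration:

    from QEfficient.utils.device_utils import get_available_device_id

    device_id = get_available_device_id()
    if device_id is not None:  # assumption: None means no device is available
        print(f"Running on Cloud AI 100 device {device_id}")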

class QEfficient.utils.generate_inputs.InputHandler(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size)

Bases: object

prepare_ort_inputs()

Function responsible for creating prefill-stage NumPy inputs for the ONNX model to be run on ONNX Runtime.

Return:
Dict:

input_ids, position_ids, past_key_values

prepare_pytorch_inputs()

Function responsible for creating prefill-stage tensor inputs for the PyTorch model.

Return:
Dict:

input_ids, position_ids, past_key_values

update_ort_inputs(inputs, ort_outputs)

Function responsible for updating prefill-stage inputs to create decode-stage inputs for the ONNX model to be run on ONNX Runtime.

Mandatory Args:
inputs (Dict):

NumPy inputs of the ONNX model from the previous iteration

ort_outputs (Dict):

NumPy outputs of the ONNX model from the previous iteration

Return:
Dict:

Updated input_ids, position_ids and past_key_values

update_ort_outputs(ort_outputs)

Function responsible for updating ONNX Runtime session outputs.

Mandatory Args:
ort_outputs (Dict):

NumPy outputs of the ONNX model from the current iteration

Return:
Dict:

Updated past_key_values and logits

update_pytorch_inputs(inputs, pt_outputs)

Function responsible for updating prefill-stage inputs to create decode-stage inputs for the PyTorch model.

Mandatory Args:
inputs (Dict):

PyTorch inputs from the previous iteration

pt_outputs (Dict):

PyTorch outputs from the previous iteration

Return:
Dict:

Updated input_ids, position_ids and past_key_values
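
Taken together, these methods support a simple prefill-then-decode loop on ONNX Runtime. A minimal sketch; the session setup, output-name mapping, greedy sampling, and the "logits" key are illustrative assumptions:

    import numpy as np
    import onnxruntime as ort
    from transformers import AutoConfig, AutoTokenizer
    from QEfficient.utils.generate_inputs import InputHandler

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    config = AutoConfig.from_pretrained("gpt2")
    handler = InputHandler(
        batch_size=1, tokenizer=tokenizer, config=config,
        prompt=["Hello"], prompt_len=32, ctx_len=128, full_batch_size=None,
    )

    session = ort.InferenceSession("model.onnx")  # illustrative exported model path
    output_names = [o.name for o in session.get_outputs()]

    def run(inputs):
        # Map positional session outputs back to a name-keyed dict.
        return dict(zip(output_names, session.run(None, inputs)))

    # Prefill: process the whole prompt once.
    inputs = handler.prepare_ort_inputs()
    ort_outputs = handler.update_ort_outputs(run(inputs))

    # Decode: one token per iteration until the context window is full.
    generated = []
    for _ in range(128 - 32):
        inputs = handler.update_ort_inputs(inputs, ort_outputs)
        ort_outputs = handler.update_ort_outputs(run(inputs))
        # Greedy pick of the next token (key name "logits" is an assumption).
        generated.append(int(np.argmax(ort_outputs["logits"][0, -1])))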

class QEfficient.utils.run_utils.ApiRunner(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size=None)

Bases: object

ApiRunner class is responsible for running:

  1. HuggingFace PyTorch model

  2. Transformed KV PyTorch model

  3. ONNX model on ONNXRT

  4. ONNX model on Cloud AI 100

run_hf_model_on_pytorch(model_hf)

Function responsible for running the HuggingFace PyTorch model and returning the output tokens

Mandatory Args:
model_hf (torch.nn.Module):

Original PyTorch model

Return:
numpy.ndarray:

Generated output tokens

run_hf_model_on_pytorch_CB(model_hf)

Continuous-batching variant of run_hf_model_on_pytorch: runs the HuggingFace PyTorch model and returns the output tokens

Mandatory Args:
model_hf (torch.nn.Module):

Original PyTorch model

Return:
numpy.ndarray:

Generated output tokens

run_kv_model_on_cloud_ai_100(qpc_path, device_group=None)

Function responsible for running the ONNX model on Cloud AI 100 and returning the output tokens

Mandatory Args:
qpc_path (str):

Path to the QPC generated after compilation.

Optional Args:
device_group (List[int]):

Device IDs to be used; if len(device_group) > 1, a multi-card setup is enabled.

Return:
numpy.ndarray:

Generated output tokens

run_kv_model_on_ort(model_path)

Function responsible for running the ONNX model on ONNX Runtime and returning the output tokens

Mandatory Args:
model_path (str):

Path to the ONNX model.

Return:
numpy.ndarray:

Generated output tokens

run_kv_model_on_pytorch(model)

Function responsible for running the transformed KV PyTorch model and returning the output tokens

Mandatory Args:
model (torch.nn.Module):

Transformed PyTorch model

Return:
numpy.ndarray:

Generated output tokens

run_ort_session(inputs, session) → dict

Function responsible for running the ONNX Runtime session with the given inputs and passing retained-state outputs forward as inputs for the next iteration

Mandatory Args:
inputs (Dict):

NumPy inputs of the ONNX model for the current iteration.

session (onnxruntime.capi.onnxruntime_inference_collection.InferenceSession):

ONNX Runtime inference session to run.

Return:
Dict:

NumPy outputs of the ONNX model
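
A minimal sketch of using ApiRunner to cross-check backends; the model name, ONNX path, and exact-match assertion are illustrative assumptions:

    import numpy as np
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
    from QEfficient.utils.run_utils import ApiRunner

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    config = AutoConfig.from_pretrained("gpt2")
    runner = ApiRunner(
        batch_size=1, tokenizer=tokenizer, config=config,
        prompt=["Hello"], prompt_len=32, ctx_len=128,
    )

    # Reference tokens from the original HuggingFace model.
    model_hf = AutoModelForCausalLM.from_pretrained("gpt2")
    hf_tokens = runner.run_hf_model_on_pytorch(model_hf)

    # Tokens from the exported ONNX model on ONNX Runtime.
    ort_tokens = runner.run_kv_model_on_ort("model.onnx")  # illustrative path

    # The exported model should reproduce the reference decoding token-for-token.
    assert np.array_equal(hf_tokens, ort_tokens)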