Low Level API
convert_to_cloud_kvstyle
- QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_kvstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str [source]
API to convert a model with KV retention and export it to ONNX.
KV-style approach: this architecture is particularly suitable for auto-regressive tasks, where sequence generation processes one token at a time and contextual information from earlier tokens is crucial for predicting the next token. Retaining a KV cache avoids recomputing keys and values for past tokens, making the decoding process more computationally efficient.
Mandatory Args:
- model_name (str): Hugging Face Model Card name, Example: gpt2.
- qeff_model (QEFFAutoModelForCausalLM): Transformed KV torch model to be used.
- tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]): Model tokenizer.
- onnx_dir_path (str): Path to save the exported ONNX file.
- seq_len (int): The length of the sequence.
Returns:
- str: Path of the exported ONNX file.
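A minimal usage sketch (not verbatim from the official docs): it assumes the transformed model can be loaded via QEFFAutoModelForCausalLM from the top-level QEfficient package and the tokenizer via transformers.AutoTokenizer; the output directory is a hypothetical placeholder.

    from transformers import AutoTokenizer

    from QEfficient import QEFFAutoModelForCausalLM
    from QEfficient.exporter.export_hf_to_cloud_ai_100 import convert_to_cloud_kvstyle

    model_name = "gpt2"
    qeff_model = QEFFAutoModelForCausalLM.from_pretrained(model_name)  # assumed loading path for the KV-transformed model
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    onnx_path = convert_to_cloud_kvstyle(
        model_name=model_name,
        qeff_model=qeff_model,
        tokenizer=tokenizer,
        onnx_dir_path="./onnx_gpt2_kv",  # hypothetical output directory
        seq_len=128,
    )
    print(onnx_path)  # path of the exported ONNX file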
convert_to_cloud_bertstyle
- QEfficient.exporter.export_hf_to_cloud_ai_100.convert_to_cloud_bertstyle(model_name: str, qeff_model: QEFFAutoModelForCausalLM, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, onnx_dir_path: str, seq_len: int) → str [source]
API to convert a model to the BERT-style approach.
BERT-style approach:
- Prefill and decode are not compiled separately.
- No KV retention logic.
- KV is recomputed for all tokens at every step, until EOS/max_length.
Mandatory Args:
- model_name (str): Hugging Face Model Card name, Example: gpt2.
- qeff_model (QEFFAutoModelForCausalLM): Transformed KV torch model to be used.
- tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]): Model tokenizer.
- onnx_dir_path (str): Path to save the exported ONNX file.
- seq_len (int): The length of the sequence.
Returns:
- str: Path of the exported ONNX file.
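convert_to_cloud_bertstyle shares the same signature; a hedged sketch, reusing the qeff_model and tokenizer loaded in the KV-style example above (output directory again hypothetical):

    from QEfficient.exporter.export_hf_to_cloud_ai_100 import convert_to_cloud_bertstyle

    onnx_path = convert_to_cloud_bertstyle(
        model_name="gpt2",
        qeff_model=qeff_model,               # loaded as in the KV-style example
        tokenizer=tokenizer,
        onnx_dir_path="./onnx_gpt2_bert",    # hypothetical output directory
        seq_len=128,
    )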
utils
- QEfficient.utils.device_utils.get_available_device_id()[source]
API to check for an available device ID.
- Return:
- int: Available device ID.
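A short sketch of guarding a compile/run step on device availability. The exact value returned when no device is free is an assumption here, so treat the None check as illustrative:

    from QEfficient.utils.device_utils import get_available_device_id

    device_id = get_available_device_id()
    if device_id is None:  # assumption: None signals that no device is free
        raise RuntimeError("No Cloud AI 100 device available")
    print(f"Compiling on device id {device_id}")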
- class QEfficient.utils.generate_inputs.InputHandler(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size)[source]
Bases:
object
- prepare_ort_inputs()[source]
Function responsible for creating prefill-stage NumPy inputs for the ONNX model to be run on ONNXRT.
- Return:
- Dict: input_ids, position_ids, past_key_values
- prepare_pytorch_inputs()[source]
Function responsible for creating prefill-stage tensor inputs for the PyTorch model.
- Return:
- Dict: input_ids, position_ids, past_key_values
- update_ort_inputs(inputs, ort_outputs)[source]
Function responsible for updating prefill-stage inputs to create decode-stage inputs for the ONNX model to be run on ONNXRT.
Mandatory Args:
- inputs (Dict): NumPy inputs of the ONNX model from the previous iteration.
- ort_outputs (Dict): NumPy outputs of the ONNX model from the previous iteration.
- Return:
- Dict: Updated input_ids, position_ids and past_key_values
- update_ort_outputs(ort_outputs)[source]
Function responsible for updating ONNXRT session outputs.
Mandatory Args:
- ort_outputs (Dict): NumPy outputs of the ONNX model from the current iteration.
- Return:
- updated_outputs (Dict): Updated past_key_values, logits
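Taken together, prepare_ort_inputs, update_ort_inputs, and update_ort_outputs support a prefill-then-decode loop on ONNXRT. A hedged sketch follows; the ONNX model path, the output-name bookkeeping, and the fixed decode length are illustrative assumptions (ApiRunner.run_ort_session, documented below, packages the same name-zipping):

    import onnxruntime as ort
    from transformers import AutoConfig, AutoTokenizer

    from QEfficient.utils.generate_inputs import InputHandler

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    config = AutoConfig.from_pretrained("gpt2")
    handler = InputHandler(
        batch_size=1,
        tokenizer=tokenizer,
        config=config,
        prompt=["My name is"],
        prompt_len=32,
        ctx_len=128,
        full_batch_size=None,
    )

    session = ort.InferenceSession("path/to/model.onnx")  # hypothetical exported model
    output_names = [o.name for o in session.get_outputs()]

    # Prefill: build the first set of NumPy inputs and run them.
    inputs = handler.prepare_ort_inputs()
    ort_outputs = dict(zip(output_names, session.run(None, inputs)))
    ort_outputs = handler.update_ort_outputs(ort_outputs)

    # Decode: feed each iteration's outputs back in as the next iteration's inputs.
    for _ in range(16):  # illustrative fixed decode length
        inputs = handler.update_ort_inputs(inputs, ort_outputs)
        ort_outputs = dict(zip(output_names, session.run(None, inputs)))
        ort_outputs = handler.update_ort_outputs(ort_outputs)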
- update_pytorch_inputs(inputs, pt_outputs)[source]
Function responsible for updating prefill-stage inputs to create decode-stage inputs for the PyTorch model.
Mandatory Args:
- inputs (Dict): PyTorch inputs from the previous iteration.
- pt_outputs (Dict): PyTorch outputs from the previous iteration.
- Return:
- Dict: Updated input_ids, position_ids and past_key_values
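The PyTorch path mirrors the same loop. A short sketch, reusing handler from the ONNXRT example above and assuming model is a transformed KV PyTorch model whose forward accepts the prepared dict as keyword arguments (an assumption; ApiRunner.run_kv_model_on_pytorch, documented below, wraps this flow):

    # Prefill pass with tensor inputs, then decode by feeding outputs back in.
    inputs = handler.prepare_pytorch_inputs()
    pt_outputs = model(**inputs)              # assumed forward signature
    for _ in range(16):                       # illustrative fixed decode length
        inputs = handler.update_pytorch_inputs(inputs, pt_outputs)
        pt_outputs = model(**inputs)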
- class QEfficient.utils.run_utils.ApiRunner(batch_size, tokenizer, config, prompt, prompt_len, ctx_len, full_batch_size=None)[source]
Bases:
object
ApiRunner class is responsible for running:
- HuggingFace PyTorch model
- Transformed KV PyTorch model
- ONNX model on ONNXRT
- ONNX model on Cloud AI 100
- run_hf_model_on_pytorch(model_hf)[source]
Function responsible for running the HuggingFace PyTorch model and returning the output tokens.
Mandatory Args:
- model_hf (torch.nn.Module): Original PyTorch model.
- Return:
- numpy.ndarray: Generated output tokens
- run_hf_model_on_pytorch_CB(model_hf)[source]
Function responsible for running the HuggingFace PyTorch model with continuous batching and returning the output tokens.
Mandatory Args:
- model_hf (torch.nn.Module): Original PyTorch model.
- Return:
- numpy.ndarray: Generated output tokens
- run_kv_model_on_cloud_ai_100(qpc_path, device_group=None)[source]
Function responsible for running the ONNX model on Cloud AI 100 and returning the output tokens.
Mandatory Args:
- qpc_path (str): Path to the QPC generated after compilation.
- device_group (List[int]): Device IDs to be used for compilation; if len(device_group) > 1, multi-card setup is enabled.
- Return:
- numpy.ndarray: Generated output tokens
- run_kv_model_on_ort(model_path)[source]
Function responsible for running the ONNX model on ONNX Runtime and returning the output tokens.
Mandatory Args:
- model_path (str): Path to the ONNX model.
- Return:
- numpy.ndarray: Generated output tokens
- run_kv_model_on_pytorch(model)[source]
Function responsible for running the KV PyTorch model and returning the output tokens.
Mandatory Args:
- model (torch.nn.Module): Transformed PyTorch model.
- Return:
- numpy.ndarray: Generated output tokens
- run_ort_session(inputs, session) → dict [source]
Function responsible for running an ONNXRT session with the given inputs and passing the retained-state outputs on as inputs for the next iteration.
Mandatory Args:
- inputs (Dict): NumPy inputs for the ONNX model.
- session (onnxruntime.capi.onnxruntime_inference_collection.InferenceSession): Active ONNXRT inference session.
- Return:
- Dict: NumPy outputs of the ONNX model
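An end-to-end validation sketch with ApiRunner, comparing token outputs across backends. Model loading and the ONNX/QPC paths are assumptions; substitute the artifacts produced by your own export and compile steps:

    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    from QEfficient.utils.run_utils import ApiRunner

    model_name = "gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)

    runner = ApiRunner(
        batch_size=1,
        tokenizer=tokenizer,
        config=config,
        prompt=["My name is"],
        prompt_len=32,
        ctx_len=128,
    )

    # Reference tokens from the original HuggingFace PyTorch model.
    model_hf = AutoModelForCausalLM.from_pretrained(model_name)
    hf_tokens = runner.run_hf_model_on_pytorch(model_hf)

    # Tokens from the exported ONNX model on ONNXRT (hypothetical path).
    ort_tokens = runner.run_kv_model_on_ort("path/to/model.onnx")

    # Tokens from the compiled QPC on Cloud AI 100; >1 device id enables multi-card.
    aic_tokens = runner.run_kv_model_on_cloud_ai_100("path/to/qpc", device_group=[0])

    assert (hf_tokens == ort_tokens).all(), "ORT output diverges from PyTorch"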