Note

Use a bash terminal. If using a zsh terminal, device_group should be passed in single quotes, e.g. '--device_group [0]'.

QEfficient.cloud.infer

  1. Check if a compiled qpc for the given config already exists; if it does, jump to execution. Else,

  2. Check if the exported ONNX file already exists; if true, jump to compilation -> execution. Else,

  3. Check if the HF model exists in the cache; if true, start transform -> export -> compilation -> execution. Else,

  4. Download the HF model -> transform -> export -> compile -> execute.

Mandatory Args:

model_name (str):

Hugging Face Model Card name, Example: gpt2.

num_cores (int):

Number of cores to compile model on.

Optional Args:
device_group (List[int]):

Device IDs to be used for compilation. If len(device_group) > 1, a multi-card setup is enabled. Defaults to None.

prompt (str):

Sample prompt for the model text generation. Defaults to None.

prompts_txt_file_path (str):

Path to txt file for multiple input prompts. Defaults to None.

aic_enable_depth_first (bool):

Enables DFS with default memory size. Defaults to False.

mos (int):

Effort level to reduce the on-chip memory. Defaults to -1.

batch_size (int):

Batch size to compile the model for. Defaults to 1.

full_batch_size (int):

Set full batch size to enable continuous batching mode. Defaults to None.

prompt_len (int):

Prompt length for the model to compile. Defaults to 32.

ctx_len (int):

Maximum context length to compile the model. Defaults to 128.

generation_len (int):

Number of tokens to be generated. Defaults to None.

mxfp6 (bool):

Enable compilation for MXFP6 precision. Defaults to False.

mxint8 (bool):

Compress Present/Past KV to MXINT8 using CustomIO config. Defaults to False.

local_model_dir (str):

Path to custom model weights and config files. Defaults to None.

cache_dir (str):

Cache dir where downloaded HuggingFace files are stored. Defaults to None.

hf_token (str):

HuggingFace login token to access private repos. Defaults to None.

python -m QEfficient.cloud.infer OPTIONS
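
For example, an illustrative end-to-end invocation (the values below are sample settings, not defaults; adjust them for your model and hardware):

    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is" --mxfp6 --aic_enable_depth_first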

QEfficient.cloud.execute

Helper function used by the execute CLI app to run the model on the Cloud AI 100 platform.

Mandatory Args:
model_name (str):

Hugging Face Model Card name, Example: gpt2.

qpc_path (str):

Path to the generated binary after compilation.

Optional Args:
device_group (List[int]):

Device IDs to be used for execution. If len(device_group) > 1, a multi-card setup is enabled. Defaults to None.

local_model_dir (str):

Path to custom model weights and config files. Defaults to None.

prompt (str):

Sample prompt for the model text generation. Defaults to None.

prompts_txt_file_path (str):

Path to txt file for multiple input prompts. Defaults to None.

generation_len (int):

Number of tokens to be generated. Defaults to None.

cache_dir (str):

Cache dir where downloaded HuggingFace files are stored. Defaults to Constants.CACHE_DIR.

hf_token (str):

HuggingFace login token to access private repos. Defaults to None.

full_batch_size (int):

Set full batch size to enable continuous batching mode. Defaults to None.

python -m QEfficient.cloud.execute OPTIONS
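
For example, a sketch of re-running an already-compiled model (the qpc path below is a placeholder; point --qpc_path at the package produced during compilation):

    python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path <path/to/qpcs> --prompt "My name is" --device_group [0]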

QEfficient.cloud.compile

Compiles the given ONNX model using Cloud AI 100 platform SDK compiler and saves the compiled qpc package at qpc_path. Generates tensor-slicing configuration if multiple devices are passed in device_group.

This function will be deprecated soon and will be replaced by QEFFAutoModelForCausalLM.compile.
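
A minimal sketch of that replacement API, assuming the QEFFAutoModelForCausalLM entry point exported by the QEfficient package (treat the exact signature as an assumption; argument names mirror the CLI flags documented here):

    from QEfficient import QEFFAutoModelForCausalLM

    # Load the HF model through the QEfficient wrapper and compile it
    # for Cloud AI 100 (num_cores mirrors the --num_cores CLI flag).
    model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")
    qpc_path = model.compile(num_cores=16)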

Mandatory Args:
onnx_path (str):

Generated ONNX Model Path.

qpc_path (str):

Path for saving compiled qpc binaries.

num_cores (int):

Number of cores to compile the model on.

Optional Args:
device_group (List[int]):

Used for finding the number of devices to compile for. Defaults to None.

aic_enable_depth_first (bool):

Enables DFS with default memory size. Defaults to False.

mos (int):

Effort level to reduce the on-chip memory. Defaults to -1.

batch_size (int):

Batch size to compile the model for. Defaults to 1.

full_batch_size (int):

Set full batch size to enable continuous batching mode. Defaults to None.

prompt_len (int):

Prompt length for the model to compile. Defaults to 32.

ctx_len (int):

Maximum context length to compile the model. Defaults to 128.

mxfp6 (bool):

Enable compilation for MXFP6 precision. Defaults to True.

mxint8 (bool):

Compress Present/Past KV to MXINT8 using CustomIO config. Defaults to False.

custom_io_file_path (str):

Path to customIO file (formatted as a string). Defaults to None.

Returns:
str:

Path to compiled qpc package.

python -m QEfficient.cloud.compile OPTIONS
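
For example, an illustrative invocation (both paths are placeholders for your exported ONNX model and desired output directory):

    python -m QEfficient.cloud.compile --onnx_path <path/to/model.onnx> --qpc_path <output/qpc/dir> --num_cores 16 --device_group [0]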

QEfficient.cloud.export

Helper function used by the export CLI app to export the model to ONNX.

Mandatory Args:
model_name (str):

Hugging Face Model Card name, Example: gpt2.

Optional Args:
cache_dir (str):

Cache dir where downloaded HuggingFace files are stored. Defaults to None.

hf_token (str):

HuggingFace login token to access private repos. Defaults to None.

local_model_dir (str):

Path to custom model weights and config files. Defaults to None.

full_batch_size (int):

Set full batch size to enable continuous batching mode. Defaults to None.

python -m QEfficient.cloud.export OPTIONS
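
For example, an illustrative invocation that downloads, transforms, and exports gpt2 to ONNX:

    python -m QEfficient.cloud.export --model_name gpt2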