Note
Use a bash terminal. If you are using a ZSH terminal, device_group should be in single quotes, e.g. '--device_group [0]'.
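zsh treats unquoted square brackets as a filename pattern, which is presumably why the note asks for quotes. The sketch below is one way to produce a shell-safe fragment with Python's standard shlex module. The helper name device_group_flag is hypothetical, and it assumes that quoting only the bracketed value is sufficient and that the value is written without spaces (for example [0,1]).

```python
import shlex


def device_group_flag(device_ids):
    """Return a shell-safe '--device_group [...]' fragment (hypothetical helper)."""
    # Assumption: the CLI accepts a bracketed, comma-separated list such as [0,1].
    value = "[" + ",".join(str(i) for i in device_ids) + "]"
    # shlex.quote wraps the value in single quotes because '[' is not shell-safe.
    return "--device_group " + shlex.quote(value)


print(device_group_flag([0]))  # prints: --device_group '[0]'
```

The quoted form is harmless in bash as well, so the same command line can be used in both shells.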
QEfficient.cloud.infer
1. Check if compiled qpc for given config already exists; if it does, jump to execute, else
2. Check if exported ONNX file already exists; if true, jump to compilation -> execution, else
3. Check if HF model exists in cache; if true, start transform -> export -> compilation -> execution, else
4. Download HF model -> transform -> export -> compile -> execute
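The fallback order above can be sketched as a small pure function. This only illustrates the decision logic as described and is not the library's implementation. The function name plan_infer_stages and its three boolean parameters are hypothetical.

```python
from typing import List


def plan_infer_stages(qpc_exists: bool, onnx_exists: bool, model_cached: bool) -> List[str]:
    """Return the stages infer would need to run, per the four checks above."""
    if qpc_exists:
        # Step 1: a compiled qpc already exists, so only execution remains.
        return ["execute"]
    if onnx_exists:
        # Step 2: an exported ONNX file exists, so compile it and execute.
        return ["compile", "execute"]
    # Steps 3 and 4: the model must be transformed, exported, compiled and executed.
    stages = ["transform", "export", "compile", "execute"]
    if not model_cached:
        # Step 4: the HF model is not in the cache, so download it first.
        stages.insert(0, "download")
    return stages


print(plan_infer_stages(qpc_exists=False, onnx_exists=True, model_cached=True))
# prints: ['compile', 'execute']
```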
Mandatory Args:
- model_name (str):
Hugging Face Model Card name, Example: gpt2.
- num_cores (int):
Number of cores to compile model on.
Optional Args:
- device_group (List[int]):
Device Ids to be used for compilation. If len(device_group) > 1, multiple Card setup is enabled.
Defaults to None.
- prompt (str):
Sample prompt for the model text generation.
Defaults to None.
- prompts_txt_file_path (str):
Path to txt file for multiple input prompts.
Defaults to None.
- aic_enable_depth_first (bool):
Enables DFS with default memory size.
Defaults to False.
- mos (int):
Effort level to reduce the on-chip memory.
Defaults to -1.
- batch_size (int):
Batch size to compile the model for.
Defaults to 1.
- full_batch_size (int):
Set full batch size to enable continuous batching mode.
Defaults to None.
- prompt_len (int):
Prompt length for the model to compile.
Defaults to 32.
- ctx_len (int):
Maximum context length to compile the model.
Defaults to 128.
- generation_len (int):
Number of tokens to be generated.
Defaults to None.
- mxfp6 (bool):
Enable compilation for MXFP6 precision.
Defaults to False.
- mxint8 (bool):
Compress Present/Past KV to MXINT8 using CustomIO config.
Defaults to False.
- local_model_dir (str):
Path to custom model weights and config files.
Defaults to None.
- cache_dir (str):
Cache dir where downloaded HuggingFace files are stored.
Defaults to None.
- hf_token (str):
HuggingFace login token to access private repos.
Defaults to None.
python -m QEfficient.cloud.infer OPTIONS
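As a rough illustration of what OPTIONS might look like, the sketch below assembles an argument list for the infer app. It assumes each documented argument maps to a --<name> <value> flag, which matches the --device_group [0] form in the note above but is not confirmed here for every option. The helper build_infer_argv is hypothetical, the value 16 for num_cores is only a placeholder, and boolean options are rejected because their command-line form is not documented in this section.

```python
import sys
from typing import List


def build_infer_argv(model_name: str, num_cores: int, **options) -> List[str]:
    """Build an argv list for `python -m QEfficient.cloud.infer` (hypothetical helper)."""
    argv = [sys.executable, "-m", "QEfficient.cloud.infer",
            "--model_name", model_name, "--num_cores", str(num_cores)]
    for name, value in options.items():
        if value is None:
            continue  # leave the option at its default
        if isinstance(value, bool):
            # The flag syntax for boolean options is not documented here.
            raise TypeError(f"boolean option {name!r} is not handled by this sketch")
        if isinstance(value, (list, tuple)):
            # Assumption: lists are passed in bracketed form, e.g. [0,1].
            value = "[" + ",".join(str(v) for v in value) + "]"
        argv += [f"--{name}", str(value)]
    return argv


argv = build_infer_argv("gpt2", 16, device_group=[0], prompt="My name is",
                        prompt_len=32, ctx_len=128)
print(argv[1:])
```

Passing such a list to subprocess.run without shell=True bypasses the shell entirely, so the zsh quoting issue does not arise.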
QEfficient.cloud.execute
Helper function used by execute CLI app to run the Model on Cloud AI 100 Platform.
Mandatory Args:
- model_name (str):
Hugging Face Model Card name, Example: gpt2.
- qpc_path (str):
Path to the generated binary after compilation.
Optional Args:
- device_group (List[int]):
Device Ids to be used for compilation. If len(device_group) > 1, multiple Card setup is enabled.
Defaults to None.
- local_model_dir (str):
Path to custom model weights and config files.
Defaults to None.
- prompt (str):
Sample prompt for the model text generation.
Defaults to None.
- prompts_txt_file_path (str):
Path to txt file for multiple input prompts.
Defaults to None.
- generation_len (int):
Number of tokens to be generated.
Defaults to None.
- cache_dir (str):
Cache dir where downloaded HuggingFace files are stored.
Defaults to Constants.CACHE_DIR.
- hf_token (str):
HuggingFace login token to access private repos.
Defaults to None.
- full_batch_size (int):
Set full batch size to enable continuous batching mode.
Defaults to None.
python -m QEfficient.cloud.execute OPTIONS
QEfficient.cloud.compile
Compiles the given ONNX model using Cloud AI 100 platform SDK compiler and saves the compiled qpc package at qpc_path. Generates tensor-slicing configuration if multiple devices are passed in device_group.
This function will be deprecated soon and will be replaced by QEFFAutoModelForCausalLM.compile.
Mandatory Args:
- onnx_path (str):
Generated ONNX Model Path.
- qpc_path (str):
Path for saving compiled qpc binaries.
- num_cores (int):
Number of cores to compile the model on.
Optional Args:
- device_group (List[int]):
Used for finding the number of devices to compile for.
Defaults to None.
- aic_enable_depth_first (bool):
Enables DFS with default memory size.
Defaults to False.
- mos (int):
Effort level to reduce the on-chip memory.
Defaults to -1.
- batch_size (int):
Batch size to compile the model for.
Defaults to 1.
- full_batch_size (int):
Set full batch size to enable continuous batching mode.
Defaults to None.
- prompt_len (int):
Prompt length for the model to compile.
Defaults to 32.
- ctx_len (int):
Maximum context length to compile the model.
Defaults to 128.
- mxfp6 (bool):
Enable compilation for MXFP6 precision.
Defaults to True.
- mxint8 (bool):
Compress Present/Past KV to MXINT8 using CustomIO config.
Defaults to False.
- custom_io_file_path (str):
Path to customIO file (formatted as a string).
Defaults to None.
Returns:
- str:
Path to compiled qpc package.
python -m QEfficient.cloud.compile OPTIONS
QEfficient.cloud.export
Helper function used by export CLI app for exporting to ONNX Model.
Mandatory Args:
- model_name (str):
Hugging Face Model Card name, Example: gpt2.
Optional Args:
- cache_dir (str):
Cache dir where downloaded HuggingFace files are stored.
Defaults to None.
- hf_token (str):
HuggingFace login token to access private repos.
Defaults to None.
- local_model_dir (str):
Path to custom model weights and config files.
Defaults to None.
- full_batch_size (int):
Set full batch size to enable continuous batching mode.
Defaults to None.
python -m QEfficient.cloud.export OPTIONS