Note
Use a bash terminal. If you are using a ZSH terminal, device_group should be in single quotes, e.g. '--device_group [0]'.
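The quoting matters because zsh treats an unquoted [0] as a glob pattern and stops with a "no matches found" error when nothing matches, whereas bash passes it through unchanged. As a minimal sketch, Python's standard shlex.join can build a command line in which the bracketed value is quoted for a POSIX-style shell. The model name and core count are placeholder values, and the --model_name and --num_cores flag spellings are assumed to mirror the argument names documented on this page.

```python
import shlex

# Placeholder arguments for illustration; the flag spellings are assumed
# to mirror the documented argument names.
argv = [
    "python", "-m", "QEfficient.cloud.infer",
    "--model_name", "gpt2",
    "--num_cores", "16",
    "--device_group", "[0]",
]

# shlex.join quotes only the tokens a shell would otherwise interpret,
# so the bracketed device list ends up in single quotes.
command = shlex.join(argv)
print(command)
```

The resulting string can be pasted into either bash or zsh, since the device list is no longer subject to filename expansion.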
QEfficient.cloud.infer
1. Check if a compiled qpc for the given config already exists; if it does, jump to execute, else
2. Check if the exported ONNX file already exists; if true, jump to compilation -> execution, else
3. Check if the HF model exists in cache; if true, start transform -> export -> compilation -> execution, else
4. Download HF model -> transform -> export -> compile -> execute
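The fallback order above can be sketched as a small decision function. This is only an illustration of the documented order, not the library's implementation: plan_stages and its three boolean inputs are hypothetical names standing in for the actual existence checks.

```python
from typing import List


def plan_stages(qpc_exists: bool, onnx_exists: bool, model_cached: bool) -> List[str]:
    """Return the stages that still have to run, in order (hypothetical helper)."""
    if qpc_exists:
        # A compiled qpc is available: only execution is left.
        return ["execute"]
    if onnx_exists:
        # The ONNX export can be reused: compile it, then execute.
        return ["compile", "execute"]
    if model_cached:
        # The HF model is in the cache: skip the download only.
        return ["transform", "export", "compile", "execute"]
    # Nothing is available yet: run the full pipeline.
    return ["download", "transform", "export", "compile", "execute"]


print(plan_stages(qpc_exists=False, onnx_exists=True, model_cached=True))
# prints ['compile', 'execute']
```

Checking the most finished artifact first means each run repeats as little work as possible.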
Mandatory Args:
- model_name (str): Hugging Face Model Card name. Example: gpt2.
- num_cores (int): Number of cores to compile the model on.
Optional Args:
- device_group (List[int]): Device IDs to be used for compilation. If len(device_group) > 1, multi-card setup is enabled. Defaults to None.
- prompt (str): Sample prompt for the model text generation. Defaults to None.
- prompts_txt_file_path (str): Path to a txt file for multiple input prompts. Defaults to None.
- aic_enable_depth_first (bool): Enables DFS with default memory size. Defaults to False.
- mos (int): Effort level to reduce the on-chip memory. Defaults to -1.
- batch_size (int): Batch size to compile the model for. Defaults to 1.
- full_batch_size (int): Set full batch size to enable continuous batching mode. Defaults to None.
- prompt_len (int): Prompt length for the model to compile. Defaults to 32.
- ctx_len (int): Maximum context length to compile the model. Defaults to 128.
- generation_len (int): Number of tokens to be generated. Defaults to None.
- mxfp6 (bool): Enable compilation for MXFP6 precision. Defaults to False.
- mxint8 (bool): Compress Present/Past KV to MXINT8 using CustomIO config. Defaults to False.
- local_model_dir (str): Path to custom model weights and config files. Defaults to None.
- cache_dir (str): Cache dir where downloaded HuggingFace files are stored. Defaults to None.
- hf_token (str): HuggingFace login token to access private repos. Defaults to None.
python -m QEfficient.cloud.infer OPTIONS
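As a rough sketch of how OPTIONS might be assembled, the helper below turns a dictionary of the documented arguments into an argument list. It rests on two assumptions that this page does not confirm: every argument maps to a flag of the form --name, and boolean arguments are bare switches. build_argv is a hypothetical helper, not part of QEfficient, and the values are placeholders.

```python
from typing import Any, Dict, List


def build_argv(module: str, options: Dict[str, Any]) -> List[str]:
    """Turn keyword-style options into an argv list (hypothetical helper)."""
    argv = ["python", "-m", module]
    for name, value in options.items():
        if value is None or value is False:
            continue  # leave the argument at its documented default
        argv.append(f"--{name}")  # assumed flag spelling
        if isinstance(value, list):
            # Lists are written as one bracketed token, e.g. [0,1].
            argv.append("[" + ",".join(str(v) for v in value) + "]")
        elif value is not True:
            argv.append(str(value))  # True booleans are bare switches (assumption)
    return argv


argv = build_argv(
    "QEfficient.cloud.infer",
    {
        "model_name": "gpt2",
        "num_cores": 16,
        "device_group": [0],
        "prompt": "My name is",
        "mxfp6": True,
        "hf_token": None,
    },
)
print(argv)
```

Passing the list to subprocess.run avoids shell quoting entirely, because no shell interprets the bracketed device list.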
QEfficient.cloud.execute
Helper function used by execute CLI app to run the Model on Cloud AI 100 Platform.
Mandatory Args:
- model_name (str): Hugging Face Model Card name. Example: gpt2.
- qpc_path (str): Path to the generated binary after compilation.
Optional Args:
- device_group (List[int]): Device IDs to be used for compilation. If len(device_group) > 1, multi-card setup is enabled. Defaults to None.
- local_model_dir (str): Path to custom model weights and config files. Defaults to None.
- prompt (str): Sample prompt for the model text generation. Defaults to None.
- prompts_txt_file_path (str): Path to a txt file for multiple input prompts. Defaults to None.
- generation_len (int): Number of tokens to be generated. Defaults to None.
- cache_dir (str): Cache dir where downloaded HuggingFace files are stored. Defaults to Constants.CACHE_DIR.
- hf_token (str): HuggingFace login token to access private repos. Defaults to None.
- full_batch_size (int): Set full batch size to enable continuous batching mode. Defaults to None.
python -m QEfficient.cloud.execute OPTIONS
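The prompts_txt_file_path argument points to a plain text file holding several prompts. The file layout is not specified on this page. The sketch below assumes one prompt per non-empty line and shows how such a file could be written and read back; read_prompts is a hypothetical helper, not a QEfficient function.

```python
import tempfile
from pathlib import Path
from typing import List


def read_prompts(path: str) -> List[str]:
    """Read one prompt per non-empty line (assumed file layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Write a small example file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    prompts_file = Path(tmp) / "prompts.txt"
    prompts_file.write_text("My name is\n\nThe sky is\n", encoding="utf-8")
    prompts = read_prompts(str(prompts_file))

print(prompts)
# prints ['My name is', 'The sky is']
```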
QEfficient.cloud.compile
Compiles the given ONNX model using the Cloud AI 100 platform SDK compiler and saves the compiled qpc package at qpc_path. Generates a tensor-slicing configuration if multiple devices are passed in device_group.
This function will be deprecated soon and will be replaced by QEFFAutoModelForCausalLM.compile.
Mandatory Args:
- onnx_path (str): Generated ONNX model path.
- qpc_path (str): Path for saving compiled qpc binaries.
- num_cores (int): Number of cores to compile the model on.
Optional Args:
- device_group (List[int]): Used for finding the number of devices to compile for. Defaults to None.
- aic_enable_depth_first (bool): Enables DFS with default memory size. Defaults to False.
- mos (int): Effort level to reduce the on-chip memory. Defaults to -1.
- batch_size (int): Batch size to compile the model for. Defaults to 1.
- full_batch_size (int): Set full batch size to enable continuous batching mode. Defaults to None.
- prompt_len (int): Prompt length for the model to compile. Defaults to 32.
- ctx_len (int): Maximum context length to compile the model. Defaults to 128.
- mxfp6 (bool): Enable compilation for MXFP6 precision. Defaults to True.
- mxint8 (bool): Compress Present/Past KV to MXINT8 using CustomIO config. Defaults to False.
- custom_io_file_path (str): Path to customIO file (formatted as a string). Defaults to None.
Returns:
- str: Path to compiled qpc package.
python -m QEfficient.cloud.compile OPTIONS
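The compile description ties tensor slicing to the number of entries in device_group. A minimal sketch of that relationship follows, using two hypothetical helpers that are not part of QEfficient: one parses the bracketed command-line form (such as [0,1]) into device IDs, the other applies the "more than one device" condition.

```python
from typing import List, Optional


def parse_device_group(text: str) -> List[int]:
    """Parse a bracketed list such as "[0,1]" into device IDs (hypothetical helper)."""
    inner = text.strip().strip("[]")
    return [int(part) for part in inner.split(",") if part.strip()]


def needs_tensor_slicing(device_group: Optional[List[int]]) -> bool:
    """True when more than one device is requested."""
    return device_group is not None and len(device_group) > 1


single = parse_device_group("[0]")
multi = parse_device_group("[0, 1]")
print(single, needs_tensor_slicing(single))  # prints [0] False
print(multi, needs_tensor_slicing(multi))    # prints [0, 1] True
```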
QEfficient.cloud.export
Helper function used by export CLI app for exporting to ONNX Model.
Mandatory Args:
- model_name (str): Hugging Face Model Card name. Example: gpt2.
Optional Args:
- cache_dir (str): Cache dir where downloaded HuggingFace files are stored. Defaults to None.
- hf_token (str): HuggingFace login token to access private repos. Defaults to None.
- local_model_dir (str): Path to custom model weights and config files. Defaults to None.
- full_batch_size (int): Set full batch size to enable continuous batching mode. Defaults to None.
python -m QEfficient.cloud.export OPTIONS