CLI API Reference
Note
Use a bash terminal. If using a zsh terminal, wrap device_group in single quotes, e.g. '--device_group [0]'.
QEfficient.cloud.infer
- QEfficient.cloud.infer.main(model_name: str, num_cores: int, device_group: List[int] | None = None, prompt: str | None = None, prompts_txt_file_path: str | None = None, aic_enable_depth_first: bool = False, mos: int | None = 1, batch_size: int = 1, full_batch_size: int | None = None, prompt_len: int = 32, ctx_len: int = 128, generation_len: int | None = None, mxfp6: bool = False, mxint8: bool = False, local_model_dir: str | None = None, cache_dir: str | None = None, hf_token: str | None = None, allow_mxint8_mdp_io: bool = False, enable_qnn: bool | None = False, qnn_config: str | None = None, trust_remote_code: bool | None = False, **kwargs) None [source]
Main entry point for the QEfficient inference script.
This function handles the end-to-end process of downloading, optimizing, compiling, and executing a HuggingFace model on Cloud AI 100 hardware. The process follows these steps:
Checks for an existing compiled QPC package. If found, it jumps directly to execution.
Checks for an existing exported ONNX file. If true, it proceeds to compilation then execution.
Checks if the HuggingFace model exists in the cache. If true, it performs model transformation, ONNX export, compilation, and then execution.
If none of the above, it downloads the HuggingFace model, then performs transformation, ONNX export, compilation, and execution.
- Parameters:
model_name (str) – Hugging Face Model Card name (e.g., gpt2) or path to a local model.
num_cores (int) – Number of cores to compile the model on.
device_group (List[int], optional) – List of device IDs to be used for compilation and inference. If len(device_group) > 1, a multi-card setup is enabled. Default is None.
prompt (str, optional) – Sample prompt(s) for the model text generation. For batch size > 1, pass multiple prompts separated by a pipe (|) symbol. Default is None.
prompts_txt_file_path (str, optional) – Path to a text file containing multiple input prompts, one per line. Default is None.
aic_enable_depth_first (bool, optional) – Enables Depth-First Search (DFS) with default memory size during compilation. Default is False.
mos (int, optional) – Effort level to reduce on-chip memory. Default is 1.
batch_size (int, optional) – Batch size to compile the model for. Default is 1.
full_batch_size (int, optional) – Sets the full batch size to enable continuous batching mode. Default is None.
prompt_len (int, optional) – Prompt length for the model to compile. Default is 32.
ctx_len (int, optional) – Maximum context length to compile the model for. Default is 128.
generation_len (int, optional) – Maximum number of tokens to be generated during inference. Default is None.
mxfp6 (bool, optional) – Enables compilation for MXFP6 precision for constant MatMul weights. Default is False. A warning is issued as --mxfp6 is deprecated; use --mxfp6-matmul instead.
mxint8 (bool, optional) – Compresses Present/Past KV to MXINT8 using a CustomIO config. Default is False. A warning is issued as --mxint8 is deprecated; use --mxint8-kv-cache instead.
local_model_dir (str, optional) – Path to custom model weights and config files. Default is None.
cache_dir (str, optional) – Cache directory where downloaded HuggingFace files are stored. Default is None.
hf_token (str, optional) – HuggingFace login token to access private repositories. Default is None.
allow_mxint8_mdp_io (bool, optional) – Allows MXINT8 compression of MDP IO traffic during compilation. Default is False.
enable_qnn (bool or str, optional) – Enables QNN compilation. Can be passed as a flag (True) or with a configuration file path (str). If a string path is provided, it is treated as qnn_config. Default is False.
qnn_config (str, optional) – Path of the QNN Config parameters file. Default is None.
trust_remote_code (bool, optional) – If True, trusts remote code when loading models from HuggingFace. Default is False.
**kwargs – Additional compiler options passed directly to qaic-exec. Any flag supported by qaic-exec can be passed. Parameters are converted to flags as follows:
-allocator_dealloc_delay=1 -> -allocator-dealloc-delay=1
-qpc_crc=True -> -qpc-crc
Example
To run inference from the command line:
python -m QEfficient.cloud.infer --model-name gpt2 --num-cores 16 --prompt "Hello world"
For advanced compilation options:
python -m QEfficient.cloud.infer --model-name meta-llama/Llama-3.2-11B-Vision-Instruct \
    --num-cores 16 --prompt "Describe this image." --image-url "https://example.com/image.jpg" \
    --ctx-len 512 --img-size 560 --mxfp6-matmul
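The same entry point can also be invoked from Python. The following is an illustrative sketch only, assuming QEfficient is installed and a Cloud AI 100 device is available; the trailing allocator_dealloc_delay keyword shows how extra **kwargs are forwarded to qaic-exec as described above:

# Illustrative sketch; assumes QEfficient is installed and a Cloud AI 100 device is available.
from QEfficient.cloud.infer import main as infer

infer(
    model_name="gpt2",              # Hugging Face model card name or local path
    num_cores=16,                   # number of cores to compile for
    prompt="Hello world",           # separate multiple prompts with "|" when batch size > 1
    prompt_len=32,
    ctx_len=128,
    allocator_dealloc_delay=1,      # forwarded to qaic-exec as -allocator-dealloc-delay=1
)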
QEfficient.cloud.execute
- QEfficient.cloud.execute.main(model_name: str, qpc_path: str, device_group: List[int] | None = None, local_model_dir: str | None = None, prompt: str | None = None, prompts_txt_file_path: str | None = None, generation_len: int | None = None, cache_dir: str | None = None, hf_token: str | None = None, full_batch_size: int | None = None)[source]
Main function for the QEfficient execution CLI application.
This function serves as the entry point for running a compiled model (QPC package) on the Cloud AI 100 Platform. It loads the necessary tokenizer and then orchestrates the text generation inference.
- Parameters:
model_name (str) – Hugging Face Model Card name (e.g., gpt2) for loading the tokenizer.
qpc_path (str) – Path to the generated binary (QPC package) after compilation.
device_group (List[int], optional) – List of device IDs to be used for inference. If len(device_group) > 1, a multi-card setup is enabled. Default is None.
local_model_dir (str, optional) – Path to custom model weights and config files, used if not loading tokenizer from Hugging Face Hub. Default is None.
prompt (str, optional) – Sample prompt(s) for the model text generation. For batch size > 1, pass multiple prompts separated by a pipe (|) symbol. Default is None.
prompts_txt_file_path (str, optional) – Path to a text file containing multiple input prompts, one per line. Default is None.
generation_len (int, optional) – Maximum number of tokens to be generated during inference. Default is None.
cache_dir (str, optional) – Cache directory where downloaded HuggingFace files (like tokenizer) are stored. Default is None.
hf_token (str, optional) – HuggingFace login token to access private repositories. Default is None.
full_batch_size (int, optional) – Ignored in this context as continuous batching is managed by the compiled QPC. However, it might be passed through from CLI arguments. Default is None.
Example
To execute a compiled model from the command line:
python -m QEfficient.cloud.execute --model-name gpt2 --qpc-path /path/to/qpc/binaries --prompt "Hello world"
For multi-device inference:
python -m QEfficient.cloud.execute --model-name gpt2 --qpc-path /path/to/qpc/binaries --device-group "[0,1]" --prompt "Hello | Hi"
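The documented function can also be called directly from Python. A minimal sketch, assuming a QPC package has already been compiled at the given (placeholder) path:

# Minimal sketch; qpc_path below is a placeholder for an existing compiled QPC package.
from QEfficient.cloud.execute import main as execute

execute(
    model_name="gpt2",                   # used to load the matching tokenizer
    qpc_path="/path/to/qpc/binaries",    # compiled QPC package
    device_group=[0, 1],                 # more than one device ID enables a multi-card setup
    prompt="Hello | Hi",                 # pipe-separated prompts for batch size > 1
)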
QEfficient.cloud.compile
- QEfficient.compile.compile_helper.compile(onnx_path: str, qpc_path: str, num_cores: int, device_group: List[int] | None = None, aic_enable_depth_first: bool = False, mos: int = -1, batch_size: int = 1, prompt_len: int = 32, ctx_len: int = 128, mxfp6: bool = True, mxint8: bool = False, custom_io_file_path: str | None = None, full_batch_size: int | None = None, allow_mxint8_mdp_io: bool | None = False, enable_qnn: bool | None = False, qnn_config: str | None = None, **kwargs) str [source]
Compiles the given ONNX model using either the Cloud AI 100 platform SDK compiler or the QNN compiler, and saves the compiled QPC package.
This function handles the creation of specialization files, selection of custom IO configurations, and execution of the appropriate compiler (QAIC or QNN). It supports multi-device compilation for tensor slicing.
- Parameters:
onnx_path (str) – Path to the generated ONNX model file.
qpc_path (str) – Target directory path for saving the compiled QPC binaries.
num_cores (int) – Number of cores to use for compilation.
device_group (List[int], optional) – List of device IDs. Used to determine the number of devices for multi-device compilation. Default is None.
aic_enable_depth_first (bool, optional) – If True, enables Depth-First Search (DFS) optimization with default memory size during QAIC compilation. Default is False.
mos (int, optional) – Effort level to reduce on-chip memory during QAIC compilation. A value greater than 0 applies this effort. Default is -1 (no effort).
batch_size (int, optional) – Batch size to compile the model for. Default is 1.
full_batch_size (int, optional) – Sets the full batch size to enable continuous batching mode. If provided, batch_size must be 1. Default is None.
prompt_len (int, optional) – Prompt length for the model to compile. Default is 32.
ctx_len (int, optional) – Maximum context length to compile the model for. Default is 128.
mxfp6 (bool, optional) – If True, enables MXFP6 precision for MatMul weights during compilation. Default is True.
mxint8 (bool, optional) – If True, compresses Present/Past KV to MXINT8 using a CustomIO configuration. Default is False.
custom_io_file_path (str, optional) – Explicit path to a Custom IO file (e.g., YAML format). If None, it’s inferred based on mxint8. Default is None.
allow_mxint8_mdp_io (bool, optional) – If True, allows MXINT8 compression of MDP IO traffic during QAIC compilation. Default is False.
enable_qnn (bool, optional) – If True, enables compilation using the QNN compiler instead of QAIC. Default is False.
qnn_config (str, optional) – Path to the QNN Config parameters file, used if enable_qnn is True. Default is None.
**kwargs – Additional compiler options passed directly to the chosen compiler.
- Returns:
Path to the compiled QPC package directory.
- Return type:
str
- Raises:
ValueError – If both batch_size and full_batch_size are greater than one (continuous batching requires batch_size to be 1).
FileNotFoundError – If required Custom IO files are not found.
Warning
- DeprecationWarning
This method will be removed soon; use QEFFAutoModelForCausalLM.compile instead.
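For reference, a minimal Python sketch of calling this helper directly (keeping the deprecation warning above in mind; the paths are placeholders and an exported ONNX model is assumed to exist):

# Minimal sketch of the deprecated helper; onnx_path and qpc_path are placeholders.
from QEfficient.compile.compile_helper import compile as qeff_compile

qpc_dir = qeff_compile(
    onnx_path="/path/to/model.onnx",   # ONNX file produced by the export step
    qpc_path="/path/to/qpc",           # target directory for the compiled binaries
    num_cores=16,
    device_group=[0],                  # single-device compilation
    aic_enable_depth_first=True,       # enable DFS optimization during compilation
    mxfp6=True,                        # MXFP6 MatMul weights (the default)
)
print(qpc_dir)                         # path to the compiled QPC package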
QEfficient.cloud.export
- QEfficient.cloud.export.main(model_name: str, cache_dir: str | None = None, hf_token: str | None = None, local_model_dir: str | None = None, full_batch_size: int | None = None) None [source]
Main function for the QEfficient ONNX export CLI application.
This function serves as the entry point for exporting a PyTorch model, loaded via QEFFCommonLoader, to the ONNX format. It prepares the necessary paths and calls get_onnx_model_path.
- Parameters:
model_name (str) – Hugging Face Model Card name (e.g., gpt2).
cache_dir (str, optional) – Cache directory where downloaded HuggingFace files are stored. Default is None.
hf_token (str, optional) – HuggingFace login token to access private repositories. Default is None.
local_model_dir (str, optional) – Path to custom model weights and config files. Default is None.
full_batch_size (int, optional) – Sets the full batch size to enable continuous batching mode. Default is None.
Example
To export a model from the command line:
python -m QEfficient.cloud.export --model-name gpt2 --cache-dir /path/to/cache
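The equivalent programmatic call is shown below as a minimal sketch; the cache directory is a placeholder and the model is assumed to be reachable on the Hugging Face Hub or locally:

# Minimal sketch; cache_dir is a placeholder.
from QEfficient.cloud.export import main as export

export(
    model_name="gpt2",
    cache_dir="/path/to/cache",    # where downloaded Hugging Face files are stored
    full_batch_size=None,          # set an integer here to export for continuous batching
)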
QEfficient.cloud.finetune
- QEfficient.cloud.finetune.main(**kwargs) None [source]
Fine-tune a Hugging Face model on Qualcomm AI 100 hardware with configurable training and Parameter-Efficient Fine-Tuning (PEFT) parameters.
This is the main entry point for the fine-tuning script. It orchestrates the setup of distributed training, model and tokenizer loading, DataLoader creation, optimizer and scheduler initialization, and the training loop.
- Parameters:
**kwargs – Additional arguments used to override default parameters in TrainConfig and PEFT configuration. These are typically parsed from command-line arguments.
Example
To fine-tune a model using a YAML configuration file for PEFT:
python -m QEfficient.cloud.finetune \
    --model_name "meta-llama/Llama-3.2-1B" \
    --lr 5e-4 \
    --peft_config_file "lora_config.yaml"
To fine-tune a model using a default LoRA configuration:
python -m QEfficient.cloud.finetune \
    --model_name "meta-llama/Llama-3.2-1B" \
    --lr 5e-4
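Because the entry point accepts only **kwargs, the same overrides can be passed as keyword arguments from Python. A minimal sketch mirroring the CLI example above (the argument names are taken from that example, not from an exhaustive list):

# Minimal sketch; keyword names mirror the CLI flags shown above.
from QEfficient.cloud.finetune import main as finetune

finetune(
    model_name="meta-llama/Llama-3.2-1B",
    lr=5e-4,
    peft_config_file="lora_config.yaml",  # omit to fall back to the default LoRA configuration
)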