This page gives you an overview of all the APIs that you might need to integrate QEfficient into your Python applications.
High Level API
QEFFAutoModelForCausalLM
- class QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCausalLM(model: Module, pretrained_model_name_or_path: str, **kwargs)[source]
The QEFF class is designed for manipulating any causal language model from the HuggingFace hub. Although it is possible to initialize the class directly, we highly recommend using the from_pretrained method for initialization. Please note that the QEFF class is also a part of the QEfficient module.
Mandatory Args:
- model (nn.Module):
PyTorch model
- pretrained_model_name_or_path (str):
We recommend passing the name of the model as input here, since you are not using the from_pretrained method. This name will be used to decide the path of the ONNX/qpc files generated during the export and compilation stages.
from QEfficient import QEFFAutoModelForCausalLM
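For reference, a minimal end-to-end sketch using the recommended from_pretrained flow; the model card "gpt2", core count, and device IDs below are illustrative placeholders taken from the examples later on this page:

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")
model.compile(num_cores=14, device_group=[0])
model.generate(prompts=["Hi there!!"], device_id=[0])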
- compile(num_cores: int, device_group: List[int] | None = None, batch_size: int = 1, prompt_len: int = 32, ctx_len: int = 128, mxfp6: bool = True, mxint8: bool = False, mos: int = -1, aic_enable_depth_first: bool = False) str[source]
This method compiles the exported ONNX model using the Cloud AI 100 Platform SDK compiler binary found at /opt/qti-aic/exec/qaic-exec and generates a qpc package. If the model has not been exported yet, this method will handle the export process. The generated qpc can be found under the directory efficient-transformers/qeff_models/{self.model_card_name}/qpc.
Mandatory Args:
- num_cores (int):
Number of cores used to compile the model.
- device_group (List[int]):
If this is a list of more than one integer, tensor slicing is invoked. Defaults to None, in which case a suitable device is chosen automatically.
Optional Args:
- model_card_name (Optional[str], optional):
Name of the model. Mandatory if self.pretrained_model_name_or_path is a path. Defaults to None.
- batch_size (int, optional):
Batch size. Defaults to 1.
- prompt_len (int, optional):
The length of the prefill prompt should be less than prompt_len. Defaults to 32.
- ctx_len (int, optional):
Maximum context (ctx) that the compiled model can remember. Defaults to 128.
- mxfp6 (bool, optional):
Whether to use mxfp6 compression for weights. Defaults to True.
- mxint8 (bool, optional):
Whether to use mxint8 compression for the KV cache. Defaults to False.
- mos (int, optional):
Effort level to reduce on-chip memory. Defaults to -1, meaning no effort.
- aic_enable_depth_first (bool, optional):
Enables DFS with default memory size.
Defaults to False.
- Returns:
- str:
Path of the compiled qpc package.
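As a hedged illustration, a compile call that spells out the defaults described above might look like this (the core count and device IDs are placeholders):

qpc_path = model.compile(
    num_cores=14,
    device_group=[0],
    batch_size=1,
    prompt_len=32,
    ctx_len=128,
    mxfp6=True,
    mxint8=False,
)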
- export() str[source]
Exports the model to ONNX format using torch.onnx.export. The model should already be transformed, i.e. self.is_transformed should be True. Otherwise, this will raise an AssertionError; we currently don't support exporting non-transformed models. Please refer to the convert_to_cloud_bertstyle function in the Low-Level API for a legacy function that supports this.
Optional Args:
Does not take any arguments.
- Raises:
- AttributeError:
If pretrained_model_name_or_path is a path, this function needs the model card name so that it can distinguish between directories while saving the generated ONNX files. In that case, the user needs to pass model_card_name as a valid string; otherwise this error is raised.
- Returns:
- str:
Path of the generated ONNX graph.
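Since export expects a transformed model (self.is_transformed must be True), a typical sketch is to call transform first; this is an illustration of the ordering implied by the docstring, not the only valid flow:

model.transform()
onnx_path = model.export()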
- export_and_compile(num_cores: int, device_group: List[int], batch_size: int = 1, prompt_len: int = 32, ctx_len: int = 128, mxfp6: bool = True, mxint8: bool = False, mos: int = -1, aic_enable_depth_first: bool = False, qpc_dir_suffix: str | None = None, full_batch_size: int | None = None) str[source]
This API is specific to an internal VLLM use case and is not recommended for use in your application unless you are using VLLM.
- classmethod from_pretrained(*args, **kwargs)
- generate(prompts: List[str], device_id: List[int] | None = None, runtime: str = 'AI_100', **kwargs)[source]
This method generates output until eos or generation_len by executing the compiled qpc on Cloud AI 100 hardware cards. This is a sequential execution based on the batch_size of the compiled model and the number of prompts passed. If the number of prompts cannot be divided by the batch_size, the last unfulfilled batch will be dropped.
Mandatory Args:
- prompts (List[str]):
List of prompts to run the execution.
- device_id (List[int]):
IDs of the devices for running the qpc. Pass [0] for a normal model, or e.g. [0, 1, 2, 3] for a tensor-sliced model.
Optional Args:
- runtime (str, optional):
Only the AI_100 runtime is supported as of now; ONNXRT and PyTorch are coming soon. Defaults to "AI_100".
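A minimal sketch of a generate call on a compiled model, assuming two illustrative prompts and a single device:

model.generate(prompts=["Hello!", "What is the capital of France?"], device_id=[0])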
- property tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast
Returns the tokenizer for the given model based on self.pretrained_model_name_or_path. Loads the tokenizer if required.
- Returns:
- Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
Tokenizer from transformers for the given model.
- transform(**kwargs)[source]
This method applies all relevant optimization transforms on the model and toggles the self.is_transformed attribute to True. If the model is already transformed, the method will simply return. Please note that this method does not require any input arguments.
- Returns:
- obj:
Same object with transformed self.model.
QEffAutoPeftModelForCausalLM
- class QEfficient.peft.auto.QEffAutoPeftModelForCausalLM(model: Module)[source]
QEff class for loading models with PEFT adapters (only LoRA is currently supported). Once exported and compiled for one adapter, the same artifacts can be reused for another adapter with the same base model and adapter config.
- Args:
- model (nn.Module):
PyTorch model
from QEfficient import QEffAutoPeftModelForCausalLM

m = QEffAutoPeftModelForCausalLM.from_pretrained("predibase/magicoder", "magicoder")
m.export()
m.compile(prefill_seq_len=32, ctx_len=1024)

inputs = ...  # A coding prompt
outputs = m.generate(**inputs)

inputs = ...  # A math prompt
m.load_adapter("predibase/gsm8k", "gsm8k")
m.set_adapter("gsm8k")
outputs = m.generate(**inputs)
- property active_adapter: str
Currently active adapter to be used for inference
- compile(onnx_path: str | None = None, compile_dir: str | None = None, *, batch_size: int = 1, prefill_seq_len: int, ctx_len: int, num_devices: int = 1, num_cores: int = 16, mxfp6_matmul: bool = False, mxint8_kv_cache: bool = False, **compiler_options) str[source]
Compile the exported ONNX model to run on AI 100.
- Args:
- onnx_path (str):
Onnx file to compile
- compile_dir (str):
Directory path to compile the qpc. A suffix is added to the directory path to avoid reusing same qpc for different parameters.
- batch_size (int):
Batch size to compile for.
Defaults to 1.
- prefill_seq_len (int):
Prefill sequence length to compile for. The prompt will be chunked according to this length.
- ctx_len (int):
Context length to allocate space for KV-cache tensors.
- num_devices (int):
Number of devices to compile for.
Defaults to 1.
- num_cores (int):
Number of cores to utilize in each device.
Defaults to 16.
- mxfp6_matmul (bool):
Use MXFP6 to compress weights for MatMul nodes to run faster on device.
Defaults to False.
- mxint8_kv_cache (bool):
Use MXINT8 to compress the KV cache on device to access and update it faster.
Defaults to False.
- compiler_options:
Pass any compiler option as input. Any flag that is supported by qaic-exec can be passed. Params are converted to flags as below:
- aic_num_cores=16 -> -aic-num-cores=16
- convert_to_fp16=True -> -convert-to-fp16
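For illustration, the flag conversion above means extra keyword arguments are simply forwarded to qaic-exec; a hedged sketch with arbitrary option values, continuing the class example above:

m.compile(prefill_seq_len=32, ctx_len=1024, num_cores=16, convert_to_fp16=True)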
- export(export_dir: str | None = None) str[source]
Export the PyTorch model to ONNX.
- Args:
- export_dir (str):
Specify the export directory. The export_dir will be suffixed with a hash corresponding to current model.
- classmethod from_pretrained(pretrained_name_or_path: str, *args, **kwargs)[source]
- Args:
- pretrained_name_or_path (str):
Model card name from huggingface or local path to model directory.
- args, kwargs:
Additional arguments to pass to peft.AutoPeftModelForCausalLM.
- generate(inputs: Tensor | ndarray | None = None, generation_config: GenerationConfig | None = None, stopping_criteria: StoppingCriteria | None = None, streamer: BaseStreamer | None = None, **kwargs) ndarray[source]
Generate tokens from the compiled binary. This method takes the same parameters as the HuggingFace model.generate() method.
- Args:
- inputs:
input_ids
- generation_config:
Merge this generation_config with the model-specific config for the current generation.
- stopping_criteria:
Pass custom stopping_criteria to stop at a specific point in generation.
- streamer:
Streamer to put the generated tokens into.
- kwargs:
Additional parameters for generation_config or to be passed to the model while generating.
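A hedged usage sketch, continuing the class example above; base_model_name is a hypothetical placeholder for the adapter's base model card, and the prompt and generation settings are arbitrary:

from transformers import AutoTokenizer, GenerationConfig

# base_model_name is a placeholder for the adapter's base model card
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
inputs = tokenizer("def fibonacci(n):", return_tensors="np")
outputs = m.generate(**inputs, generation_config=GenerationConfig(max_new_tokens=32))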
export
- QEfficient.exporter.export_hf_to_cloud_ai_100.qualcomm_efficient_converter(model_name: str, model_kv: QEFFBaseModel | None = None, local_model_dir: str | None = None, tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast | None = None, cache_dir: str | None = None, onnx_dir_path: str | None = None, hf_token: str | None = None, seq_length: int = 32, kv: bool = True, form_factor: str = 'cloud', full_batch_size: int | None = None) Tuple[str, str][source]
This method is an alias for QEfficient.export.
Usage 1: This method can be used by passing model_name and local_model_dir or cache_dir if required for loading from a local directory. This will download the model from HuggingFace, export it to an ONNX graph, and return the paths of the generated files (see below).
Usage 2: You can pass model_name and model_kv as an object of QEfficient.QEFFAutoModelForCausalLM; in this case it will directly export model_kv.model to ONNX.
We will be deprecating this function; it will be replaced by QEffAutoModelForCausalLM.export.
Mandatory Args:
- model_name (str):
The name of the model to be used.
Optional Args:
- model_kv (torch.nn.Module):
Transformed KV torch model to be used. Defaults to None.
- local_model_dir (str):
Path of the local model. Defaults to None.
- tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):
Model tokenizer. Defaults to None.
- cache_dir (str):
Path of the cache directory. Defaults to None.
- onnx_dir_path (str):
Path to store the ONNX file. Defaults to None.
- hf_token (str):
HuggingFace token to access gated models. Defaults to None.
- seq_length (int):
The length of the sequence. Defaults to 32.
- kv (bool):
If False, it will export in BERT style. Defaults to True.
- form_factor (str):
Form factor of the hardware; currently only cloud is accepted. Defaults to cloud.
- Returns:
- Tuple[str, str]:
Path to the base ONNX dir and path to the generated ONNX model.
import QEfficient

base_path, onnx_model_path = QEfficient.export(model_name="gpt2")
compile
- QEfficient.compile.compile_helper.compile(onnx_path: str, qpc_path: str, num_cores: int, device_group: List[int] | None = None, aic_enable_depth_first: bool = False, mos: int = -1, batch_size: int = 1, prompt_len: int = 32, ctx_len: int = 128, mxfp6: bool = True, mxint8: bool = False, custom_io_file_path: str | None = None, full_batch_size: int | None = None, **kwargs) str[source]
Compiles the given ONNX model using the Cloud AI 100 platform SDK compiler and saves the compiled qpc package at qpc_path. Generates a tensor-slicing configuration if multiple devices are passed in device_group.
This function will be deprecated soon and will be replaced by QEFFAutoModelForCausalLM.compile.
Mandatory Args:
- onnx_path (str):
Generated ONNX model path.
- qpc_path (str):
Path for saving compiled qpc binaries.
- num_cores (int):
Number of cores to compile the model on.
Optional Args:
- device_group (List[int]):
Used for finding the number of devices to compile for. Defaults to None.
- aic_enable_depth_first (bool):
Enables DFS with default memory size. Defaults to False.
- mos (int):
Effort level to reduce the on-chip memory. Defaults to -1.
- batch_size (int):
Batch size to compile the model for. Defaults to 1.
- full_batch_size (int):
Set full batch size to enable continuous batching mode. Defaults to None.
- prompt_len (int):
Prompt length to compile the model for. Defaults to 32.
- ctx_len (int):
Maximum context length to compile the model for. Defaults to 128.
- mxfp6 (bool):
Enable compilation for MXFP6 precision. Defaults to True.
- mxint8 (bool):
Compress Present/Past KV to MXINT8 using a CustomIO config. Defaults to False.
- custom_io_file_path (str):
Path to the CustomIO file (formatted as a string). Defaults to None.
- Returns:
- str:
Path to the compiled qpc package.
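A hedged sketch of calling this function directly, reusing the onnx_model_path returned by QEfficient.export above; the qpc output directory, core count, and devices are placeholders:

import QEfficient

qpc_path = QEfficient.compile(onnx_path=onnx_model_path, qpc_path="qpc", num_cores=14, device_group=[0])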
Execute
- class QEfficient.generation.text_generation_inference.CloudAI100ExecInfo(batch_size: int, generated_texts: List[str] | List[List[str]], generated_ids: List[ndarray] | ndarray, prefill_time: float, decode_perf: float, total_perf: float, total_time: float)[source]
Bases: object
Holds all the information about Cloud AI 100 execution.
- Args:
- batch_size (int):
Batch size of the QPC compilation.
- generated_texts (Union[List[List[str]], List[str]]):
Generated text(s).
- generated_ids (Union[List[np.ndarray], np.ndarray]):
Generated IDs.
- prefill_time (float):
Time for prefilling.
- decode_perf (float):
Decoding performance.
- total_perf (float):
Total performance.
- total_time (float):
Total time.
- QEfficient.generation.text_generation_inference.cloud_ai_100_exec_kv(tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast, qpc_path: str, prompt: str | None = None, prompts_txt_file_path: str | None = None, device_id: List[int] | None = None, generation_len: int | None = None, enable_debug_logs: bool = False, stream: bool = True, write_io_dir: str | None = None, automation=False, full_batch_size: int | None = None)[source]
This method generates output until eos or generation_len by executing the compiled qpc on Cloud AI 100 hardware cards. This is a sequential execution based on the batch_size of the compiled model and the number of prompts passed. If the number of prompts cannot be divided by the batch_size, the last unfulfilled batch will be dropped.
Mandatory Args:
- tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]):
Model tokenizer.
- qpc_path (str):
Path to the saved generated binary file after compilation.
Optional Args:
- prompt (str):
Sample prompt for the model text generation. Defaults to None.
- prompts_txt_file_path (str):
Path of the prompt text file. Defaults to None.
- generation_len (int):
Maximum number of tokens to be generated. Defaults to None.
- device_id (List[int]):
Device IDs to be used for execution. If len(device_id) > 1, it enables a multi-card setup. If None, the auto-device-picker will be used. Defaults to None.
- enable_debug_logs (bool):
If True, it enables debugging logs. Defaults to False.
- stream (bool):
If True, enable the streamer, which returns tokens one by one as the model generates them. Defaults to True.
- write_io_dir (str):
Path to write the input and output files. Defaults to None.
- automation (bool):
If True, it prints input, output, and performance stats. Defaults to False.
- Returns:
- CloudAI100ExecInfo:
Object holding execution output and performance details.
import os

import transformers

import QEfficient

base_path, onnx_model_path = QEfficient.export(model_name="gpt2")
qpc_path = QEfficient.compile(onnx_path=onnx_model_path, qpc_path=os.path.join(base_path, "qpc"), num_cores=14, device_group=[0])
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
execinfo = QEfficient.cloud_ai_100_exec_kv(tokenizer=tokenizer, qpc_path=qpc_path, prompt="Hi there!!", device_id=[0])
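The returned object exposes the fields documented for CloudAI100ExecInfo above; as a short follow-up sketch:

print(execinfo.generated_texts)
print(execinfo.total_perf)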