API

The Cloud AI 100 SDK exposes three API surfaces for running inference:

Python API
C++ API
OnnxRT API
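As a quick illustration of the OnnxRT path, the sketch below uses the standard ONNX Runtime Python API to load a model and run a single inference. The execution-provider name string, model path, and input handling are assumptions for illustration only; see the QAIC execution provider page for the exact registration string and provider options.

```python
# Minimal sketch: running an ONNX model through ONNX Runtime with a
# device execution provider, falling back to CPU. The provider name
# "QAICExecutionProvider" is a placeholder assumption, not confirmed here.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder model path
    providers=["QAICExecutionProvider", "CPUExecutionProvider"],
)

# Build a dummy input matching the model's first input tensor,
# substituting 1 for any symbolic (dynamic) dimensions.
input_meta = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in input_meta.shape]
dummy = np.zeros(shape, dtype=np.float32)

outputs = session.run(None, {input_meta.name: dummy})
print([o.shape for o in outputs])
```

The Python and C++ APIs documented in the sections above provide lower-level control over device selection, buffer management, and execution than the OnnxRT path shown here.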