Glossary¶
- Accelerator¶
 A device with specialized processors such as GPUs dedicated to AI computation.
- Accuracy¶
 A measure of the percentage of correct predictions made by a model.
- Activation¶
 The output of a node’s activation function, passed as an input to the subsequent layer of the network.
- Activation Quantization¶
 The process of converting the output values (activations) of nodes from high precision (for example, 32-bit floating point) to lower precision (for example, 8-bit integer), reducing computation and memory requirements during inference.
- AdaRound¶
 A technique that minimizes quantization error by learning how to round each weight rather than always rounding to the nearest value. AdaRound is especially effective at retaining the accuracy of models that undergo aggressive quantization.
- AI Model Efficiency Toolkit¶
 An open-source software library developed by the Qualcomm Innovation Center, providing a suite of quantization and compression technologies that reduce the computational load and memory usage of deep learning models.
- AIMET¶
 AI Model Efficiency Toolkit.
 - AutoQuant¶
 A feature that automatically chooses optimal quantization parameters to automate the process of model quantization.
- Batch Normalization¶
 A technique for normalizing a layer’s input to accelerate the convergence of deep network models.
- BN¶
 Batch Normalization.
 - Batch Normalization Folding (BN Folding)¶
 A model optimization technique that merges Batch Normalization layers into adjacent convolutional or linear layers, eliminating the need to compute Batch Normalization during inference.
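As a sketch of the idea (illustrative only, assuming a fully connected layer; not AIMET's API), the Batch Normalization statistics and learned parameters can be absorbed into the preceding layer's weights and bias:

```python
import numpy as np

def fold_batch_norm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN (y = gamma * (x - mean) / sqrt(var + eps) + beta)
    into the preceding linear layer (x = W @ inp + b)."""
    std = np.sqrt(var + eps)
    W_fold = W * (gamma / std)[:, None]          # scale each output row
    b_fold = (b - mean) * gamma / std + beta     # absorb shift into the bias
    return W_fold, b_fold

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.random(4) + 0.1

x = rng.normal(size=3)
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
W_f, b_f = fold_batch_norm(W, b, gamma, beta, mean, var)
y_fold = W_f @ x + b_f  # identical output, with the BN layer eliminated
```

The folded layer produces the same outputs as the original layer-plus-BN pair, so inference skips the normalization entirely.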
- CNN¶
 Convolutional Neural Network.
 - Compression¶
 The process of reducing the memory footprint and computational requirements of a neural network.
- Convolutional Layer¶
 A model layer that contains a set of filters that interact with an input to create an activation map.
- Convolutional Neural Network¶
 A deep learning model that uses convolutional layers to extract features from input data, such as images.
- Device¶
 A portable computation platform such as a mobile phone or a laptop.
- DLF¶
 Dynamic Layer Fusion.
- Dynamic Layer Fusion¶
 A method for merging adjacent layers to decrease computational load during inference.
- Edge device¶
 A device at the “edge” of the network. Typically a personal computation device such as a mobile phone or a laptop.
- Encoding¶
 The representation of model parameters (weights) and activations in a compressed, quantized format. Different encoding schemes embody tradeoffs between model accuracy and efficiency.
- FP32¶
 32-bit floating-point precision, the default data type for representing weights and activations in most deep learning frameworks.
- Inference¶
 The process of employing a trained AI model for its intended purpose: prediction, classification, content generation, etc.
- INT8¶
 8-bit integer precision, commonly used by AIMET to reduce the memory size and computational demands during inference.
- KL Divergence¶
 Kullback-Leibler Divergence. A measure of the difference between two probability distributions. Used during quantization calibration to maintain a similar distribution of activations to the original floating-point model.
- Layer¶
 How nodes are organized in a model. The nodes in a layer are connected to the previous and subsequent layers via weights.
- Layer-wise quantization¶
 A quantization method where each layer is quantized independently. Used to achieve balance between model accuracy and computational efficiency by more aggressively compressing layers that have minimal impact on model performance.
- LoRA¶
 Low-Rank Adaptation. A parameter-efficient fine-tuning technique that adapts a model by training small low-rank matrices added to its weights, leaving the original weights frozen.
- MobileNet¶
 A family of convolutional neural network architectures developed at Google, optimized to operate efficiently with constrained computational resources.
- Model¶
 A computational structure made up of layers of nodes connected by weights.
- Neural Network Compression Framework¶
 Another compression and optimization toolkit similar to AIMET.
- Node¶
 A computation unit in a model. Each node performs a mathematical function on an input to produce an output.
- Normalization¶
 Scaling values, such as a layer's inputs, to standardize their range.
- NNCF¶
 Neural Network Compression Framework.
 - ONNX¶
 - Open Neural Network Exchange¶
 An open-source format for the representation of neural network models across different AI frameworks.
- Per-channel Quantization¶
 A quantization method where each channel of a convolutional layer is quantized independently, reducing the quantization error compared to a global quantization scheme.
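A toy comparison (illustrative only, not AIMET code) showing why a separate scale per channel reduces round-trip error when channel magnitudes differ:

```python
import numpy as np

def quant_error(w, scale):
    """Round-trip error for symmetric int8 quantization with the given scale(s)."""
    q = np.clip(np.round(w / scale), -127, 127)
    return np.abs(q * scale - w)

rng = np.random.default_rng(1)
# Two channels with very different weight magnitudes.
w = np.stack([rng.normal(scale=1.0, size=64), rng.normal(scale=0.01, size=64)])

per_tensor = quant_error(w, np.abs(w).max() / 127)                   # one global scale
per_channel = quant_error(w, (np.abs(w).max(axis=1) / 127)[:, None]) # one scale per channel
```

With a single global scale, the large-magnitude channel dictates the step size and the small-magnitude channel's weights mostly collapse to zero; per-channel scales fit each channel's range, so the mean error drops.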
- Post-Training Quantization¶
 A technique for applying quantization to a neural network after it has been trained using full-precision data, avoiding the need for retraining.
- Pruning¶
 Systematically removing less important neurons, weights, or connections from a model.
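A common simple criterion is weight magnitude. A minimal sketch (not AIMET's pruning API; the function name is hypothetical):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitudes."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]  # k-th smallest magnitude
    return np.where(np.abs(w) <= threshold, 0.0, w)

rng = np.random.default_rng(2)
w = rng.normal(size=(8, 8))
w_pruned = magnitude_prune(w, 0.5)  # about half the weights become zero
```

The resulting sparse weight matrix can be stored and computed with more efficiently; in practice pruning is usually followed by fine-tuning to recover accuracy.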
- PTQ¶
 Post-Training Quantization.
 - PyTorch¶
 An open-source deep learning framework developed by Facebook’s AI Research lab (FAIR), widely used in research environments.
- QAT¶
 Quantization-Aware Training.
- QDO¶
 Quantize and dequantize operations.
- Qualcomm Innovation Center¶
 A division of Qualcomm, Inc. responsible for developing advanced technologies and open-source projects, including AIMET.
- Quantization¶
 A model compression technique that reduces the bits used to represent each weight and activation in a neural network, typically from floating-point 32-bit numbers to 8-bit integers.
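As an illustration of the idea (a minimal sketch, not AIMET's implementation; the function names are hypothetical), affine quantization maps a floating-point range onto 8-bit integers via a scale and zero point:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine (asymmetric) quantization of a float array to unsigned integers."""
    qmax = 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / qmax or 1.0  # avoid division by zero for constant inputs
    zero_point = round(-x_min / scale)     # integer to which 0.0 maps
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integers back to approximate floating-point values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize(x)
x_hat = dequantize(q, s, z)  # close to x, within one quantization step
```

Each value is recovered to within one quantization step (the scale), which is the precision cost traded for the 4× reduction in storage from FP32 to INT8.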
- Quantization-Aware Training¶
 A technique in which quantization is simulated throughout the training process so that the network adapts to the lower precision during training.
- Quantization Simulation¶
 A tool within AIMET that simulates the effects of quantization on a model to predict how quantization will affect the model’s performance.
- QuantSim¶
 Quantization Simulation.
- QUIC¶
 Qualcomm Innovation Center.
 - Target Hardware Accelerator¶
 Specialized hardware designed to accelerate AI inference tasks. Examples include GPUs, TPUs, and custom ASICs, for example Qualcomm’s Cloud AI 100 inference accelerator.
- Target Runtime¶
 The software environment on the target platform, typically an edge device, that executes the quantized model at low bitwidth.
- TensorFlow¶
 A widely-used open-source deep learning framework developed by Google.
- TorchScript¶
 An intermediate representation for PyTorch models that enables running them independently of the Python environment, making them more suitable for production deployment.
- Variant¶
 The combination of machine learning framework (PyTorch, TensorFlow, or ONNX) and processor (Nvidia GPU or CPU) that determines which version of the AIMET API to install.
- Weights¶
 Parameters that collectively represent features in a model.