Quantization¶
Qualcomm Cloud AI accelerators support several quantization techniques. A quantized model is a neural network whose weights (and sometimes activations) are stored in a lower-precision numerical format such as INT8, FP8, or MXFP6.
Quantization serves several purposes:
- Reduce DRAM footprint.
- Reduce memory bandwidth requirements.
- Increase throughput.
- Improve deployment efficiency on accelerators.
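The footprint and bandwidth savings are easy to estimate. The back-of-envelope sketch below assumes the OCP MX layout of 32 elements sharing one 8-bit scale; the 7B-parameter model size is an illustrative choice, not a reference to any specific model.

```python
# Back-of-envelope: weight footprint of a hypothetical 7B-parameter model.
PARAMS = 7_000_000_000

fp16_bytes = PARAMS * 16 / 8

# MX formats group 32 elements under one shared 8-bit scale,
# so MXFP6 costs 6 + 8/32 = 6.25 bits per element.
mxfp6_bits_per_elem = 6 + 8 / 32
mxfp6_bytes = PARAMS * mxfp6_bits_per_elem / 8

print(f"FP16:  {fp16_bytes / 1e9:.1f} GB")   # 14.0 GB
print(f"MXFP6: {mxfp6_bytes / 1e9:.1f} GB")  # ~5.5 GB
```

Roughly a 2.5x reduction in weight storage, which translates directly into lower DRAM usage and fewer bytes moved per inference.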
MXFP6¶
- FP32/FP16 weights are compressed to MXFP6 during the offline compile phase.
- Reduces weight storage (DRAM) and memory bandwidth.
- Automatic decompression happens in hardware/software with minimal overhead.
- Compute remains in FP16 for accuracy stability.
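As a rough illustration of the compile-time path, the NumPy sketch below simulates block-scaled 6-bit elements: each block of 32 weights shares a power-of-two scale, and each element keeps only a 4-bit significand (approximating an FP6 e2m3 element; the real format's exponent clamping, subnormals, and hardware decompression are omitted). All function names are hypothetical.

```python
import numpy as np

def quantize_mantissa(x: np.ndarray, sig_bits: int = 4) -> np.ndarray:
    """Round each value to `sig_bits` significand bits (implicit
    leading 1 included), mimicking the mantissa loss of an FP6-style
    element. Exponent-range limits of the real format are ignored."""
    m, e = np.frexp(x)                        # x = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 2**sig_bits) / 2**sig_bits
    return np.ldexp(m, e)

def mxfp6_compress_sim(w: np.ndarray, block: int = 32) -> np.ndarray:
    """Toy offline-compile simulation: per-block shared power-of-two
    scale plus mantissa-rounded elements. Length must be a multiple
    of `block`. Compute later runs on the decompressed FP16 values."""
    w = w.reshape(-1, block)
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = 2.0 ** np.floor(np.log2(np.maximum(max_abs, 1e-38)))
    q = quantize_mantissa(w / scale)          # 6-bit-style elements
    return (q * scale).astype(np.float16)     # decompressed FP16 weights
```

With a 4-bit significand, the per-element relative error stays within a few percent, which is why compute can remain in FP16 with little accuracy impact.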
MXINT8¶
- Designed to reduce cross-card communication latency.
- FP16 activations can often be directly cast to MXINT8 with negligible accuracy loss.
- Useful for multi-card LLM scaling.
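To make the cast concrete, here is a toy MXINT8-style quantizer in NumPy: each block of 32 activations shares one power-of-two scale and stores INT8 elements. This is an illustrative sketch, not the accelerator's actual kernel, and the function names are hypothetical.

```python
import numpy as np

def mxint8_quantize(x: np.ndarray, block: int = 32):
    """Toy MXINT8-style block quantization: each block of `block`
    values shares one power-of-two scale and stores int8 elements.
    Length must be a multiple of `block`."""
    x = x.reshape(-1, block)
    max_abs = np.max(np.abs(x), axis=1, keepdims=True)
    # Shared power-of-two scale chosen so the largest value fits in int8.
    exp = np.ceil(np.log2(np.maximum(max_abs, 1e-38) / 127.0))
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def mxint8_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale
```

Because the round trip only costs half a quantization step per element, FP16 activations can usually be sent across cards in this form and restored with negligible loss, halving the bytes on the interconnect.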
AWQ¶
- Upscale AWQ-quantized weights to FP32.
- Re-quantize to MXFP6 during compile.
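A minimal sketch of the upscale step, assuming an AWQ-style layout in which unsigned 4-bit codes share one scale and zero point per group of 128 weights (names and layout are illustrative, not a specific checkpoint format). The resulting FP32 tensor is then re-quantized to MXFP6 by the compiler as described above.

```python
import numpy as np

def awq_dequantize(q4: np.ndarray, scales: np.ndarray,
                   zeros: np.ndarray, group: int = 128) -> np.ndarray:
    """Upscale AWQ-style INT4 weights to FP32.
    `q4` holds unsigned 4-bit codes (0..15); each group of `group`
    weights shares one scale and zero point (illustrative layout).
    Returns an array of shape (num_groups, group)."""
    q4 = q4.reshape(-1, group).astype(np.float32)
    return (q4 - zeros) * scales   # per-group params broadcast over rows
```

Example: with `scale = 0.1` and `zero = 8`, code `0` maps to `-0.8` and code `8` maps to `0.0`; the FP32 output then flows into the MXFP6 compression pass at compile time.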