aimet_torch.experimental.spinquant

Top-level APIs

aimet_torch.experimental.spinquant.apply_spinquant(model)

Apply SpinQuant rotation transforms to a transformer-based language model.

SpinQuant applies orthogonal Hadamard rotations to model weights to reduce quantization error. This method modifies the model in-place by:

  1. Fusing RMS normalization layers into subsequent linear layers

  2. Applying R1 Hadamard rotations to embeddings, attention, and MLP layers

  3. Merging all transforms into the weight matrices (see the sketch after this list)
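
Steps 2 and 3 rely on the fact that an orthogonal matrix inserted between two layers cancels out mathematically while reshaping the weight distributions into a form that quantizes with less error. The snippet below is a minimal, illustrative sketch of that identity using a random orthogonal matrix and two toy linear weights; it is not the library's internal implementation, and all names and sizes are made up.

>>> import torch
>>> d = 8
>>> R, _ = torch.linalg.qr(torch.randn(d, d))      # random orthogonal rotation (stand-in for a Hadamard matrix)
>>> W1, W2 = torch.randn(d, d), torch.randn(d, d)  # two consecutive linear-layer weights
>>> x = torch.randn(1, d)
>>> y_ref = (x @ W1.T) @ W2.T                      # original two-layer output
>>> W1_rot = R.T @ W1                              # merge R into the output side of layer 1
>>> W2_rot = W2 @ R                                # merge R^T into the input side of layer 2
>>> y_rot = (x @ W1_rot.T) @ W2_rot.T              # rotated weights compute the same function
>>> torch.allclose(y_ref, y_rot, atol=1e-5)
True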

Supported architectures:
  • LLaMA

  • Qwen2, Qwen3

  • Phi3

  • Qwen2.5-VL (Vision-Language Model)

Parameters:

model (Module) – A HuggingFace transformer model (e.g., LlamaForCausalLM, Qwen2ForCausalLM). The model must have untied embed_tokens and lm_head weights.

Raises:

RuntimeError – If embed_tokens and lm_head weights are tied.
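
Whether untying is needed can be checked up front. The following is a small sketch, assuming a `model` loaded as in the example below; it relies on the standard HuggingFace `get_input_embeddings`/`get_output_embeddings` accessors and the `tie_word_embeddings` config flag.

>>> tied = model.get_output_embeddings().weight is model.get_input_embeddings().weight
>>> # Most HuggingFace configs also expose a flag for this
>>> tied = tied or getattr(model.config, "tie_word_embeddings", False)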

Example

>>> import torch
>>> from transformers import AutoModelForCausalLM
>>> from aimet_torch.experimental.spinquant import apply_spinquant
>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
>>> # Untie embedding and lm_head weights if they are tied
>>> old_weight = model.lm_head.weight
>>> model.lm_head.weight = torch.nn.Parameter(
...     old_weight.detach().clone(),
...     requires_grad=old_weight.requires_grad,
... )
>>> apply_spinquant(model)
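
Because the rotations are orthogonal and are fully merged into the weights, the transformed model should compute the same function as the original up to floating-point error. A rough way to sanity-check this is to capture reference logits just before the apply_spinquant(model) call above and compare them afterwards; the input shape and tolerance below are illustrative only.

>>> input_ids = torch.randint(0, model.config.vocab_size, (1, 8))
>>> with torch.no_grad():
...     ref_logits = model(input_ids).logits.clone()   # logits before applying SpinQuant
>>> apply_spinquant(model)
>>> with torch.no_grad():
...     rot_logits = model(input_ids).logits           # logits after the rotation merge
>>> max_err = (ref_logits - rot_logits).abs().max()    # expected to be small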