Efficient Transformer Library - 1.21.0 Release Notes
Welcome to the official release of Efficient Transformer Library v1.21.0! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.
✅ All features and models listed below are available on the release/v1.21.0 branch and mainline.
Newly Supported Models
Flux (Diffusers - Image Generation)
Diffusion-based image generation model
WAN (Diffusers - Video Generation)
Diffusion-based video generation model with Lightning support for distributed inference
Qwen2.5-VL (Vision Language)
Executable via QEFFAutoModelForImageTextToText
Multi-image prompt support
Continuous batching enabled
Mistral 3.1 (24B)
Executable via QEFFAutoModelForImageTextToText
Disaggregated serving ready via vLLM
GPT-OSS
Note: If running GPT-OSS models natively via vLLM, PR-685 of the qefficient library is required for Python 3.12 compatibility.
Executable via QEffAutoModelForCausalLM
Separate prefill and decode compilation supported
Disaggregated serving ready
Olmo2
Executable via QEffAutoModelForCausalLM
Full CausalLM support with optimizations
Refer to the Text Generation Example Scripts for usage details; a minimal load/compile/generate sketch follows below.
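For orientation, the sketch below shows how a newly supported CausalLM model such as Olmo2 can typically be loaded, compiled, and run. It assumes the library's QEFFAutoModelForCausalLM from_pretrained / compile / generate flow; the checkpoint name and compile arguments are illustrative, so refer to the example scripts for the exact options.

```python
# Minimal sketch (not the official example script): load, compile, and run a
# newly supported CausalLM model. Model name and compile arguments are illustrative.
from transformers import AutoTokenizer
from QEfficient import QEFFAutoModelForCausalLM

model_id = "allenai/OLMo-2-1124-7B"  # any newly supported CausalLM checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = QEFFAutoModelForCausalLM.from_pretrained(model_id)

# Export to ONNX and compile a QPC for the target device; values shown are examples.
model.compile(prefill_seq_len=128, ctx_len=1024, num_cores=16)

# Run generation on the compiled binary.
model.generate(prompts=["Write a short note on efficient inference."], tokenizer=tokenizer)
```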
Molmo
Executable via QEffAutoModelForCausalLM
Multi-modal capabilities
InternVL 3.5 Series
Executable via QEffAutoModelForCausalLM
Full Vision-Language support
Multi-image handling with continuous batching
Refer to InternVL 3.5 Example Scripts for usage details.
Qwen3-MOE (Mixture of Experts)
Executable via QEffAutoModelForCausalLM
Efficient expert routing
Wav2Vec2 (Audio)
Executable via QEFFAutoModelForCTC
Speech recognition and audio feature extraction
Multilingual-e5-Large (Embedding Model)
Executable via QEffAutoModel
Multilingual text embedding capabilities
Refer to the usage details here; a brief embedding sketch follows below.
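For the embedding model above, a brief sketch of the typical flow, assuming the QEFFAutoModel from_pretrained / compile / generate pattern; the pooling keyword, sequence lengths, and model name are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the official example): embed text with a
# multilingual embedding model. Pooling choice and sequence lengths are assumptions.
from transformers import AutoTokenizer
from QEfficient import QEFFAutoModel

model_id = "intfloat/multilingual-e5-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = QEFFAutoModel.from_pretrained(model_id, pooling="mean")  # assumed pooling kwarg

# Multi-sequence-length compilation: the runtime auto-selects the best-fitting graph.
model.compile(seq_len=[64, 256, 512], num_cores=16)

inputs = tokenizer("query: what does this release add?", return_tensors="pt")
embedding = model.generate(inputs)
```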
Key Features & Enhancements
Framework Upgrades: Transformers 4.55, PyTorch 2.7.0+cpu, Torchvision 0.22.0+cpu
Python Support: Requires Python 3.10
ONNX Opset: Updated to version 17 for broader operator support
Advanced Attention: Flux blocking support, BlockedKV attention for CausalLM models
Diffusers Integration: Full support for diffuser-based image generation and video generation models
Compute-Context-Length (CCL) support: Optimizes throughput when handling very large context lengths
Prefill/Decode Separation: Support for GPT-OSS using disaggregated serving
Continuous Batching (VLMs): Extended to Vision Language Models with multi-image handling
Supported models: Llava, Llava_Next, Gemma3, Mistral3, InternVL2_5, InternVL3_5, Molmo
ONNX Sub-Functions: Enables more efficient model compilation and execution on hardware. Enable it by passing use_onnx_subfunctions=True during export (see the export sketch after this list)
Memory Profiling: Built-in utilities for optimization analysis
On-Device Sampling Extensions: On-device sampling extended to dual-QPC VLMs, with guided decoding support added
ONNX Transform, Memory & Time Optimizations: Faster ONNX transforms and a reduced memory footprint
Removed Platform SDK Dependency: QPC generation now supported on systems without the Platform SDK
Example Scripts Revamp: New example scripts for audio, embeddings, and image-text-to-text tasks
Onboarding Guide: Simplified setup and deployment process for new users
Organized examples into domain-specific subdirectories (see Examples)
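As referenced in the ONNX Sub-Functions item, here is a minimal sketch of passing the flag at export time. Only use_onnx_subfunctions=True comes from this release note; the surrounding from_pretrained / export / compile calls and the model name are the usual pattern, shown as an assumption.

```python
# Sketch: enable ONNX sub-functions during export for more efficient compilation.
# Only use_onnx_subfunctions=True is taken from the release note; everything else is illustrative.
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

# Export the ONNX graph with sub-functions enabled.
model.export(use_onnx_subfunctions=True)

# Compile the exported graph as usual.
model.compile(num_cores=16)
```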
Embedding Model Upgrades
Multi-Sequence Length Support: Auto-selects optimal graph at runtime
Enhanced Pooling: Flexible pooling strategies for various embedding tasks
Fine-Tuning Support
Checkpoint Management: Resume training from a saved epoch with proper state restoration
Enhanced Loss Tracking: Corrected data type handling for accurate loss computation
Custom Dataset Support: Improved handling with better tokenization
Device-Aware Scaling: Optimized GradScaler for multi-device training (see the sketch after this list)
Comprehensive Testing: Unit tests for fine-tuning workflows
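The Device-Aware Scaling item refers to the library's fine-tuning stack; the sketch below only illustrates the general pattern in plain PyTorch. It is not the library's implementation, and the assumption that fp16 loss scaling is needed only on CUDA-like devices is ours.

```python
# Generic PyTorch pattern, not the library's own code: pick a GradScaler that
# matches the device the training loop runs on, so fp16 loss scaling is only
# active where it is actually needed.
import torch

def make_scaler(device_type: str) -> torch.amp.GradScaler:
    # fp16 autocast (e.g. on CUDA) needs loss scaling; bf16/CPU runs can leave it disabled.
    return torch.amp.GradScaler(device_type, enabled=(device_type == "cuda"))

def train_step(model, batch, optimizer, scaler, device_type="cuda"):
    with torch.autocast(device_type=device_type):
        loss = model(**batch).loss
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales gradients, then optimizer.step()
    scaler.update()                 # adjust the scale factor for the next step
    optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```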
Efficient Transformer Library - 1.20.0 Release Notes
Welcome to the official release of Efficient Transformer Library v1.20.0! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.
✅ All features and models listed below are available on the release/v1.20.0 branch and mainline.
Newly Supported Models
Llama-4-Scout-17B-16E-Instruct
Executable via QEFFAutoModelForImageTextToText
Text & Image+Text support
Chunk attention, Single/Dual QPC support
Multi-image prompts enabled via the vLLM interface
Grok-1
Executable via QEffAutoModelForCausalLM
Gemma3
Executable via QEFFAutoModelForImageTextToText
Text & Image+Text support
Sliding window support
SwiftKV (Llama-3.1-SwiftKV-8B-Instruct)
Executable via QEffAutoModelForCausalLM
Supports both continuous and non-continuous batching
GGUF Models
Executable via QEffAutoModelForCausalLM
Execution support (non-quantized)
FP8 Compressed Quantization
Support for Llama-3.3-70B-Instruct-FP8-Dynamic
Key Features & Enhancements
Transformer Upgrade: Now using version 4.51.3
SpD & Multi-Projection Heads: Token speculation via post-attention projections
I/O Encryption: --io-encrypt flag support in compile/infer APIs
Separate Prefill/Decode Compilation: For disaggregated serving
On-Device Sampling: Supported via vLLM, reducing host-device latency for CausalLM models
Embedding Model Upgrades
Flexible Pooling: Choose from standard or custom strategies
Sentence Embedding: Now runs directly on AI100
Multi-Seq Length Compilation: Auto-selects optimal graph at runtime
Fine-Tuning Support
BERT fine-tuning support with templates and documentation
Gradient checkpointing, device-aware GradScaler, and CLI --help added