Efficient Transformer Library - 1.21.0 Release Notes

Welcome to the official release of Efficient Transformer Library v1.21.0! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.

✅ All features and models listed below are available on the release/v1.21.0 branch and mainline.


Newly Supported Models


Key Features & Enhancements

  • Framework Upgrades: Transformers 4.55, PyTorch 2.7.0+cpu, Torchvision 0.22.0+cpu

  • Python Support: Requires Python 3.10

  • ONNX Opset: Updated to version 17 for broader operator support

  • Advanced Attention: Flux blocking support, BlockedKV attention for CausalLM models

  • Diffusers Integration: Full support for Diffusers-based image-generation and video-generation models

  • Compute-Context-Length (CCL) Support: Optimizes throughput when handling very large context lengths

  • Prefill/Decode Separation: Support for GPT OSS using disaggregated serving

  • Continuous Batching (VLMs): Extended to Vision Language Models with multi-image handling

    • Supported models: Llava, Llava_Next, Gemma3, Mistral3, InternVL2_5, InternVL3_5, Molmo

  • ONNX Sub-Functions: Enables more efficient model compilation and execution on hardware; enable it by passing use_onnx_subfunctions=True during export (see the sketch after this list)

  • Memory Profiling: Built-in utilities for profiling memory usage and guiding optimization

  • On-Device Sampling Extensions: On-device sampling extended to dual-QPC VLMs; guided decoding added for on-device sampling

  • ONNX Transform, Memory & Time Optimizations: Faster ONNX transforms with a reduced memory footprint

  • Removed Platform SDK Dependency: QPC generation is now supported on systems without the Platform SDK

  • Example Scripts Revamp: New example scripts for audio, embeddings, and image-text-to-text tasks

  • Onboarding Guide: Simplified setup and deployment process for new users

  • Examples Reorganization: Examples organized into domain-specific subdirectories
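
    For the ONNX sub-functions item above, here is a minimal export sketch. Only the use_onnx_subfunctions=True keyword comes from this release note; the QEFFAutoModelForCausalLM loader, import path, and surrounding workflow are assumptions based on the library's Hugging Face-style API and may differ in your installation.

    ```python
    # Hedged sketch: only use_onnx_subfunctions=True is documented in this release note;
    # the loader class, import path, and export call pattern are assumptions.
    from QEfficient import QEFFAutoModelForCausalLM  # assumed import path

    # Load a causal LM through the library's Hugging Face-style wrapper (assumed API).
    model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

    # Export to ONNX with sub-functions enabled for more efficient compilation and execution.
    onnx_path = model.export(use_onnx_subfunctions=True)
    print(f"Exported ONNX graph with sub-functions to: {onnx_path}")
    ```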


Embedding Model Upgrades

  • Multi-Sequence Length Support: Auto-selects optimal graph at runtime

  • Enhanced Pooling: Flexible pooling strategies for various embedding tasks (see the mean-pooling sketch below)
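
    As one example of the pooling strategies referred to above, the sketch below shows attention-mask-aware mean pooling over token embeddings in plain PyTorch. It illustrates the technique only; the function name and tensor shapes are placeholders, not the library's built-in pooling API.

    ```python
    import torch

    def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        """Mask-aware mean pooling: average token embeddings while ignoring padding.

        token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
        Illustrative helper only, not the library's API.
        """
        mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
        summed = (token_embeddings * mask).sum(dim=1)                   # sum over non-padding tokens
        counts = mask.sum(dim=1).clamp(min=1e-9)                        # avoid division by zero
        return summed / counts

    # Tiny usage example with random data.
    emb = torch.randn(2, 8, 768)
    mask = torch.ones(2, 8, dtype=torch.long)
    mask[1, 5:] = 0                      # second sequence is padded after 5 tokens
    sentence_emb = mean_pool(emb, mask)  # shape: (2, 768)
    ```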


Fine-Tuning Support

  • Checkpoint Management: Resume training from a saved epoch with full state restoration (see the sketch after this list)

  • Enhanced Loss Tracking: Corrected data type handling for accurate loss computation

  • Custom Dataset Support: Improved handling with better tokenization

  • Device-Aware Scaling: Optimized GradScaler for multi-device training

  • Comprehensive Testing: Unit tests for fine-tuning workflows
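
    The checkpoint-resume and device-aware scaling items above follow the standard PyTorch pattern sketched below. This is a generic illustration, not the library's fine-tuning entry point; the model, optimizer, and file path are placeholders.

    ```python
    import os
    import torch

    model = torch.nn.Linear(16, 2)                                      # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.amp.GradScaler(enabled=torch.cuda.is_available())    # device-aware: no-op on CPU
    start_epoch = 0
    ckpt_path = "finetune_ckpt.pt"                                      # placeholder path

    # Resume from a previous epoch with full state restoration, if a checkpoint exists.
    if os.path.exists(ckpt_path):
        ckpt = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        scaler.load_state_dict(ckpt["scaler"])
        start_epoch = ckpt["epoch"] + 1

    for epoch in range(start_epoch, 3):
        # ... training steps using scaler.scale(loss).backward() / scaler.step(optimizer) ...
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "scaler": scaler.state_dict(),
                "epoch": epoch,
            },
            ckpt_path,
        )
    ```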


Efficient Transformer Library - 1.20.0 Release Notes

Welcome to the official release of Efficient Transformer Library v1.20.0! This release introduces advanced attention mechanisms, expanded model support, optimized serving capabilities, and significant improvements to fine-tuning and deployment workflows.

✅ All features and models listed below are available on the release/v1.20.0 branch and mainline.


Newly Supported Models


Key Features & Enhancements

  • Transformers Upgrade: Now using Transformers 4.51.3

  • SpD & Multi-Projection Heads: Speculative decoding (SpD) with token speculation via post-attention projections

  • I/O Encryption: --io-encrypt flag support in compile/infer APIs

  • Separate Prefill/Decode Compilation: For disaggregated serving

  • On-Device Sampling: Supported via vLLM, reducing host-device latency for CausalLM models


Embedding Model Upgrades

  • Flexible Pooling: Choose from standard or custom strategies

  • Sentence Embedding: Now runs directly on AI100

  • Multi-Seq Length Compilation: Auto-selects optimal graph at runtime


Fine-Tuning Support

  • BERT fine-tuning support with templates and documentation

  • Gradient checkpointing, device-aware GradScaler, and CLI --help support added (see the sketch below)
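
    For the gradient checkpointing item above, the sketch below shows the standard Hugging Face Transformers way to enable it on a BERT classifier before fine-tuning, trading extra forward recomputation for lower activation memory. The checkpoint name and label count are placeholders, and this is a generic Transformers pattern rather than this library's fine-tuning CLI.

    ```python
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Placeholder checkpoint and label count; swap in your own fine-tuning task.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Trade compute for memory: activations are recomputed during backward instead of stored.
    model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
    model.train()

    batch = tokenizer(["an example sentence"], return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor([1]))
    outputs.loss.backward()  # checkpointed backward pass
    ```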