Efficient Transformer Library - 1.20.0 Release Notes

Welcome to the official release of Efficient Transformer Library v1.20.0! This release brings a host of new model integrations, performance enhancements, and fine-tuning capabilities to accelerate your AI development.

✅ All features and models listed below are available on the release/1.20.0 branch and mainline.


Newly Supported Models


Key Features & Enhancements

  • Transformers Upgrade: the library now tracks Hugging Face transformers v4.51.3

  • SpD & Multi-Projection Heads: speculative decoding (SpD) with token speculation via post-attention projection heads

  • I/O Encryption: --io-encrypt flag support in the compile/infer APIs (see the compile sketch after this list)

  • Separate Prefill/Decode Compilation: compile the prefill and decode stages independently for disaggregated serving

  • On-Device Sampling: supported via vLLM, reducing host-device transfer latency for CausalLM models (see the sampling sketch below)
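
A minimal sketch of separate prefill/decode compilation combined with I/O encryption follows. The QEFFAutoModelForCausalLM entry point and the prefill_only/io_encrypt keyword names are illustrative assumptions, not confirmed API surface:

    # Illustrative sketch only: QEFFAutoModelForCausalLM, prefill_only,
    # and io_encrypt are assumed names, not confirmed API.
    from QEfficient import QEFFAutoModelForCausalLM

    model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

    # Compile the prefill and decode graphs separately so each stage can
    # be deployed on different devices in a disaggregated serving setup,
    # with inputs/outputs encrypted in both artifacts.
    prefill_qpc = model.compile(prefill_only=True, io_encrypt=True)
    decode_qpc = model.compile(prefill_only=False, io_encrypt=True)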
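
On-device sampling is consumed through vLLM's standard generate path; in this sketch the device string for AI100 is an assumption:

    # vLLM's public API; device="qaic" is an assumed identifier for the
    # AI100 backend. Sampling runs on-device, so only the final tokens
    # cross the host-device boundary.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", device="qaic")
    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=64)

    for out in llm.generate(["Explain KV caching in one sentence."], params):
        print(out.outputs[0].text)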


Embedding Model Upgrades

  • Flexible Pooling: choose a standard pooling strategy or supply a custom one (see the sketch after this list)

  • Sentence Embedding: Now runs directly on AI100

  • Multi-Seq-Length Compilation: compile once for multiple sequence lengths; the optimal graph is auto-selected at runtime
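
A minimal sketch of the embedding flow, assuming a QEFFAutoModel-style class; the pooling keyword and list-valued seq_len are illustrative names, not confirmed API:

    # Illustrative sketch: QEFFAutoModel, pooling=, and list-valued
    # seq_len are assumed names, not confirmed API.
    from QEfficient import QEFFAutoModel

    model = QEFFAutoModel.from_pretrained(
        "sentence-transformers/all-MiniLM-L6-v2",
        pooling="mean",  # a standard strategy; a custom callable could go here
    )

    # Compile one artifact per sequence length; at runtime the smallest
    # graph that fits the incoming batch is expected to be selected.
    model.compile(seq_len=[64, 128, 512])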


Fine-Tuning Support

  • BERT fine-tuning support with templates and documentation

  • Gradient checkpointing, a device-aware GradScaler, and a CLI --help option added (see the sketch below)
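
The fine-tuning additions map onto standard PyTorch and Hugging Face calls. The loop below is a generic illustration of gradient checkpointing plus a device-aware GradScaler, not the library's bundled fine-tuning template:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )
    model.gradient_checkpointing_enable()  # trade recompute for activation memory

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # torch.amp.GradScaler takes the device type directly, which is what
    # makes mixed-precision loss scaling device-aware.
    scaler = torch.amp.GradScaler(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    batch = tokenizer(["a great release"], return_tensors="pt").to(device)
    labels = torch.tensor([1], device=device)

    with torch.autocast(device_type=device):
        loss = model(**batch, labels=labels).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()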