Supported Features
| Feature | Impact |
|---|---|
| Diffusers support | Full support for diffusers-based image generation models such as Stable Diffusion, Imagen, and Videogen, enabling efficient image and video synthesis. |
| GPT-OSS support | Enabled for GPT-OSS models, allowing flexible deployment of large language models across different hardware configurations. |
| Efficient compilation and execution | Enables more efficient model compilation and execution on target hardware (see the basic workflow sketch below the table). |
| Blocked KV cache | Implements a blocked K/V cache layout so attention reads and processes the cache block by block, improving long-context decode performance. |
| Separate prefill/decode context lengths | Optimizes inference by using different context lengths during the prefill and decode phases, reducing memory footprint and computation for shorter sequences while maintaining support for longer contexts. Supports both text-only and vision-language models. Refer to the sample script for more details. |
| Sentence embedding with flexible pooling and multiple sequence lengths | Supports standard and custom pooling with AI 100 acceleration, enabling efficient sentence embeddings via Efficient-Transformers. Compile with one or more seq_len values; the optimal graph is auto-selected at runtime (see the embedding sketch below the table). Refer to the sample script for more details. |
| SpD with multi-projection heads | Implements post-attention hidden-size projections to speculate tokens ahead of the base model. Refer to the sample script for more details. |
| QNN compilation for AutoModel classes | Enables QNN compilation for AutoModel classes, covering multimodal, embedding, and causal models. |
| Disaggregated serving | Supports separate prefill and decode compilation for encoder (vision) and language models. |
| GGUF model execution | Supports GGUF model execution (without quantized weights). Refer to the sample script for more details. |
| Replication of KV heads | Enabled FP8 model support in the replicate_kv_heads script. |
| Gradient checkpointing | Supports gradient checkpointing in the finetuning script. |
| SwiftKV | Reduces computational overhead during inference by optimizing key-value pair processing, improving throughput. Supports both continuous and non-continuous batching execution in SwiftKV. |
| Vision-language models | Supports the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks (see the vision-language sketch below the table). Refer to the sample script for more details. |
| Speech sequence-to-sequence models | Supports QEFFAutoModelForSpeechSeq2Seq, facilitating speech-to-text sequence models. Refer to the sample script for more details. |
| Support for FP8 execution | Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks. |
| Prefill caching | Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency. |
| On Device Sampling | Executes sampling operations directly on the QAIC device rather than the host CPU for QEffForCausalLM models, significantly reducing host-device communication overhead and improving inference throughput and scalability. Refer to the sample script for more details. |
| Prompt-Lookup Decoding | Speeds up text generation by reusing overlapping spans between the input prompt and the generated text, making the process faster without losing quality. Refer to the sample script for more details. |
| PEFT LoRA support | Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer to the sample script for more details. |
| QNN compilation | Enables compilation using the QNN SDK, making QEff adaptable to additional backends in the future. |
| Embedding model support | Facilitates the generation of vector embeddings for retrieval tasks. |
| Speculative decoding | Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency (see the speculative-decoding sketch below the table). Refer to the sample script for more details. |
| Multi-LoRA adapter support | Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch (see the multi-LoRA sketch below the table). Refer to the sample script for more details. |
| Python and C++ inferencing API support | Provides flexibility when running inference with QEff, enabling integration with various applications and improving accessibility for developers. Refer to the sample script for more details. |
| Continuous batching | Optimizes throughput and latency by dynamically batching requests, ensuring efficient use of computational resources (see the continuous-batching sketch below the table). |
| AWQ and GPTQ support | Supports advanced quantization techniques, improving model efficiency and performance on AI 100. |
| Serving successive requests in the same session | An API that yields tokens as they are generated, facilitating seamless integration with various applications and enhancing accessibility for developers. |
| Perplexity calculation | A script for computing the perplexity of a model, allowing for the evaluation of model performance and comparison across different models and datasets. Refer to the sample script for more details. |
| KV heads replication script | A sample script that replicates key-value (KV) heads for the Llama-3-8B-Instruct model, runs inference with the original model, replicates the KV heads, validates the changes, and exports the modified model to ONNX format. Refer to the sample script for more details. |
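
To make the rows above concrete, the sketches below illustrate the high-level Python API that the compilation and execution features refer to. First, the basic workflow: download, compile, and generate with a causal LM. This is a minimal sketch assuming the `QEFFAutoModelForCausalLM` entry point with `compile`/`generate` methods; the model name and core count are illustrative placeholders.

```python
# Minimal sketch: download, compile, and run a causal LM on AI 100.
# Model name and num_cores are illustrative placeholders.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

model_name = "gpt2"  # placeholder model

# Downloads the HF checkpoint and applies QEfficient's model transforms.
model = QEFFAutoModelForCausalLM.from_pretrained(model_name)

# Exports to ONNX and compiles a program binary for the AI 100 device.
model.compile(num_cores=14)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Runs prefill and decode on device, streaming generated tokens.
model.generate(prompts=["Hello, my name is"], tokenizer=tokenizer)
```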
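
For the sentence-embedding row, a sketch of pooling plus multi-`seq_len` compilation. The `pooling` argument and list-valued `seq_len` are assumptions based on the feature description; check the sample script for the exact signature.

```python
# Sketch: sentence embeddings with pooling and multiple sequence lengths.
# pooling= and list-valued seq_len= are assumed parameter names.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModel

model_name = "BAAI/bge-base-en-v1.5"  # placeholder embedding model
model = QEFFAutoModel.from_pretrained(model_name, pooling="mean")

# Compile one binary covering several sequence lengths; the optimal
# graph is auto-selected at runtime from the input length.
model.compile(seq_len=[32, 64], num_cores=16)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("An example sentence to embed.", return_tensors="pt")
embedding = model.generate(inputs)
```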
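
The continuous-batching row is a compile-time choice. A sketch assuming a `continuous_batching` flag at load time and a `full_batch_size` compile argument for the number of concurrent decode slots; both names are assumptions.

```python
# Sketch: continuous batching, where decode slots are freed as
# sequences finish and refilled with waiting prompts.
# continuous_batching= and full_batch_size= are assumed names.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

model_name = "gpt2"  # placeholder
model = QEFFAutoModelForCausalLM.from_pretrained(model_name, continuous_batching=True)

# full_batch_size sets how many sequences decode concurrently.
model.compile(prefill_seq_len=128, ctx_len=256, num_cores=16, full_batch_size=4)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generate(
    prompts=["First prompt", "Second prompt", "Third prompt"],
    tokenizer=tokenizer,
)
```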
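
For speculative decoding, a sketch of the draft/target split: the draft model proposes tokens and the target model verifies them in a single forward pass. The `is_tlm` flag, `num_speculative_tokens` compile argument, and model names are assumptions; see the sample script.

```python
# Sketch: speculative decoding with a draft and a target model.
# is_tlm= and num_speculative_tokens= are assumed names.
from QEfficient import QEFFAutoModelForCausalLM

# Target (verifier) model, compiled to score k speculative tokens
# per decode step in a single forward pass.
target = QEFFAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", is_tlm=True  # placeholder model
)
target.compile(num_cores=16, num_speculative_tokens=4)

# Smaller draft model that proposes the speculative tokens.
draft = QEFFAutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
draft.compile(num_cores=4)
```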
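
For the vision-language row, a sketch assuming a `QEFFAutoModelForImageTextToText` wrapper mirroring the transformers class; the `kv_offload` flag (compiling the vision encoder and language model as separate programs) is an assumption tied to the disaggregated-serving row.

```python
# Sketch: vision-language support via AutoModelForImageTextToText.
# QEFFAutoModelForImageTextToText and kv_offload= are assumed names.
from QEfficient import QEFFAutoModelForImageTextToText

model_name = "llava-hf/llava-1.5-7b-hf"  # placeholder VLM

# kv_offload=True would split the vision encoder and language model
# into separately compiled programs.
model = QEFFAutoModelForImageTextToText.from_pretrained(model_name, kv_offload=True)
model.compile(num_cores=16)
```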
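
Finally, for mixed multi-LoRA inference, a sketch in which several adapters are compiled with one base model and selected per prompt. `QEffAutoLoraModelForCausalLM`, `load_adapter`, and `prompt_to_adapter_mapping` are assumed names derived from the feature description; treat them as hypothetical until checked against the sample script.

```python
# Sketch: multiple LoRA adapters compiled with one base model and
# mixed within a single batch at runtime.
# QEffAutoLoraModelForCausalLM, load_adapter, and
# prompt_to_adapter_mapping are assumed (hypothetical) names.
from transformers import AutoTokenizer

from QEfficient import QEffAutoLoraModelForCausalLM

base = "mistralai/Mistral-7B-v0.1"  # placeholder base model
model = QEffAutoLoraModelForCausalLM.from_pretrained(base)

# Register two adapters (placeholder repos) under short names
# that are referenced again at generation time.
model.load_adapter("predibase/gsm8k", "math")
model.load_adapter("predibase/magicoder", "code")

model.compile(num_cores=16)

tokenizer = AutoTokenizer.from_pretrained(base)

# Each prompt selects its own adapter within the same batch.
model.generate(
    prompts=["What is 12 * 7?", "Write hello world in Python."],
    tokenizer=tokenizer,
    prompt_to_adapter_mapping=["math", "code"],
)
```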