# Validated Models

## Text-only Language Models

### Text Generation Task

**QEff Auto Class:** `QEFFAutoModelForCausalLM`
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| MolmoForCausalLM | Molmo① | | ✕ |
| Olmo2ForCausalLM | OLMo-2 | | ✔️ |
| FalconForCausalLM | Falcon② | | ✔️ |
| Qwen3MoeForCausalLM | Qwen3Moe | | ✔️ |
| GemmaForCausalLM | CodeGemma | | ✔️ |
| | Gemma③ | google/gemma-2b | ✔️ |
| GptOssForCausalLM | GPT-OSS | | ✔️ |
| GPTBigCodeForCausalLM | Starcoder1.5 | | ✔️ |
| | Starcoder2 | | ✔️ |
| GPTJForCausalLM | GPT-J | | ✔️ |
| GPT2LMHeadModel | GPT-2 | | ✔️ |
| GraniteForCausalLM | Granite 3.1 | ibm-granite/granite-3.1-8b-instruct | ✔️ |
| | Granite 20B | ibm-granite/granite-20b-code-base-8k | ✔️ |
| InternVLChatModel | Intern-VL① | | ✔️ |
| LlamaForCausalLM | CodeLlama | codellama/CodeLlama-7b-hf | ✔️ |
| | DeepSeek-R1-Distill-Llama | | ✔️ |
| | InceptionAI-Adapted | inceptionai/jais-adapted-7b | ✔️ |
| | Llama 3.3 | | ✔️ |
| | Llama 3.2 | | ✔️ |
| | Llama 3.1 | | ✔️ |
| | Llama 3 | | ✔️ |
| | Llama 2 | meta-llama/Llama-2-7b-chat-hf | ✔️ |
| | Vicuna | lmsys/vicuna-13b-delta-v0 | ✔️ |
| MistralForCausalLM | Mistral | | ✔️ |
| MixtralForCausalLM | Codestral | | ✔️ |
| Phi3ForCausalLM | Phi-3②, Phi-3.5② | | ✔️ |
| QwenForCausalLM | DeepSeek-R1-Distill-Qwen | | ✔️ |
| | Qwen2, Qwen2.5 | | ✔️ |
| LlamaSwiftKVForCausalLM | swiftkv | | ✔️ |
| Grok1ModelForCausalLM | grok-1② | | ✕ |
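Loading any of the causal LM families above follows the same pattern. The sketch below is a hedged example of typical `QEFFAutoModelForCausalLM` usage; the model id, `num_cores` value, and prompt are illustrative assumptions, and since compiling and generating require the QEfficient package plus a Cloud AI 100 device, the device-bound part is left disabled by a flag:

```python
# Hedged sketch of QEFFAutoModelForCausalLM usage; the model id, compile
# arguments, and prompt are illustrative assumptions, not requirements.
RUN_ON_DEVICE = False  # enable on a machine with QEfficient and an AI 100 card

if RUN_ON_DEVICE:
    from transformers import AutoTokenizer
    from QEfficient import QEFFAutoModelForCausalLM

    model_id = "gpt2"  # any validated architecture from the table above
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = QEFFAutoModelForCausalLM.from_pretrained(model_id)
    model.compile(num_cores=14)  # export to ONNX and compile into a QPC
    model.generate(prompts=["Hello, world"], tokenizer=tokenizer)
```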
## Embedding Models

### Text Embedding Task

**QEff Auto Class:** `QEFFAutoModel`
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| BertModel | BERT-based | BAAI/bge-base-en-v1.5 | ✔️ |
| MPNetForMaskedLM | MPNet | | ✔️ |
| NomicBertModel | NomicBERT② | | ✕ |
| RobertaModel | RoBERTa | ibm-granite/granite-embedding-30m-english | ✔️ |
| XLMRobertaForSequenceClassification | XLM-RoBERTa | | ✔️ |
| XLMRobertaModel | XLM-RoBERTa | ibm-granite/granite-embedding-107m-multilingual | ✔️ |
## Multimodal Language Models

### Vision-Language Models (Text + Image Generation)

**QEff Auto Class:** `QEFFAutoModelForImageTextToText`
| Architecture | Model Family | Representative Models | QEff Single QPC | QEff Dual QPC | vLLM Single QPC | vLLM Dual QPC |
|---|---|---|---|---|---|---|
| LlavaForConditionalGeneration | LLaVA-1.5 | | ✔️ | ✔️ | ✔️ | ✔️ |
| MllamaForConditionalGeneration | Llama 3.2 | meta-llama/Llama-3.2-11B-Vision-Instruct | ✔️ | ✔️ | ✔️ | ✔️ |
| LlavaNextForConditionalGeneration | Granite Vision | | ✕ | ✔️ | ✕ | ✔️ |
| Llama4ForConditionalGeneration | Llama-4-Scout | | ✔️ | ✔️ | ✔️ | ✔️ |
| Gemma3ForConditionalGeneration | Gemma3③ | | ✔️ | ✔️ | ✕ | ✕ |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | | ✔️ | ✔️ | ✕ | ✔️ |
| Mistral3ForConditionalGeneration | Mistral3 | | ✕ | ✔️ | ✕ | ✕ |
**Dual QPC:** In the dual QPC (Qualcomm Program Container) setup, the model is split across two containers:

- The vision encoder runs in one QPC.
- The language model (responsible for output generation) runs in a separate QPC.
- The outputs of the vision encoder are transferred to the language model.

The dual QPC approach adds the flexibility to run the vision and language components independently.

**Single QPC:** In the single QPC setup, the entire model, including both image encoding and text generation, runs within a single QPC. There is no model splitting; all components operate within the same execution environment.
**Note:** The choice between single and dual QPC is made at model instantiation via the `kv_offload` argument: `kv_offload=True` selects dual QPC mode, and `kv_offload=False` selects single QPC mode.
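To make the mapping concrete, here is a small sketch. The helper simply encodes the rule stated in the note; the `from_pretrained` call is a hedged example of typical QEfficient usage (the model id is illustrative, and the call requires the QEfficient package and a supported device), so it is left disabled by a flag:

```python
# Sketch of the kv_offload -> QPC mode mapping described in the note above.

def qpc_mode(kv_offload: bool) -> str:
    """Return the execution mode selected by kv_offload:
    True  -> dual QPC (vision encoder and language model in separate QPCs),
    False -> single QPC (the whole model in one QPC)."""
    return "dual" if kv_offload else "single"

RUN_ON_DEVICE = False  # enable on a machine with QEfficient installed

if RUN_ON_DEVICE:
    from QEfficient import QEFFAutoModelForImageTextToText

    # Hedged example call; the model id is illustrative.
    model = QEFFAutoModelForImageTextToText.from_pretrained(
        "meta-llama/Llama-3.2-11B-Vision-Instruct",
        kv_offload=True,  # dual QPC; set False for single QPC
    )
```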
## Audio Models

### Automatic Speech Recognition (Transcription) Task

**QEff Auto Class:** `QEFFAutoModelForSpeechSeq2Seq`
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| Whisper | Whisper | openai/whisper-tiny | ✔️ |
| Wav2Vec2 | Wav2Vec2 | | |
## Diffusion Models

### Image Generation Models

**QEff Auto Class:** `QEffFluxPipeline`
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| FluxPipeline | FLUX.1 | | |
### Video Generation Models

**QEff Auto Class:** `QEffWanPipeline`
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| WanPipeline | Wan2.2 | | |
**Note**

- ① Intern-VL and Molmo are Vision-Language Models, but they use `QEFFAutoModelForCausalLM` for inference to stay compatible with HuggingFace Transformers.
- ② Set `trust_remote_code=True` for end-to-end inference with vLLM.
- ③ Pass `disable_sliding_window` for some model families when using vLLM.
## Models Coming Soon

| Architecture | Model Family | Representative Models |
|---|---|---|
| NemotronHForCausalLM | NVIDIA Nemotron v3 | |
| Sam3Model | | facebook/sam3 |
| StableDiffusionModel | HiDream-ai | |
| MistralLarge3Model | Mistral Large 3 | |