# vLLM supported models

## Text-only Language Models
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| MolmoForCausalLM | Molmo | allenai/Molmo-7B-D-0924 | ✕ |
| Olmo2ForCausalLM | OLMo-2 | allenai/OLMo-2-0425-1B | ✔️ |
| FalconForCausalLM | Falcon | tiiuae/falcon-40b | ✔️ |
| Qwen3MoeForCausalLM | Qwen3Moe | Qwen/Qwen3-30B-A3B-Instruct-2507 | ✔️ |
| GemmaForCausalLM | CodeGemma / Gemma | google/codegemma-2b, google/gemma-7b | ✔️ |
| GptOssForCausalLM | GPT-OSS | openai/gpt-oss-20b | ✔️ |
| GPTBigCodeForCausalLM | StarCoder | bigcode/starcoder | ✔️ |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b | ✔️ |
| GraniteForCausalLM | Granite | ibm-granite/granite-3.1-8b-instruct | ✔️ |
| LlamaForCausalLM | LLaMA | meta-llama/Llama-3.1-8B | ✔️ |
| MistralForCausalLM | Mistral | mistralai/Mistral-7B-Instruct-v0.1 | ✔️ |
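Any validated model above can be loaded through vLLM's standard offline inference API. A minimal sketch, assuming vLLM is installed and a supported accelerator is available (the model name is one validated entry from the table; `build_prompts` is an illustrative helper, not part of vLLM):

```python
def build_prompts(questions):
    """Illustrative helper: format bare questions as completion prompts."""
    return [f"Question: {q}\nAnswer:" for q in questions]


try:
    from vllm import LLM, SamplingParams  # requires a vLLM installation
except ImportError:
    LLM = None

if LLM is not None:
    # Load one of the validated text-only models from the table above.
    llm = LLM(model="meta-llama/Llama-3.1-8B")
    params = SamplingParams(temperature=0.0, max_tokens=64)
    for out in llm.generate(build_prompts(["What does vLLM do?"]), params):
        print(out.outputs[0].text)
```

Deployment-specific flags (device selection, tensor parallelism, quantization) are omitted here; pass them to the `LLM` constructor as your installation requires.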
## Embedding Models
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| BertModel | BERT-based | BAAI/bge-base-en-v1.5 | ✔️ |
| MPNetForMaskedLM | MPNet | sentence-transformers/multi-qa-mpnet-base-cos-v1 | ✔️ |
| NomicBertModel | NomicBERT | nomic-ai/nomic-embed-text-v1.5 | ✕ |
| RobertaModel | RoBERTa | ibm-granite/granite-embedding-125m-english | ✔️ |
| XLMRobertaModel | XLM-RoBERTa | ibm-granite/granite-embedding-278m-multilingual | ✔️ |
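Embedding models run under vLLM's embedding task rather than generation. A minimal sketch, assuming a recent vLLM that accepts `task="embed"` and exposes `LLM.embed` (verify against your installed version); `cosine_similarity` is an illustrative helper:

```python
def cosine_similarity(a, b):
    """Illustrative helper: cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


try:
    from vllm import LLM  # requires a vLLM installation
except ImportError:
    LLM = None

if LLM is not None:
    # task="embed" switches vLLM from text generation to embedding output.
    llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed")
    outs = llm.embed([
        "vLLM supports embedding models.",
        "Which models can produce embeddings?",
    ])
    vecs = [o.outputs.embedding for o in outs]
    print(cosine_similarity(vecs[0], vecs[1]))
```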
## Vision-Language Models
| Architecture | Model Family | Representative Models | vLLM Single QPC | vLLM Dual QPC |
|---|---|---|---|---|
| LlavaForConditionalGeneration | LLaVA-1.5 | llava-hf/llava-1.5-7b-hf | ✔️ | ✔️ |
| MllamaForConditionalGeneration | LLaMA-3.2-Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | ✔️ | ✔️ |
| LlavaNextForConditionalGeneration | Granite Vision | ibm-granite/granite-vision-3.2-2b | ✕ | ✔️ |
| Qwen2_5_VLForConditionalGeneration | Qwen2.5-VL | Qwen/Qwen2.5-VL-3B-Instruct | ✕ | ✔️ |
| Gemma3ForConditionalGeneration | Gemma-3 | google/gemma-3-4b-it | ✕ | ✕ |
### Notes

- A check mark (✔️) indicates validated vLLM support.
- A cross (✕) indicates that vLLM support is not validated.
- Single QPC and Dual QPC indicate different Qualcomm Program Container (QPC) deployment modes.
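Vision-language models take an image alongside the prompt via vLLM's multimodal input format. A minimal sketch for the LLaVA-1.5 entry above, assuming vLLM and Pillow are installed; the image path and the `llava_prompt` helper are illustrative, and the chat template should be checked against the model card:

```python
def llava_prompt(question):
    """Illustrative helper: LLaVA-1.5-style prompt with an image placeholder."""
    return f"USER: <image>\n{question} ASSISTANT:"


try:
    from vllm import LLM          # requires a vLLM installation
    from PIL import Image         # requires Pillow
except ImportError:
    LLM = None

if LLM is not None:
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
    image = Image.open("example.jpg")  # placeholder path
    outputs = llm.generate({
        "prompt": llava_prompt("What is shown in this image?"),
        "multi_modal_data": {"image": image},
    })
    print(outputs[0].outputs[0].text)
```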
## Audio Models (Speech-to-Text)
| Architecture | Model Family | Representative Models | vLLM Support |
|---|---|---|---|
| WhisperForConditionalGeneration | Whisper | openai/whisper-tiny, openai/whisper-base, openai/whisper-small, openai/whisper-medium, openai/whisper-large-v2, openai/whisper-large-v3-turbo | ✔️ |
### Notes

- A check mark (✔️) indicates validated vLLM support.
- A cross (✕) indicates that vLLM support is not validated.
- Audio models are supported under the Speech-to-Text task using the Whisper architecture.
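Whisper transcription in vLLM passes raw audio as multimodal data with a transcription-start prompt. A sketch under stated assumptions: it follows the shape of vLLM's Whisper example, uses librosa only as one possible audio loader, and the prompt token and `max_model_len` value should be verified against your vLLM version. `chunk_audio` is an illustrative helper reflecting Whisper's 30-second processing window:

```python
def chunk_audio(samples, sample_rate, window_s=30):
    """Illustrative helper: split audio into the 30-second windows Whisper expects."""
    n = sample_rate * window_s
    return [samples[i:i + n] for i in range(0, len(samples), n)]


try:
    from vllm import LLM, SamplingParams  # requires a vLLM installation
    import librosa                        # assumed loader; any (samples, rate) source works
except ImportError:
    LLM = None

if LLM is not None:
    llm = LLM(model="openai/whisper-small", max_model_len=448)
    samples, rate = librosa.load("speech.wav", sr=16000)  # placeholder path
    for chunk in chunk_audio(samples, rate):
        out = llm.generate({
            "prompt": "<|startoftranscript|>",
            "multi_modal_data": {"audio": (chunk, rate)},
        }, SamplingParams(max_tokens=200))
        print(out[0].outputs[0].text)
```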
For more details, see: Validated Models and vLLM Support — Efficient Transformers Documentation