Guided Decoding

Guided decoding is a technique used in natural language generation (especially with LLMs) to constrain or steer the output during decoding so that it follows certain rules, formats, or conditions.

Normally, models generate text token by token using strategies such as greedy decoding, beam search, or sampling.

Guided decoding adds constraints or guidance on top of these strategies, such as grammar constraints, keyword constraints, structural constraints, or semantic constraints.
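To make the mechanism concrete, here is a minimal, self-contained sketch (not vLLM's actual implementation) of one common approach: at each step, the logits of tokens that would violate the constraint are masked out before picking the next token. The toy vocabulary, the `fake_logits` mock model, and the choice list are all illustrative assumptions:

```python
VOCAB = ["Pos", "itive", "Neg", "ative", "Neutral", "<eos>"]

def constrained_greedy_decode(logits_fn, choices, max_steps=10):
    """Greedy decoding restricted so the output is always a prefix of some choice."""
    out = ""
    for _ in range(max_steps):
        logits = logits_fn(out)
        # A token is allowed if appending it keeps the text a prefix of at
        # least one choice; <eos> is allowed only once a choice is complete.
        allowed = []
        for tok in VOCAB:
            if tok == "<eos>":
                allowed.append(out in choices)
            else:
                allowed.append(any(c.startswith(out + tok) for c in choices))
        if not any(allowed):
            break
        # Masking in effect: pick the highest-scoring token among allowed ones.
        best = max((i for i in range(len(VOCAB)) if allowed[i]),
                   key=lambda i: logits[i])
        if VOCAB[best] == "<eos>":
            break
        out += VOCAB[best]
    return out

def fake_logits(prefix):
    """Mock model: unconstrained, it would pick 'Neutral' every step."""
    scores = {"Pos": 0.1, "itive": 2.0, "Neg": 0.2, "ative": 1.5,
              "Neutral": 3.0, "<eos>": 0.0}
    return [scores[t] for t in VOCAB]
```

With `choices=["Positive", "Negative"]`, the mask forbids the model's favorite token ("Neutral") and decoding yields "Negative".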

Supported Backends

The following backends are available; their support status on this release is listed below.

| Backend | Supported constraints | Use cases | Status |
| --- | --- | --- | --- |
| xgrammar | JSON schema (limited), regex, EBNF grammar, choice, structural tags | Good for structured outputs | SUPPORTED |
| guidance (llguidance) | Full regex, JSON schema, EBNF grammar via Lark, choices, structural tags | For complex constraints | NOT SUPPORTED |
| outlines | Regex, JSON schema, EBNF (via Lark), choices | Does not support structural tags | NOT SUPPORTED |
| lm-format-enforcer | Regex, choice | Lightweight; does not support grammar, JSON objects, or structural tags | NOT SUPPORTED |
| faster-outlines | Same as outlines, but CPU-optimized | Optimized for speed on CPU | NOT SUPPORTED |
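To illustrate in principle how the regex-capable backends above enforce a pattern, here is a hedged character-level sketch: a hypothetical pattern `\d{3}-\d{4}` is expanded by hand into per-position character classes, and decoding only ever picks a legal character. Real backends compile the regex into an automaton over tokens rather than characters, but the idea is the same:

```python
import string

# Hypothetical pattern \d{3}-\d{4}, expanded into per-position character
# classes (real backends build a DFA/automaton from the regex instead).
PATTERN_CLASSES = [string.digits] * 3 + ["-"] + [string.digits] * 4

def allowed_next_chars(prefix):
    """Characters that keep `prefix` a valid prefix of the pattern."""
    if len(prefix) >= len(PATTERN_CLASSES):
        return set()  # pattern complete; nothing more may be emitted
    return set(PATTERN_CLASSES[len(prefix)])

def regex_constrained_decode(score):
    """Greedy char-level decoding: always pick the best-scoring legal char."""
    out = ""
    while True:
        allowed = allowed_next_chars(out)
        if not allowed:
            return out
        out += max(allowed, key=score)

# Mock "model" that simply prefers characters with higher code points,
# so it picks '9' wherever a digit is legal and '-' where forced.
result = regex_constrained_decode(ord)
```

Whatever the model's preferences, the output always matches the pattern; here the mock scorer produces `999-9999`.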

Supported Models

  • Qwen2.5-VL-32B-Instruct

  • Llama-3.3-70B-Instruct

Limitations

  • Currently, guided decoding is supported only on the vLLM v0.8.5 branch with QEfficient version release/v1.21.0.

  • Make sure your vLLM installation uses v0.8.5 and your QEfficient version is release/v1.21.0.

Example

The example script for a basic guided decoding test is located in:

vllm/examples/offline_inference/

Run it from the vllm repository root:

python3 examples/offline_inference/qaic_on_device_sampling.py

The script includes a basic schema and tests guided decoding. Modify the model details, BS, and TS based on your test requirements.
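As a rough sketch of what schema-guided output looks like, here is a toy example; the actual schema used by `qaic_on_device_sampling.py` may differ, and `SCHEMA` and `schema_guided_fill` are hypothetical names for illustration. The point is that the JSON structure (keys, quoting, braces) is forced by the schema, while only the enum value is left to the model:

```python
import json

# Hypothetical schema, similar in spirit to a basic guided-decoding test.
SCHEMA = {
    "type": "object",
    "properties": {"sentiment": {"enum": ["positive", "negative", "neutral"]}},
    "required": ["sentiment"],
}

def schema_guided_fill(pick):
    """Emit JSON matching SCHEMA: the structure is fixed by the schema;
    only the enum value is chosen by the (mock) model via `pick`."""
    options = SCHEMA["properties"]["sentiment"]["enum"]
    return json.dumps({"sentiment": pick(options)})

# Mock model decision: "choose" the second enum option.
output = schema_guided_fill(lambda options: options[1])
```

The output is always valid JSON that conforms to the schema, regardless of which enum value the model picks.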