Guided Decoding

Guided decoding is a technique used in natural language generation (especially with LLMs) to constrain or steer the output during decoding so that it follows certain rules, formats, or conditions.

Normally, models generate text token by token using strategies such as greedy decoding, beam search, or sampling.

Guided decoding adds constraints or guidance on top of these strategies, such as grammar constraints, keyword constraints, structural constraints, or semantic constraints.
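To make the mechanism concrete, here is a minimal, self-contained sketch (not vLLM's actual implementation) of one common approach: at each step, the logits of tokens that would violate the constraint are masked out before picking the next token. The toy vocabulary, the `fake_logits` mock model, and the choice list are all illustrative assumptions:

```python
VOCAB = ["Pos", "itive", "Neg", "ative", "Neutral", "<eos>"]

def constrained_greedy_decode(logits_fn, choices, max_steps=10):
    """Greedy decoding restricted so the output is always a prefix of some choice."""
    out = ""
    for _ in range(max_steps):
        logits = logits_fn(out)
        # A token is allowed if appending it keeps the text a prefix of at
        # least one choice; <eos> is allowed only once a choice is complete.
        allowed = []
        for tok in VOCAB:
            if tok == "<eos>":
                allowed.append(out in choices)
            else:
                allowed.append(any(c.startswith(out + tok) for c in choices))
        if not any(allowed):
            break
        # Masking in effect: pick the highest-scoring token among allowed ones.
        best = max((i for i in range(len(VOCAB)) if allowed[i]),
                   key=lambda i: logits[i])
        if VOCAB[best] == "<eos>":
            break
        out += VOCAB[best]
    return out

def fake_logits(prefix):
    """Mock model: unconstrained, it would pick 'Neutral' every step."""
    scores = {"Pos": 0.1, "itive": 2.0, "Neg": 0.2, "ative": 1.5,
              "Neutral": 3.0, "<eos>": 0.0}
    return [scores[t] for t in VOCAB]
```

With `choices=["Positive", "Negative"]`, the mask forbids the model's favorite token ("Neutral") and decoding yields "Negative".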

Supported Backends

The following backends are available; their support status on this release is listed below.

| Backend | Supported constraints | Use cases | Status |
| --- | --- | --- | --- |
| xgrammar | JSON schema (limited), regex, EBNF grammar, choice, structural tags | Good for structured outputs | SUPPORTED |
| guidance (llguidance) | Full regex, JSON schema, EBNF grammar via Lark, choices, structural tags | For complex constraints | NOT SUPPORTED |
| outlines | Regex, JSON schema, EBNF (via Lark), choices | Does not support structural tags | NOT SUPPORTED |
| lm-format-enforcer | Regex, choice | Lightweight; does not support grammar, JSON objects, or structural tags | NOT SUPPORTED |
| faster-outlines | Same as outlines, but CPU-optimized | Optimized for speed on CPU | NOT SUPPORTED |
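To illustrate in principle how the regex-capable backends above enforce a pattern, here is a hedged character-level sketch: a hypothetical pattern `\d{3}-\d{4}` is expanded by hand into per-position character classes, and decoding only ever picks a legal character. Real backends compile the regex into an automaton over tokens rather than characters, but the idea is the same:

```python
import string

# Hypothetical pattern \d{3}-\d{4}, expanded into per-position character
# classes (real backends build a DFA/automaton from the regex instead).
PATTERN_CLASSES = [string.digits] * 3 + ["-"] + [string.digits] * 4

def allowed_next_chars(prefix):
    """Characters that keep `prefix` a valid prefix of the pattern."""
    if len(prefix) >= len(PATTERN_CLASSES):
        return set()  # pattern complete; nothing more may be emitted
    return set(PATTERN_CLASSES[len(prefix)])

def regex_constrained_decode(score):
    """Greedy char-level decoding: always pick the best-scoring legal char."""
    out = ""
    while True:
        allowed = allowed_next_chars(out)
        if not allowed:
            return out
        out += max(allowed, key=score)

# Mock "model" that simply prefers characters with higher code points,
# so it picks '9' wherever a digit is legal and '-' where forced.
result = regex_constrained_decode(ord)
```

Whatever the model's preferences, the output always matches the pattern; here the mock scorer produces `999-9999`.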

Supported Models

  • Qwen2.5-VL-32B-Instruct

  • Llama-3.3-70B-Instruct

Limitations

  • Currently, guided decoding is supported only on the vLLM v0.8.5 branch with QEfficient version release/v1.21.0.

  • Make sure your vLLM installation uses v0.8.5 and your QEfficient version is release/v1.21.0.

Example

The example script for a basic guided decoding test is located in:

vllm/examples/offline_inference/

Run it from the vllm repository root:

python3 examples/offline_inference/qaic_on_device_sampling.py

The script includes a basic schema and tests guided decoding. Modify the model details, BS, and TS based on your test requirements.
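As a rough sketch of what schema-guided output looks like, here is a toy example; the actual schema used by `qaic_on_device_sampling.py` may differ, and `SCHEMA` and `schema_guided_fill` are hypothetical names for illustration. The point is that the JSON structure (keys, quoting, braces) is forced by the schema, while only the enum value is left to the model:

```python
import json

# Hypothetical schema, similar in spirit to a basic guided-decoding test.
SCHEMA = {
    "type": "object",
    "properties": {"sentiment": {"enum": ["positive", "negative", "neutral"]}},
    "required": ["sentiment"],
}

def schema_guided_fill(pick):
    """Emit JSON matching SCHEMA: the structure is fixed by the schema;
    only the enum value is chosen by the (mock) model via `pick`."""
    options = SCHEMA["properties"]["sentiment"]["enum"]
    return json.dumps({"sentiment": pick(options)})

# Mock model decision: "choose" the second enum option.
output = schema_guided_fill(lambda options: options[1])
```

The output is always valid JSON that conforms to the schema, regardless of which enum value the model picks.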