Advanced Features
2026-03-21
This guide covers capabilities beyond basic text generation: constrained output via grammars, vision/multimodal models, embeddings, LoRA adapters, speculative decoding, and extended context via context shifting.
Grammar-Constrained Generation
llama.cpp can constrain model output to match a formal grammar, ensuring the model always produces valid JSON, SQL, or any custom format — regardless of which model you use.
GBNF Grammar Format
Grammars are defined in GBNF (GGML BNF), a variant of Backus-Naur Form:
# Grammar for a name-age JSON object
root ::= "{" ws name-kv "," ws age-kv "}"
name-kv ::= "\"name\"" ws ":" ws string
age-kv ::= "\"age\"" ws ":" ws number
string ::= "\"" [a-zA-Z ]+ "\""
number ::= [0-9]+
ws ::= [ \t\n]*
Built-in grammars (in the grammars/ directory):
| File | Constrains output to |
|---|---|
| json.gbnf | Any valid JSON value |
| json_arr.gbnf | JSON array of objects |
| list.gbnf | Markdown bullet list |
| chess.gbnf | UCI chess moves |
| c.gbnf | C code |
| arithmetic.gbnf | Arithmetic expressions |
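Grammars can also be applied programmatically. Below is a minimal sketch using the llama-cpp-python bindings (see Python Bindings); the model path is a placeholder, and the grammar only permits the literal answers "yes" or "no":
from llama_cpp import Llama, LlamaGrammar
# Hypothetical local path: point this at your own GGUF file
llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)
# Compile a tiny GBNF grammar that only allows the words "yes" or "no"
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')
out = llm(
    "Is the sky blue? Answer yes or no: ",
    grammar=grammar,  # constrains token sampling to the grammar
    max_tokens=4,
)
print(out["choices"][0]["text"])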
Using Grammars via CLI
# Use built-in JSON grammar
llama-cli -m model.gguf \
-p "List 3 European capitals as JSON array of objects with name and country:" \
--grammar-file grammars/json_arr.gbnf
# Custom grammar file
llama-cli -m model.gguf \
-p "Generate a task:" \
--grammar-file my_task.gbnf
JSON Schema Mode (Easiest)
Instead of writing a grammar, provide a JSON Schema directly:
llama-cli -m model.gguf \
-p "Extract: Alice is 34, senior engineer at Acme Corp" \
--json-schema '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"},"title":{"type":"string"},"company":{"type":"string"}},"required":["name","age"]}'Via the Server API
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{
"prompt": "Generate a color in hex format:",
"grammar": "root ::= \"#\" [0-9a-fA-F]{6}"
}'
Or with JSON schema:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role":"user","content":"Generate 3 user records"}],
"response_format": {
"type": "json_object",
"schema": {
"type":"array",
"items":{"type":"object","properties":{"id":{"type":"integer"},"name":{"type":"string"},"email":{"type":"string"}}}
}
}
}'
Multimodal: LLaVA and Vision Models
llama.cpp supports LLaVA (Large Language and Vision Assistant) and similar multimodal architectures that combine a vision encoder with a language model.
Supported Multimodal Models
| Model | Type | Notes |
|---|---|---|
| LLaVA-1.5 | Image + text | Original architecture |
| LLaVA-1.6 / LLaVA-NeXT | Image + text | Improved tiles |
| MobileVLM | Lightweight | Mobile-optimized |
| Qwen2-VL | Image + text | Strong OCR capability |
| InternVL2 | Image + text | High benchmark scores |
| SmolVLM | Image + text | 256M–2B params; very small and fast |
File Requirements
Multimodal models require two files:
- Language model GGUF (model.gguf)
- Multimodal projector GGUF (mmproj-model.gguf) — encodes images into the language model space
# Download both files for LLaVA-1.5
huggingface-cli download mys/ggml_llava-v1.5-7b \
ggml-model-q4_k.gguf \
mmproj-model-f16.gguf
Using llava-cli
./build/bin/llava-cli \
-m ggml-model-q4_k.gguf \
--mmproj mmproj-model-f16.gguf \
--image photo.jpg \
-p "Describe what you see in this image"Multimodal via Server API
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What is in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<BASE64_DATA>"}}
]
}]
}'
Start the server with both model files:
llama-server \
-m ggml-model-q4_k.gguf \
--mmproj mmproj-model-f16.gguf \
--port 8080
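To send a local image from code instead of curl, the image bytes must be base64-encoded into a data URL. A minimal sketch against the server above, using Python's standard library plus the third-party requests package; the file name photo.jpg is a placeholder:
import base64
import requests
# Read a local image (placeholder file name) and build a base64 data URL
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
payload = {
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64," + image_b64}},
        ],
    }]
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])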
Embeddings
Embeddings convert text into dense numeric vectors for semantic search, RAG pipelines, and clustering.
Recommended Embedding Models
| Model | Dimensions | Notes |
|---|---|---|
| nomic-embed-text-v1.5 | 768 | Good balance; 8192 token ctx |
| mxbai-embed-large-v1 | 1024 | High quality; 512 token ctx |
| all-MiniLM-L6-v2 | 384 | Fast and small |
| gte-large | 1024 | Strong multilingual |
| bge-m3 | 1024 | Multilingual, supports 8192 ctx |
CLI: llama-embedding
./build/bin/llama-embedding \
-m nomic-embed-text-v1.5-q4_k_m.gguf \
--log-disable \
-p "The quick brown fox"Server API: POST /v1/embeddings
curl http://localhost:8080/v1/embeddings \
-d '{"input": "text to embed", "model": "nomic-embed-text"}'Start embedding server:
llama-server -m nomic-embed-text-v1.5-q4_k_m.gguf --embedding
Normalization
Always L2-normalize embedding vectors before similarity comparison:
import numpy as np
def normalize(v):
return v / np.linalg.norm(v)
a = normalize(np.array(embedding_a))
b = normalize(np.array(embedding_b))
similarity = np.dot(a, b) # cosine similarity
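As a usage sketch, the normalization above can be combined with the embedding server started earlier. This assumes the server is running on localhost:8080, returns the OpenAI-style body with a data[0].embedding field, and that requests is installed:
import numpy as np
import requests
def embed(text):
    # Request an embedding from the local llama-server (OpenAI-compatible endpoint)
    resp = requests.post("http://localhost:8080/v1/embeddings",
                         json={"input": text, "model": "nomic-embed-text"})
    return np.array(resp.json()["data"][0]["embedding"])
def normalize(v):
    return v / np.linalg.norm(v)
a = normalize(embed("The quick brown fox"))
b = normalize(embed("A fast auburn fox"))
print(np.dot(a, b))  # cosine similarity, near 1.0 for semantically similar texts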
LoRA Adapters
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique. You can load LoRA adapters on top of a base GGUF model to get fine-tuned behavior without a separate full model.
Runtime LoRA Application
# Apply a single LoRA adapter
llama-cli -m base-model.gguf \
--lora lora-adapter.gguf \
-p "Continue the story..."
# Apply with custom scale (1.0 = full strength)
llama-cli -m base-model.gguf \
--lora-scaled lora-adapter.gguf 0.7
# Stack multiple LoRA adapters
llama-cli -m base-model.gguf \
--lora lora-style.gguf \
--lora lora-persona.gguf
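The Python bindings can apply an adapter at load time as well. A sketch assuming llama-cpp-python; the lora_path and lora_scale constructor arguments should be checked against your installed version, and the file names are placeholders:
from llama_cpp import Llama
llm = Llama(
    model_path="base-model.gguf",   # base GGUF model
    lora_path="lora-adapter.gguf",  # adapter converted to GGUF
    lora_scale=0.7,                 # 1.0 = full strength
)
print(llm("Continue the story...", max_tokens=64)["choices"][0]["text"])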
Converting LoRA Weights
# Convert a HuggingFace PEFT LoRA to GGUF
python convert_lora_to_gguf.py \
--base /path/to/base-hf-model \
--lora /path/to/peft-lora \
--outfile lora-adapter.gguf
Via Server API
curl http://localhost:8080/v1/chat/completions \
-d '{
"messages": [{"role":"user","content":"Write a Hemingway-style short story opening"}],
"lora": [{"id": 0, "scale": 1.0}]
}'
List available LoRAs pre-loaded in the server with GET /lora-adapters.
Speculative Decoding
Speculative decoding uses a small, fast draft model to predict multiple tokens ahead, which the large target model then verifies in parallel. When the draft is correct, multiple tokens are emitted per step — giving effectively more than 1 token per target model step.
How It Works
- Draft model generates N candidate tokens cheaply
- Target model evaluates all N candidates in a single parallel pass
- Accept matching candidates in order and advance; at the first mismatch, discard the remaining draft tokens
- Repeat
Expected speedup: 1.5–3× for well-matched model pairs.
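The accept/verify loop is easiest to see as pseudocode. Below is a conceptual Python sketch of the greedy variant; draft_next and target_greedy are hypothetical stand-ins for the two models, not llama.cpp APIs:
def speculative_step(tokens, draft_next, target_greedy, n_draft=8):
    # draft_next(tokens) -> one next token from the small draft model (hypothetical helper)
    # target_greedy(tokens, k) -> the target model's greedy token at each of the k
    #   positions after `tokens`, computed in a single parallel pass (hypothetical helper)
    draft, ctx = [], list(tokens)
    for _ in range(n_draft):                 # 1. draft N candidate tokens cheaply
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    target = target_greedy(tokens, n_draft)  # 2. verify all N in one target pass
    accepted = []
    for d, t in zip(draft, target):          # 3. accept matches, stop at first mismatch
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)               # keep the target's token at the mismatch
            break
    return accepted                          # often more than 1 token per target pass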
Using llama-speculative
./build/bin/llama-speculative \
-m target-llama-3.1-70b-q4.gguf \
--model-draft draft-llama-3.2-1b-q8.gguf \
--n-gpu-layers 80 \
--n-gpu-layers-draft 32 \
-p "Write a technical blog post about Rust async" \
-n 512
Draft Model Selection
| Target Model | Recommended Draft | Expected Speedup |
|---|---|---|
| Llama-3.1-70B | Llama-3.2-1B-Instruct | 1.5–2× |
| Llama-3.1-8B | Llama-3.2-1B-Instruct | 1.3–1.8× |
| Mixtral-8x7B | Mistral-7B | 1.4–1.9× |
The draft model must come from the same model family (same tokenizer and vocabulary) for its tokens to align usefully with the target.
Context Shifting (Infinite Context Approximation)
When the context fills up, llama.cpp can shift the window forward, discarding old tokens while retaining recent context:
llama-cli -m model.gguf \
--ctx-size 4096 \
--keep 256 \
-i
--keep N preserves the first N tokens permanently (typically the system prompt). When the context fills, older tokens beyond the keep region are dropped, and the window slides forward.
This allows effectively infinite-length conversations, at the cost of the model forgetting the older context that is shifted out.
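The bookkeeping itself is simple to illustrate. A toy Python sketch of the shift (token IDs are stand-ins, and the half-region discard is an assumption mirroring llama-cli's usual default rather than llama.cpp's actual code):
def shift_context(tokens, ctx_size, keep):
    # Nothing to do until the window is actually full
    if len(tokens) < ctx_size:
        return tokens
    # Drop about half of the shiftable region, always preserving the first
    # `keep` tokens such as the system prompt
    n_discard = (ctx_size - keep) // 2
    return tokens[:keep] + tokens[keep + n_discard:]
window = list(range(4096))                        # pretend the context is full
window = shift_context(window, ctx_size=4096, keep=256)
print(len(window), window[254:258])               # keep region intact, oldest tokens gone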
See Also
- CLI Usage — all flags including --grammar-file, --json-schema, --lora
- Server — grammar/JSON schema and LoRA via REST API
- Python Bindings — JSON schema and grammar via llama-cpp-python
- Performance Tuning — optimizing the inference engine
- GGUF & Quantization — converting LoRA adapters to GGUF