Cheatsheet
Quick reference for installation, all key flags, server API endpoints, Python bindings, and common recipes.
Installation
# Build from source (CPU)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build && cmake --build build --config Release -j$(nproc)
# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
# Build with Metal (macOS — auto-detected)
cmake -B build && cmake --build build --config Release -j$(nproc)
# Build with Vulkan
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)
# Python bindings (CPU)
pip install llama-cpp-python
# Python bindings (CUDA)
CMAKE_ARGS="-DGGML_CUDA=ON" pip install llama-cpp-python
# Download a model
pip install huggingface-hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf --local-dir ./models
Build Flags Quick Reference
| Flag | When to Use |
|---|---|
-DGGML_CUDA=ON | NVIDIA GPU |
-DGGML_METAL=ON | Force Metal (auto on macOS) |
-DGGML_VULKAN=ON | Cross-platform GPU (AMD/Intel/NVIDIA) |
-DGGML_HIP=ON | AMD GPU via ROCm |
-DGGML_BLAS=ON | CPU BLAS acceleration |
-DGGML_NATIVE=ON | Optimize for host CPU (-march=native) |
-DCMAKE_CUDA_ARCHITECTURES="native" | CUDA: auto-detect local GPU |
-DAMDGPU_TARGETS="gfx1100" | ROCm: specify AMD GPU arch |
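These flags combine in a single configure step; for example, a CUDA build tuned for the host CPU and the locally installed GPU (assumes the CUDA toolkit is present):
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="native" -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)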
llama-cli Essential Flags
Core
| Flag | Description |
|---|---|
-m FILE | Model path (GGUF) |
-p TEXT | Prompt text |
-f FILE | Prompt from file |
-n N | Max tokens to generate (-1 = unlimited) |
-c N | Context size |
-sys TEXT | System prompt |
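A minimal generation run combining the core flags (model path, prompt, and sizes are placeholders):
./build/bin/llama-cli -m ./models/model.gguf -p "Explain the KV cache in one sentence." -n 128 -c 4096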
GPU
| Flag | Description |
|---|---|
-ngl N | GPU layers (99 = all) |
--split-mode {layer,row} | Multi-GPU strategy |
--tensor-split A,B | VRAM ratio per GPU |
--main-gpu N | Primary GPU index |
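For example, full offload split across two GPUs at a 2:1 VRAM ratio (ratio and index values are illustrative):
./build/bin/llama-cli -m model.gguf -p "Hi" -ngl 99 --split-mode layer --tensor-split 2,1 --main-gpu 0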
Performance
| Flag | Description |
|---|---|
-t N | CPU threads (= physical cores) |
-tb N | CPU batch threads (= logical cores) |
-b N | Logical batch size |
-ub N | Micro batch size |
-fa | Flash Attention |
--mlock | Pin model in RAM |
--no-mmap | Load fully into RAM |
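An illustrative throughput-oriented run for an 8-core/16-thread CPU with full GPU offload (values are starting points, not tuned results):
./build/bin/llama-cli -m model.gguf -p "Hi" -ngl 99 -fa -t 8 -tb 16 -b 2048 -ub 512 --mlock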
KV Cache
| Flag | Description |
|---|---|
--cache-type-k TYPE | Key cache type (f16, q8_0, q4_0) |
--cache-type-v TYPE | Value cache type |
--keep N | Tokens to keep on context shift |
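Example of quantizing the KV cache to save memory at long context; a quantized V cache generally requires Flash Attention, so -fa is included (paths are placeholders):
./build/bin/llama-cli -m model.gguf -c 32768 -fa --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 -f long-prompt.txt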
Sampling
| Flag | Default | Description |
|---|---|---|
--temp N | 0.8 | Temperature |
--top-p N | 0.9 | Nucleus sampling |
--top-k N | 40 | Top-K sampling |
--repeat-penalty N | 1.0 | Repetition penalty |
-s N | -1 | Random seed |
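Example of overriding the sampling defaults with a fixed seed for reproducible output:
./build/bin/llama-cli -m model.gguf -p "Write a haiku about GPUs" --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1 -s 42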
Chat / Interactive
| Flag | Description |
|---|---|
-cnv | Conversation mode (auto chat template) |
-i | Interactive mode (raw) |
--chat-template NAME | Template: llama3, chatml, mistral, gemma |
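For instruct models, conversation mode with the model's embedded template is usually enough; the override is only needed when the GGUF metadata lacks a template:
./build/bin/llama-cli -m model.gguf -cnv
./build/bin/llama-cli -m model.gguf -cnv --chat-template llama3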
Structured Output
| Flag | Description |
|---|---|
--grammar-file FILE | GBNF grammar file |
--json-schema JSON | JSON schema string |
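A sketch of schema-constrained generation; the schema shown is an arbitrary example, not tied to any particular model:
./build/bin/llama-cli -m model.gguf -p "Return a user record as JSON." \
  --json-schema '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}},"required":["name","age"]}'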
Context & RoPE
| Flag | Description |
|---|---|
-c N | Context window size |
--rope-freq-base N | RoPE base freq (e.g. 500000 for Llama-3.1) |
--rope-scaling TYPE | none, linear, yarn |
--prompt-cache FILE | Save/load KV cache |
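Example long-context run; the RoPE base is normally read from GGUF metadata and is shown here only to illustrate the override, and the prompt cache avoids reprocessing the same long prompt on later runs (file names are placeholders):
./build/bin/llama-cli -m model.gguf -c 32768 --rope-freq-base 500000 --prompt-cache prompt.bin -f long-prompt.txt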
Advanced
| Flag | Description |
|---|---|
--lora FILE | Apply LoRA adapter |
--lora-scaled FILE SCALE | LoRA with scale (0.0–1.0) |
-r TEXT | Stop on reverse prompt |
--verbose | Verbose logging |
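Example of applying a converted LoRA adapter at half strength (paths and scale are placeholders):
./build/bin/llama-cli -m base.gguf --lora-scaled lora.gguf 0.5 -p "Hello" -n 128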
llama-server Additional Flags
| Flag | Description |
|---|---|
--host ADDR | Bind address (0.0.0.0 for network) |
--port N | Port (default: 8080) |
--api-key KEY | Require Bearer auth |
-np N | Parallel slots |
-cb | Continuous batching |
--mmproj FILE | Multimodal projector |
--embedding | Enable embedding endpoint |
--log-disable | Suppress server logs |
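A typical network-exposed server start (key, port, and slot count are illustrative):
./build/bin/llama-server -m model.gguf -ngl 99 --host 0.0.0.0 --port 8080 --api-key mysecret -np 4 -cb
Clients must then send an Authorization: Bearer header matching the key.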
Server API Quick Reference
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Server status |
| GET | /v1/models | List models |
| POST | /v1/chat/completions | OpenAI chat API |
| POST | /v1/completions | OpenAI text completions |
| POST | /v1/embeddings | Embeddings |
| POST | /completion | Native completion (grammar, etc.) |
| GET | /slots | Slot status |
| GET | /lora-adapters | List loaded LoRAs |
Minimal Curl Examples
# Chat
curl localhost:8080/v1/chat/completions \
-d '{"messages":[{"role":"user","content":"Hi"}]}'
# Embed
curl localhost:8080/v1/embeddings \
-d '{"input":"text to embed"}'
# Stream chat
curl localhost:8080/v1/chat/completions \
-d '{"messages":[{"role":"user","content":"Count to 5"}],"stream":true}'
# Grammar-constrained
curl localhost:8080/completion \
-d '{"prompt":"List 3 colors as JSON array:","grammar":"root ::= \"[\" items \"]\"\nitems ::= string | string \",\" items\nstring ::= \"\\\"\"\ [a-z]+ \"\\\"\""}'Python Bindings Quick Reference
from llama_cpp import Llama
# Load
llm = Llama(model_path="model.gguf", n_ctx=4096, n_gpu_layers=99, verbose=False)
# Complete
out = llm("Prompt here", max_tokens=128, temperature=0.7)
print(out["choices"][0]["text"])
# Chat
out = llm.create_chat_completion(
messages=[{"role":"user","content":"Hello"}],
max_tokens=128
)
print(out["choices"][0]["message"]["content"])
# Stream chat
for chunk in llm.create_chat_completion(messages=[...], stream=True):
print(chunk["choices"][0]["delta"].get("content",""), end="")
# Embed
embed = Llama(model_path="embed.gguf", embedding=True)
vec = embed.create_embedding("text")["data"][0]["embedding"]
# Tokenize
tokens = llm.tokenize(b"hello world")
text = llm.detokenize(tokens).decode("utf-8")
Quantization Quick Reference
| Type | Bits | 7B Size | Quality | Use When |
|---|---|---|---|---|
F16 | 16 | 14 GB | Reference | VRAM/RAM not constrained |
Q8_0 | 8 | 7.2 GB | Near-perfect | Highest quality with size savings |
Q6_K | 6.57 | 5.9 GB | Excellent | Quality-focused |
Q5_K_M | 5.68 | 5.1 GB | Very good | Quality + size balance |
Q4_K_M | 4.85 | 4.4 GB | Good | Default recommendation |
Q4_K_S | 4.58 | 4.1 GB | Good | Slightly smaller than Q4_K_M |
Q3_K_M | 3.74 | 3.3 GB | Fair | RAM < 6 GB |
IQ4_XS | 4.25 | 3.9 GB | Good | Smaller than Q4_K_S at similar quality |
Q2_K | 2.94 | 2.7 GB | Poor | Extreme space constraints only |
Chat Templates by Model Family
| Template | Models |
|---|---|
llama3 | Llama-3.x Instruct |
llama2 | Llama-2-chat |
chatml | Qwen, Yi, Phi-3, InternLM |
mistral | Mistral/Mixtral Instruct |
gemma | Gemma Instruct |
deepseek2 | DeepSeek-V2/V3 |
command-r | Command-R, Command-R+ |
phi3 | Phi-3 Mini/Medium |
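The template is normally taken from GGUF metadata; pass --chat-template only when that metadata is missing or wrong, e.g. for a ChatML model (model name is a placeholder):
./build/bin/llama-server -m qwen.gguf --chat-template chatml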
Performance Checklist
- Offload all layers to the GPU: -ngl 99
- Enable Flash Attention: -fa
- Set -t to the number of physical CPU cores
- Quantize the KV cache for long contexts: --cache-type-k q8_0 --cache-type-v q8_0
- Start from Q4_K_M and move up (Q5/Q6/Q8) or down (Q3) based on available RAM/VRAM
- Pin the model in memory with --mlock if RAM allows
- Benchmark changes with llama-bench before settling on a configuration
Useful Utility Commands
# Quantize a model
./build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Inspect a GGUF file (gguf-dump is installed by: pip install gguf)
gguf-dump model.gguf
# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 0,99 -t 4,8 -r 3
# Create importance matrix
./build/bin/llama-imatrix -m model-f16.gguf -f calibration.txt -o imatrix.dat
# Convert HuggingFace model
python convert_hf_to_gguf.py /hf-model-dir --outfile model.gguf --outtype f16
# Convert LoRA adapter
python convert_lora_to_gguf.py /peft-lora --base /hf-base --outfile lora.gguf
See Also
- Introduction — what llama.cpp is and why you need it
- Architecture — GGML, compute graph, backend system
- GGUF & Quantization — format details and quant types
- Installation & Build — building from source
- CLI Usage — complete llama-cli reference
- Server — complete llama-server and REST API reference
- Python Bindings — llama-cpp-python full guide
- GPU Acceleration — CUDA, Metal, Vulkan, ROCm
- Performance Tuning — systematic optimization guide
- Advanced Features — grammars, vision, LoRA, speculative decoding
- Troubleshooting — error reference