CLI Usage
2026-03-21
llama-cli is the command-line inference binary. It handles one-shot prompts, interactive chat sessions, prompt caching, and structured output — all without running a server.
Quick Reference
# Single prompt, print response, exit
./build/bin/llama-cli -m model.gguf -p "Explain what a tensor is"
# Interactive chat session (conversational)
./build/bin/llama-cli -m model.gguf -cnv
# Read prompt from file
./build/bin/llama-cli -m model.gguf -f prompt.txt
# GPU-accelerated inference
./build/bin/llama-cli -m model.gguf -p "Hello" --n-gpu-layers 99
Core Flags
Model and Input
| Flag | Short | Description |
|---|---|---|
--model | -m | Path to GGUF model file (required) |
--prompt | -p | Prompt string |
--file | -f | Read prompt from file |
--system-prompt | -sp | System prompt text |
--in-prefix | | String prepended to every user input |
--in-suffix | | String appended after every user input |
--reverse-prompt | -r | Stop and return control on this string |
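For instance, a one-shot run that reads its prompt from a file and halts as soon as the model produces a chosen marker might look like the following sketch (the file name and the stop string are placeholders):
# Prompt comes from prompt.txt; generation stops at the string "###"
./build/bin/llama-cli \
  -m model.gguf \
  -f prompt.txt \
  -r "###"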
Output Control
| Flag | Short | Description |
|---|---|---|
--n-predict | -n | Max tokens to generate (-1 = infinite) |
--no-display-prompt | | Suppress prompt echo in non-interactive mode |
--log-disable | | Disable progress/status output |
--verbose | -v | Verbose logging to stderr |
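Combined, these flags make it easy to capture only the completion. A minimal sketch (the prompt text and output file name are illustrative):
# Cap the response at 256 tokens, hide the prompt echo and log output,
# and redirect the bare completion to a file
./build/bin/llama-cli \
  -m model.gguf \
  -p "List three practical uses of hash tables:" \
  -n 256 \
  --no-display-prompt \
  --log-disable > answer.txt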
Context and Memory Flags
| Flag | Description |
|---|---|
--ctx-size N / -c N | Context window size (default: 4096 or model max) |
--rope-scaling TYPE | RoPE scaling type: none, linear, yarn |
--rope-freq-base N | RoPE base frequency (override for extended context) |
--rope-freq-scale N | RoPE frequency scale factor |
--mlock | Lock model in RAM (prevents swapping) |
--no-mmap | Disable memory mapping (load fully into RAM) |
--keep N | Number of initial tokens to keep on context shift |
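As an illustration (the sizes are arbitrary and should be tuned for your model and RAM), a long-document run with an enlarged context, the model locked in memory, and the first 256 tokens preserved on context shift might look like:
# 8192-token context, model locked in RAM, keep the first 256 tokens on context shift
./build/bin/llama-cli \
  -m model.gguf \
  -f long_document.txt \
  -c 8192 \
  --mlock \
  --keep 256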
Sampling Parameters
Sampling controls how the model chooses the next token:
| Flag | Default | Description |
|---|---|---|
--temp | 0.8 | Temperature: 0 = deterministic, >1 = creative |
--top-p | 0.9 | Nucleus sampling cutoff |
--top-k | 40 | Top-K sampling (0 = disabled) |
--min-p | 0.0 | Minimum probability cutoff |
--repeat-penalty | 1.0 | Penalize repeated tokens (1.1–1.3 typical) |
--repeat-last-n | 64 | Window for repeat penalty |
--mirostat | 0 | Mirostat mode: 0 (off), 1, or 2 |
--mirostat-lr | 0.1 | Mirostat learning rate (eta) |
--mirostat-ent | 5.0 | Mirostat target entropy (tau) |
--seed / -s | -1 | RNG seed (-1 = random) |
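For example, a focused, mostly deterministic run with a mild repeat penalty and a fixed seed for reproducibility (the values are chosen purely for illustration):
# Low temperature plus a fixed seed: focused, repeatable output
./build/bin/llama-cli \
  -m model.gguf \
  -p "Summarize the rules of chess in three sentences." \
  --temp 0.3 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  --seed 42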
Temperature Guide
| Temperature | Behavior | Best For |
|---|---|---|
| 0.0 | Greedy (deterministic) | Classification, extraction |
| 0.1–0.4 | Focused, less varied | Code generation, factual Q&A |
| 0.7–0.9 | Balanced | Chat, general purpose |
| 1.0–1.2 | More creative | Brainstorming, creative writing |
| >1.5 | Unpredictable/random | Experimentation only |
Performance Flags
| Flag | Description |
|---|---|
--threads N / -t N | CPU inference threads (default: system count) |
--threads-batch N / -tb N | CPU threads for prompt processing |
--batch-size N / -b N | Logical batch size for prompt processing |
--ubatch-size N / -ub N | Physical micro-batch size |
--flash-attn / -fa | Enable Flash Attention (reduces KV memory) |
--n-gpu-layers N / -ngl N | Layers to offload to GPU |
--split-mode | Multi-GPU split: none, layer (default), row |
--tensor-split | Comma-separated GPU memory ratios |
--main-gpu N | Primary GPU index (default: 0) |
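A typical hybrid CPU/GPU invocation combines several of these flags. In this sketch the thread and layer counts are placeholders; tune them for your hardware:
# 8 CPU threads, 32 layers offloaded to the GPU, Flash Attention enabled
./build/bin/llama-cli \
  -m model.gguf \
  -p "Hello" \
  -t 8 \
  -ngl 32 \
  -fa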
Interactive Chat Mode
Interactive mode keeps the model loaded and accepts input repeatedly:
./build/bin/llama-cli \
-m model.gguf \
-cnv \
--chat-template llama3
Important interactive flags:
| Flag | Description |
|---|---|
-cnv | Conversation mode (applies chat template automatically) |
-i | Interactive mode (raw; you control prefixes) |
--interactive-first | Wait for user input before generating |
--chat-template NAME | Built-in chat template to use |
--multiline-input | Allow multi-line input (end with \) |
--color | Colorize output (user input vs model output) |
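As a sketch of raw mode (-i), where you manage the turn markers yourself instead of relying on a chat template, the prefix, suffix, and stop strings below are purely illustrative:
# Raw interactive mode: no chat template, you supply the turn markers
./build/bin/llama-cli \
  -m model.gguf \
  -i \
  --interactive-first \
  --color \
  --in-prefix "User: " \
  --in-suffix "Assistant: " \
  -r "User:"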
Built-in Chat Templates
| Template Name | Models |
|---|---|
llama2 | Llama-2-chat models |
llama3 | Llama-3 and 3.1 Instruct |
chatml | Qwen, Yi, phi-3, many others |
mistral | Mistral Instruct, Mixtral |
gemma | Gemma Instruct |
deepseek2 | DeepSeek-V2/V3 |
command-r | Cohere Command-R |
phi3 | Microsoft Phi-3 |
orion | OrionStar models |
zephyr | Zephyr models |
Most models include their chat template inside the GGUF file; llama.cpp reads it automatically. Use --chat-template only to override.
In-Session Commands
While in interactive mode, type these commands at the prompt:
| Command | Action |
|---|---|
/clear | Clear conversation history |
/save <filename> | Save conversation to file |
/load <filename> | Load saved conversation |
Ctrl+C | Interrupt current generation |
Ctrl+D | Exit session |
Custom System Prompt
./build/bin/llama-cli \
-m model.gguf \
-cnv \
--chat-template chatml \
-sp "You are a senior C++ engineer. Answer concisely and show code examples."Prompt Caching
Prompt caching saves the KV cache state after prefill so repeated or shared prompt prefixes skip re-evaluation:
# First run: computes and saves the cache
./build/bin/llama-cli \
-m model.gguf \
-f base_context.txt \
--prompt-cache cache.bin \
-n 200
# Subsequent runs: reuses the saved cache
./build/bin/llama-cli \
-m model.gguf \
-f base_context.txt \
--prompt-cache cache.bin \
--prompt-cache-all \
-p "Based on the above context, ..."| Flag | Description |
|---|---|
--prompt-cache FILE | Path to cache file |
--prompt-cache-ro | Read-only; don't update cache |
--prompt-cache-all | Save all tokens (not just the prefix) |
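Once a cache has been built, --prompt-cache-ro is handy when several runs share the same prefix and none of them should modify the cache file. A minimal sketch reusing the cache from the example above:
# Reuse cache.bin without writing back to it
./build/bin/llama-cli \
  -m model.gguf \
  -f base_context.txt \
  --prompt-cache cache.bin \
  --prompt-cache-ro \
  -n 200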
Stop Sequences
Stop sequences tell the model to stop generating when a specific string is produced:
# Stop at triple-backtick to capture just the code block
./build/bin/llama-cli \
-m model.gguf \
-p "Write a Python hello world function:
\`\`\`python" \
-r "\`\`\`"Multiple stop sequences: repeat --reverse-prompt / -r flags.
Typical Workflows
Code Generation
./build/bin/llama-cli \
-m codellama-7b-instruct.gguf \
-p "[INST] Write a Python function to parse a JSON file and return a list of dicts. [/INST]" \
--temp 0.2 \
--top-p 0.9 \
--repeat-penalty 1.1 \
-n 512
Document Summarization
./build/bin/llama-cli \
-m model.gguf \
-f long_document.txt \
-p "\n\nSummarize the above document in 5 bullet points:" \
--temp 0.3 \
-n 300
JSON Output
./build/bin/llama-cli \
-m model.gguf \
-p 'Extract person name and age from: "Alice is 34 years old". Respond in JSON:' \
--json-schema '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}}}' \
--temp 0
See Also
- Server — HTTP API for multi-user / OpenAI-compatible access
- GPU Acceleration — --n-gpu-layers in depth
- Performance Tuning — threads, batch size, flash attention
- Advanced Features — grammars, vision models, LoRA
- Cheatsheet — all flags at a glance