CLI Usage
2026-03-21
llama-cli is the command-line inference binary. It handles one-shot prompts, interactive chat sessions, prompt caching, and structured output — all without running a server.
Quick Reference
# Single prompt, print response, exit
./build/bin/llama-cli -m model.gguf -p "Explain what a tensor is"
# Interactive chat session (conversational)
./build/bin/llama-cli -m model.gguf -cnv
# Read prompt from file
./build/bin/llama-cli -m model.gguf -f prompt.txt
# GPU-accelerated inference
./build/bin/llama-cli -m model.gguf -p "Hello" --n-gpu-layers 99
Core Flags
Model and Input
| Flag | Short | Description |
|---|---|---|
--model | -m | Path to GGUF model file (required) |
--prompt | -p | Prompt string |
--file | -f | Read prompt from file |
--system-prompt | -sp | System prompt text |
--in-prefix | | String prepended to every user input |
--in-suffix | | String appended after every user input |
--reverse-prompt | -r | Stop and return control on this string |
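For instance, a one-shot run that reads its prompt from a file and halts as soon as the model produces a chosen marker might look like the following sketch (the file name and the stop string are placeholders):
# Prompt comes from prompt.txt; generation stops at the string "###"
./build/bin/llama-cli \
  -m model.gguf \
  -f prompt.txt \
  -r "###"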
Output Control
| Flag | Short | Description |
|---|---|---|
--n-predict | -n | Max tokens to generate (-1 = infinite) |
--no-display-prompt | | Suppress prompt echo in non-interactive mode |
--log-disable | | Disable progress/status output |
--verbose | -v | Verbose logging to stderr |
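Combined, these flags make it easy to capture only the completion. A minimal sketch (the prompt text and output file name are illustrative):
# Cap the response at 256 tokens, hide the prompt echo and log output,
# and redirect the bare completion to a file
./build/bin/llama-cli \
  -m model.gguf \
  -p "List three practical uses of hash tables:" \
  -n 256 \
  --no-display-prompt \
  --log-disable > answer.txt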
Context and Memory Flags
| Flag | Description |
|---|---|
--ctx-size N / -c N | Context window size (default: 4096 or model max) |
--rope-scaling TYPE | RoPE scaling type: none, linear, yarn |
--rope-freq-base N | RoPE base frequency (override for extended context) |
--rope-freq-scale N | RoPE frequency scale factor |
--mlock | Lock model in RAM (prevents swapping) |
--no-mmap | Disable memory mapping (load fully into RAM) |
--keep N | Number of initial tokens to keep on context shift |
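As an illustration (the sizes are arbitrary and should be tuned for your model and RAM), a long-document run with an enlarged context, the model locked in memory, and the first 256 tokens preserved on context shift might look like:
# 8192-token context, model locked in RAM, keep the first 256 tokens on context shift
./build/bin/llama-cli \
  -m model.gguf \
  -f long_document.txt \
  -c 8192 \
  --mlock \
  --keep 256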
Sampling Parameters
Sampling controls how the model chooses the next token:
| Flag | Default | Description |
|---|---|---|
--temp | 0.8 | Temperature: 0 = deterministic, >1 = creative |
--top-p | 0.9 | Nucleus sampling cutoff |
--top-k | 40 | Top-K sampling (0 = disabled) |
--min-p | 0.0 | Minimum probability cutoff |
--repeat-penalty | 1.0 | Penalize repeated tokens (1.1–1.3 typical) |
--repeat-last-n | 64 | Window for repeat penalty |
--mirostat | 0 | Mirostat mode: 0 (off), 1, or 2 |
--mirostat-lr | 0.1 | Mirostat learning rate (eta) |
--mirostat-ent | 5.0 | Mirostat target entropy (tau) |
--seed / -s | -1 | RNG seed (-1 = random) |
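For example, a focused, mostly deterministic run with a mild repeat penalty and a fixed seed for reproducibility (the values are chosen purely for illustration):
# Low temperature plus a fixed seed: focused, repeatable output
./build/bin/llama-cli \
  -m model.gguf \
  -p "Summarize the rules of chess in three sentences." \
  --temp 0.3 \
  --top-p 0.9 \
  --repeat-penalty 1.1 \
  --seed 42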
Temperature Guide
| Temperature | Behavior | Best For |
|---|---|---|
| 0.0 | Greedy (deterministic) | Classification, extraction |
| 0.1–0.4 | Focused, less varied | Code generation, factual Q&A |
| 0.7–0.9 | Balanced | Chat, general purpose |
| 1.0–1.2 | More creative | Brainstorming, creative writing |
| >1.5 | Unpredictable/random | Experimentation only |
Performance Flags
| Flag | Description |
|---|---|
--threads N / -t N | CPU inference threads (default: system count) |
--threads-batch N / -tb N | CPU threads for prompt processing |
--batch-size N / -b N | Logical batch size for prompt processing |
--ubatch-size N / -ub N | Physical micro-batch size |
--flash-attn / -fa | Enable Flash Attention (reduces KV memory) |
--n-gpu-layers N / -ngl N | Layers to offload to GPU |
--split-mode | Multi-GPU split: none, layer (default), row |
--tensor-split | Comma-separated GPU memory ratios |
--main-gpu N | Primary GPU index (default: 0) |
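A typical hybrid CPU/GPU invocation combines several of these flags. In this sketch the thread and layer counts are placeholders; tune them for your hardware:
# 8 CPU threads, 32 layers offloaded to the GPU, Flash Attention enabled
./build/bin/llama-cli \
  -m model.gguf \
  -p "Hello" \
  -t 8 \
  -ngl 32 \
  -fa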
Interactive Chat Mode
Interactive mode keeps the model loaded and accepts input repeatedly:
./build/bin/llama-cli \
-m model.gguf \
-cnv \
--chat-template llama3
Important interactive flags:
| Flag | Description |
|---|---|
-cnv | Conversation mode (applies chat template automatically) |
-i | Interactive mode (raw; you control prefixes) |
--interactive-first | Wait for user input before generating |
--chat-template NAME | Built-in chat template to use |
--multiline-input | Allow multi-line input (end with \) |
--color | Colorize output (user input vs model output) |
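As a sketch of raw mode (-i), where you manage the turn markers yourself instead of relying on a chat template, the prefix, suffix, and stop strings below are purely illustrative:
# Raw interactive mode: no chat template, you supply the turn markers
./build/bin/llama-cli \
  -m model.gguf \
  -i \
  --interactive-first \
  --color \
  --in-prefix "User: " \
  --in-suffix "Assistant: " \
  -r "User:"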
Built-in Chat Templates
| Template Name | Models |
|---|---|
llama2 | Llama-2-chat models |
llama3 | Llama-3 and 3.1 Instruct |
chatml | Qwen, Yi, phi-3, many others |
mistral | Mistral Instruct, Mixtral |
gemma | Gemma Instruct |
deepseek2 | DeepSeek-V2/V3 |
command-r | Cohere Command-R |
phi3 | Microsoft Phi-3 |
orion | OrionStar models |
zephyr | Zephyr models |
Most models include their chat template inside the GGUF file; llama.cpp reads it automatically. Use --chat-template only to override.
In-Session Commands
While in interactive mode, type these commands at the prompt:
| Command | Action |
|---|---|
/clear | Clear conversation history |
/save <filename> | Save conversation to file |
/load <filename> | Load saved conversation |
Ctrl+C | Interrupt current generation |
Ctrl+D | Exit session |
Custom System Prompt
./build/bin/llama-cli \
-m model.gguf \
-cnv \
--chat-template chatml \
-sp "You are a senior C++ engineer. Answer concisely and show code examples."Prompt Caching
Prompt caching saves the KV cache state after prefill so repeated or shared prompt prefixes skip re-evaluation:
# First run: computes and saves the cache
./build/bin/llama-cli \
-m model.gguf \
-f base_context.txt \
--prompt-cache cache.bin \
-n 200
# Subsequent runs: reuses the saved cache
./build/bin/llama-cli \
-m model.gguf \
-f base_context.txt \
--prompt-cache cache.bin \
--prompt-cache-all \
-p "Based on the above context, ..."| Flag | Description |
|---|---|
--prompt-cache FILE | Path to cache file |
--prompt-cache-ro | Read-only; don't update cache |
--prompt-cache-all | Save all tokens (not just the prefix) |
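Once a cache has been built, --prompt-cache-ro is handy when several runs share the same prefix and none of them should modify the cache file. A minimal sketch reusing the cache from the example above:
# Reuse cache.bin without writing back to it
./build/bin/llama-cli \
  -m model.gguf \
  -f base_context.txt \
  --prompt-cache cache.bin \
  --prompt-cache-ro \
  -n 200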
Stop Sequences
Stop sequences tell the model to stop generating when a specific string is produced:
# Stop at triple-backtick to capture just the code block
./build/bin/llama-cli \
-m model.gguf \
-p "Write a Python hello world function:
\`\`\`python" \
-r "\`\`\`"Multiple stop sequences: repeat --reverse-prompt / -r flags.
Typical Workflows
Code Generation
./build/bin/llama-cli \
-m codellama-7b-instruct.gguf \
-p "[INST] Write a Python function to parse a JSON file and return a list of dicts. [/INST]" \
--temp 0.2 \
--top-p 0.9 \
--repeat-penalty 1.1 \
-n 512
Document Summarization
./build/bin/llama-cli \
-m model.gguf \
-f long_document.txt \
-p "\n\nSummarize the above document in 5 bullet points:" \
--temp 0.3 \
-n 300
JSON Output
./build/bin/llama-cli \
-m model.gguf \
-p 'Extract person name and age from: "Alice is 34 years old". Respond in JSON:' \
--json-schema '{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}}}' \
--temp 0
See Also
- Server — HTTP API for multi-user / OpenAI-compatible access
- GPU Acceleration — --n-gpu-layers in depth
- Performance Tuning — threads, batch size, flash attention
- Advanced Features — grammars, vision models, LoRA
- Cheatsheet — all flags at a glance