Cheatsheet
2026-03-21
Quick reference for installation, config schema, all endpoints, and common recipes.
Installation
# Download binary (Linux x86_64)
curl -LO https://github.com/mostlygeek/llama-swap/releases/latest/download/llama-swap-linux-amd64
chmod +x llama-swap-linux-amd64 && sudo mv llama-swap-linux-amd64 /usr/local/bin/llama-swap
# Build from source (requires Go 1.21+)
git clone https://github.com/mostlygeek/llama-swap && cd llama-swap
go build -o llama-swap ./cmd/llama-swap
# Docker
docker run -d --name llama-swap -p 8080:8080 \
-v ./config.yaml:/config.yaml -v /models:/models \
ghcr.io/mostlygeek/llama-swap:latest --config /config.yaml
Launch Command
llama-swap [flags]
Flags:
--config FILE Config file path (default: config.yaml)
--listen ADDR Listen address (default: :8080)
--log-level LEVEL debug | info | warn | error (default: info)
--version Print version
Minimal config.yaml
models:
  "llama3":
    cmd: "llama-server -m /models/llama-3.1-8b-q4_k_m.gguf --port {PORT} -ngl 99"
Full config.yaml Schema
# Global settings
healthCheckTimeout: 30    # seconds to wait for server startup
healthCheckInterval: 500  # ms between health polls
modelTTL: 0               # global idle TTL (0 = never unload)

models:
  "alias":
    cmd: "llama-server -m /path/to/model.gguf --port {PORT} [flags]"
    persist: false              # never auto-unload if true
    ttl: 0                      # per-model TTL override (0 = use global)
    logFile: "/tmp/alias.log"   # redirect server stdout/stderr

groups:
  "group-name":
    swap: true                  # swap group atomically
    members:
      - "alias-1"
      - "alias-2"
Common Model Configs
models:
  # General chat (GPU, flash attention, parallel slots)
  "llama3":
    cmd: >
      llama-server -m /models/llama-3.1-8b-instruct-q4_k_m.gguf
      --port {PORT} -ngl 99 --flash-attn -c 8192 -np 2 --cont-batching

  # Coding model
  "deepseek-coder":
    cmd: >
      llama-server -m /models/deepseek-coder-v2-lite-q4_k_m.gguf
      --port {PORT} -ngl 99 --flash-attn -c 16384

  # Embedding model (always loaded)
  "nomic-embed":
    cmd: "llama-server -m /models/nomic-embed-text-q4_k_m.gguf --port {PORT} --embedding -ngl 99"
    persist: true

  # Vision model (auto-unload after 2 min)
  "llava":
    cmd: >
      llama-server -m /models/llava-v1.6-q4_k_m.gguf
      --mmproj /models/llava-v1.6-mmproj-f16.gguf
      --port {PORT} -ngl 99
    ttl: 120

  # Small fast model (never unload)
  "llama3-3b":
    cmd: "llama-server -m /models/llama-3.2-3b-q4_k_m.gguf --port {PORT} -ngl 99"
    persist: true

# Run chat + embed together
groups:
  "rag-pipeline":
    swap: true
    members: ["llama3", "nomic-embed"]
API Endpoints Quick Reference
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Overall health + running models |
| GET | /running | List of currently running model names |
| GET | /v1/models | List all configured models |
| POST | /v1/chat/completions | OpenAI chat API (triggers swap) |
| POST | /v1/completions | OpenAI text completions |
| POST | /v1/embeddings | Embeddings (model must have --embedding) |
| DELETE | /upstream/{model} | Force-unload a running model |
| GET | /upstream/{model}/load | Force-load a model |
| GET | /swagger | Interactive API docs |
curl Recipes
# Health check
curl http://localhost:8080/health
# List models
curl http://localhost:8080/v1/models | python3 -m json.tool
# Chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'
# Stream chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3","messages":[{"role":"user","content":"Count to 5"}],"stream":true}'
# Embed
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model":"nomic-embed","input":"text to embed"}'
# Force unload
curl -X DELETE http://localhost:8080/upstream/llama3
# Pre-warm (load without inference)
curl http://localhost:8080/upstream/llama3/load
Python Quick Start
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Chat
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)

# Embed
vec = client.embeddings.create(model="nomic-embed", input="hello").data[0].embedding

# Stream
for chunk in client.chat.completions.create(model="llama3", messages=[...], stream=True):
    print(chunk.choices[0].delta.content or "", end="")
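The /upstream and /running management endpoints are plain HTTP rather than part of the OpenAI SDK. A minimal sketch using the requests package (an assumption; any HTTP client works), mirroring the curl recipes above:
import requests

BASE = "http://localhost:8080"

# Pre-warm a model before sending traffic (GET /upstream/{model}/load)
requests.get(f"{BASE}/upstream/llama3/load", timeout=300)

# List currently running models (GET /running); response format may vary by version
print(requests.get(f"{BASE}/running", timeout=10).text)

# Force-unload to free VRAM (DELETE /upstream/{model})
requests.delete(f"{BASE}/upstream/llama3", timeout=30)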
Integration Summary
| Tool | Config |
|---|---|
| OpenAI SDK | base_url="http://localhost:8080/v1", api_key="not-needed" |
| LangChain | ChatOpenAI(base_url="...", model="alias") |
| OpenWebUI | Settings → Connections → API Base URL: http://localhost:8080/v1 |
| Continue.dev | "apiBase": "http://localhost:8080/v1" in config.json |
| llama-index | api_base="http://localhost:8080/v1" |
| env vars | OPENAI_API_BASE=http://localhost:8080/v1 |
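For the LangChain row above, a minimal sketch (assumes the langchain-openai package is installed; the model name must match a configured alias):
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="llama3",  # must match an alias from config.yaml
)
print(llm.invoke("Hello").content)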
OAI Alias Trick (Drop-in Replacement)
Name your models to match OpenAI model names for fully transparent switching:
models:
  "gpt-4o":
    cmd: "llama-server -m /models/llama-3.1-70b-q4_k_m.gguf --port {PORT} -ngl 99"
  "gpt-4o-mini":
    cmd: "llama-server -m /models/llama-3.2-3b-q4_k_m.gguf --port {PORT} -ngl 99"
  "text-embedding-3-small":
    cmd: "llama-server -m /models/nomic-embed-text-q4_k_m.gguf --port {PORT} --embedding"
    persist: true
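With these aliases, existing OpenAI SDK code works unchanged once its base URL points at llama-swap. A minimal sketch:
from openai import OpenAI

# Only the base URL (and a dummy key) change; model names stay the OpenAI ones.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # served locally by llama-3.2-3b per the config above
    messages=[{"role": "user", "content": "Hello"}],
)
emb = client.embeddings.create(model="text-embedding-3-small", input="hello")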
Quick Troubleshooting
| Symptom | Fix |
|---|---|
| binary not found | Use absolute path in cmd: |
| address in use | Change --listen port or kill existing process |
| YAML parse error | Validate with python3 -c "import yaml; yaml.safe_load(open('config.yaml'))" |
| health check timeout | Increase healthCheckTimeout: / add logFile: to debug |
| model not found | Check alias matches exactly what client sends |
| OOM in group | Reduce -ngl for each group member |
| {PORT} not substituted | Ensure {PORT} (uppercase, with braces) is in cmd: |
See Also
- Introduction — architecture and use cases
- Installation — download and systemd service
- Configuration — full config.yaml reference
- Model Management — TTL, groups, persistence
- API Reference — all endpoints in detail
- Integration — OpenWebUI, LangChain, Continue.dev
- Troubleshooting — error reference
- llama.cpp Introduction — the underlying inference engine