Introduction to llama-swap
2026-03-21
llama-swap is a lightweight HTTP proxy server that sits in front of one or more llama-server instances and automatically swaps the active model based on the model field in each API request. It exposes a single OpenAI-compatible endpoint while transparently managing which model is loaded in memory at any given time.
The Problem It Solves
Running multiple LLMs locally creates a resource management problem. You might want:
- A coding model (DeepSeek-Coder, CodeLlama) for code generation
- A chat model (Llama-3.1-8B) for general conversation
- An embedding model (nomic-embed-text) for RAG pipelines
- A vision model (LLaVA) for image analysis
Loading all of them simultaneously is impractical. A 7B Q4_K_M model uses roughly 4–5 GB of VRAM, so four such models would need 16–20 GB, far exceeding what most developer GPUs can hold.
Without llama-swap, you'd need to:
- Manually restart `llama-server` with a different `--model` flag for each request
- Run one server per model and route requests yourself
- Use a full-featured LLM platform (Ollama, LM Studio), which adds complexity

llama-swap solves this with a minimal proxy that handles model switching automatically.
What llama-swap Does
```
Client (OpenAI SDK / curl / OpenWebUI)
        |
        v
llama-swap proxy                        ← single API endpoint
        |
        |--- on "model: llama-3.1-8b"   → starts llama-server with Llama-3.1-8B
        |--- on "model: deepseek-coder" → stops previous, starts DeepSeek-Coder
        |--- on "model: nomic-embed"    → starts (or reuses) nomic-embed-text server
        |
llama-server (llama.cpp)                ← actual inference
```

When a request arrives specifying `"model": "deepseek-coder"`, llama-swap:
- Checks if `deepseek-coder` is already running
- If not: stops the currently active model (freeing VRAM), then starts the `deepseek-coder` server
- Waits for the new server to be healthy
- Proxies the request to it and returns the response
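From the client's perspective, all of this is invisible: you just change the `model` field between requests. The sketch below uses the OpenAI Python SDK, but any OpenAI-compatible client behaves the same way; it assumes the proxy is listening on localhost:8080 and that `llama3` and `deepseek-coder` are aliases defined in your config.yaml (see the Quick Start below).

```python
from openai import OpenAI

# Point a standard OpenAI client at the llama-swap proxy.
# The api_key is unused by llama-swap but required by the SDK.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# First request: llama-swap starts (or reuses) the "llama3" server.
chat = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Summarize llama-swap in one sentence."}],
)
print(chat.choices[0].message.content)

# Second request: the model field changes, so llama-swap stops "llama3",
# starts "deepseek-coder", waits for it to become healthy, then proxies.
code = client.chat.completions.create(
    model="deepseek-coder",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(code.choices[0].message.content)
```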
Key Features
| Feature | Description |
|---|---|
| OpenAI-compatible API | Drop-in replacement for any OpenAI client (/v1/chat/completions, /v1/completions, /v1/embeddings) |
| Automatic model swapping | Zero manual intervention; the proxy swaps whenever a request names a model that isn't currently loaded |
| YAML configuration | Map model aliases to full llama-server start commands |
| Model groups | Keep multiple models resident simultaneously if VRAM allows |
| TTL unloading | Auto-unload idle models after a configurable timeout |
| Persistent models | Pin high-priority models so they are never swapped out |
| Health monitoring | Built-in /health endpoint reflecting upstream server health |
| Swagger UI | Interactive API docs at /swagger |
| Multimodal support | Proxy vision models (--mmproj) transparently |
| Minimal footprint | Single Go binary, no runtime dependencies |
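The health endpoint in the table above is handy for readiness checks before sending work through the proxy. A minimal sketch, assuming the proxy listens on localhost:8080 and that /health returns 200 when the upstream server is healthy:

```python
import urllib.request

# Assumes llama-swap is listening on localhost:8080 and exposes /health,
# as listed in the feature table above.
try:
    with urllib.request.urlopen("http://localhost:8080/health", timeout=5) as resp:
        print("llama-swap healthy:", resp.status == 200)
except OSError as exc:
    print("proxy unreachable:", exc)
```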
When to Use llama-swap
| Scenario | Use llama-swap? |
|---|---|
| Single model, always loaded | No — just run llama-server directly |
| Multiple models, limited VRAM | Yes |
| OpenWebUI or similar frontend needing multiple models | Yes |
| CI/automated pipeline switching models per task | Yes |
| Production multi-GPU inference (high QPS) | Prefer full platform (vLLM, TGI) |
| Just exploring llama.cpp | No |
Relationship to llama.cpp
llama-swap does not perform inference itself. It delegates entirely to llama-server from llama.cpp. You must have llama.cpp built and llama-server available. llama-swap manages the lifecycle (start/stop) of those server processes.
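Conceptually, that lifecycle management boils down to: stop the old server, start the new one, wait for it to become healthy, then proxy. The following is an illustrative Python sketch of that loop, not llama-swap's actual Go implementation; the command line, port, and health URL are placeholders (llama.cpp's llama-server does expose a /health endpoint).

```python
import subprocess
import time
import urllib.request

# Illustrative only: llama-swap is a Go binary, but the swap it performs
# amounts to "stop the old llama-server, start the new one, wait until it
# is healthy, then proxy the request". Paths and ports are placeholders.
current = None  # (alias, subprocess.Popen) of the running llama-server, if any

def ensure_model(alias, cmd, health_url):
    global current
    if current and current[0] == alias:
        return                      # requested model already running: reuse it
    if current:
        current[1].terminate()      # stop the previous model, freeing VRAM
        current[1].wait()
    proc = subprocess.Popen(cmd)    # start the requested llama-server
    while True:                     # wait for the upstream /health to report 200
        try:
            with urllib.request.urlopen(health_url, timeout=1) as resp:
                if resp.status == 200:
                    break
        except OSError:
            pass
        time.sleep(0.5)
    current = (alias, proc)

# Hypothetical usage:
# ensure_model(
#     "llama3",
#     ["/path/to/llama-server", "-m", "/models/llama-3.1-8b-q4_k_m.gguf",
#      "--port", "9001", "-ngl", "99"],
#     "http://localhost:9001/health",
# )
```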
```
llama-swap (Go, ~10 MB binary)  → manages
llama-server (C++, llama.cpp)   → performs inference
```

Quick Start (30 seconds)
```bash
# Install llama-swap (Linux/macOS)
curl -LO https://github.com/mostlygeek/llama-swap/releases/latest/download/llama-swap-linux-amd64
chmod +x llama-swap-linux-amd64

# Create a minimal config
cat > config.yaml << 'EOF'
models:
  "llama3":
    cmd: "/path/to/llama-server -m /models/llama-3.1-8b-q4_k_m.gguf --port {PORT} -ngl 99"
  "nomic-embed":
    cmd: "/path/to/llama-server -m /models/nomic-embed-text-q4_k_m.gguf --port {PORT} --embedding"
EOF

# Start the proxy
./llama-swap-linux-amd64 --config config.yaml --listen :8080
```
```bash
# Use it like any OpenAI server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello!"}]}'
```

See Also
- Installation — downloading, building, and running llama-swap
- Configuration — full config.yaml reference
- Model Management — groups, TTL, persistence
- API Reference — all endpoints in detail
- Integration — connecting OpenWebUI, LangChain, and more