Troubleshooting
This guide covers the most common issues encountered when running llama-swap, including startup failures, model load errors, proxy issues, and performance problems.
Diagnostic First Steps
# 1. Run with debug logging
llama-swap --config config.yaml --log-level debug
# 2. Check llama-swap health
curl http://localhost:8080/health
# 3. Check which models are running
curl http://localhost:8080/running
# 4. Verify your config parses correctly
python3 -c "import yaml; yaml.safe_load(open('config.yaml'))" && echo "YAML OK"
# 5. Test llama-server directly (outside llama-swap)
llama-server -m /path/to/model.gguf --port 8081 &
curl http://localhost:8081/health
Startup Issues
llama-swap Exits Immediately
Symptom: Binary exits right away, nothing on port 8080.
Check: Did it print an error?
llama-swap --config config.yaml 2>&1 | head -20
Common causes:
| Error Message | Cause | Fix |
|---|---|---|
| failed to parse config | YAML syntax error | Validate YAML (see below) |
| no such file or directory: config.yaml | Config file not found | Use --config /absolute/path/config.yaml |
| address already in use :8080 | Port conflict | Use --listen :8081 or stop the conflicting process |
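A quick way to rule out the last two causes at once is to point llama-swap at an absolute config path and a non-default port; the /etc/llama-swap path here is only an example:
# absolute config path plus an alternate listen port
llama-swap --config /etc/llama-swap/config.yaml --listen :8081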
YAML Parse Error
Error: failed to parse config: yaml: line 5: could not find expected ':'
Fix: Validate your YAML:
python3 -c "
import yaml, sys
try:
    yaml.safe_load(open('config.yaml'))
    print('YAML OK')
except yaml.YAMLError as e:
    print(f'ERROR: {e}')
    sys.exit(1)
"
Common YAML mistakes:
- Tab characters instead of spaces (YAML requires spaces)
- Missing quotes around model names with special characters
- Wrong indentation on the cmd: line
- Unescaped { or } in command strings (braces must be plain; only {PORT} is special)
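The first and last of these can be caught from the shell before reloading llama-swap; this is a rough check, not a full YAML validation:
# list lines that use tab characters (YAML indentation must be spaces)
grep -n "$(printf '\t')" config.yaml
# list lines with braces other than the {PORT} placeholder (only {PORT} is substituted)
grep -nE '\{[^}]*\}' config.yaml | grep -v '{PORT}'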
Port Already in Use
Error: listen tcp :8080: bind: address already in use
# Find what is using port 8080
lsof -i :8080
ss -tlnp | grep 8080
# Kill it or choose a different port
llama-swap --config config.yaml --listen :8081
Model Load Failures
Health Check Timeout
Error: upstream health check timed out after 30s for model "llama3"
Cause: llama-server did not become healthy within healthCheckTimeout seconds.
Fix options:
Increase timeout for large models:
healthCheckTimeout: 120  # 2 minutes for 70B models
Test llama-server directly to confirm it starts at all:
llama-server -m /models/model.gguf --port 8099 -ngl 99
Check logs (add logFile to the model config):
models:
  "llama3":
    cmd: "llama-server -m /models/model.gguf --port {PORT} -ngl 99"
    logFile: "/tmp/llama3-server.log"
Then:
tail -f /tmp/llama3-server.log
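If you are unsure what value healthCheckTimeout needs, time a manual start; this sketch reuses the model path and test port from the example above:
# measure how long llama-server takes to pass its own /health check
start=$(date +%s)
llama-server -m /models/model.gguf --port 8099 -ngl 99 &
srv=$!
until curl -sf http://localhost:8099/health > /dev/null; do sleep 1; done
echo "Ready after $(( $(date +%s) - start ))s - set healthCheckTimeout above this"
kill "$srv"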
llama-server Not Found
Error: exec: "llama-server": executable file not found in $PATH
Fix: Use the absolute path in cmd:
models:
  "llama3":
    cmd: "/home/user/llama.cpp/build/bin/llama-server -m /models/model.gguf --port {PORT}"
Or add the directory to PATH in the systemd service environment:
Environment=PATH=/home/user/llama.cpp/build/bin:/usr/local/bin:/usr/bin:/bin
Model File Not Found
# In llama-server logs:
error: failed to open model file '/models/model.gguf'
Fix: Verify the path is correct and the file is accessible by the user running llama-swap:
ls -la /models/model.gguf
# Run as the llama-swap service user:
sudo -u llama ls /models/model.gguf
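If the file exists but the service user cannot read it, adjusting group ownership is usually enough; the group name here assumes the service runs as a user called llama, as in the check above, and the parent directory must also be traversable by that user:
# grant the service user's group read access to the model file (group name is an assumption)
sudo chgrp llama /models/model.gguf
sudo chmod g+r /models/model.gguf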
{PORT} Missing from Command
Symptom: llama-server fails to start; you may see a port conflict, or the health check may fail because the server started on the wrong address.
Fix: Ensure {PORT} (all caps, with curly braces) appears exactly once in every cmd:
# Wrong
cmd: "llama-server -m model.gguf -ngl 99"
# Correct
cmd: "llama-server -m model.gguf --port {PORT} -ngl 99"Proxy / API Errors
HTTP 404: Model Not Found
{"error": "model 'gpt4' not found in config"}Cause: The client requested a model name that doesn't exist in your config.yaml.
Fix: Check the model alias in your config:
curl http://localhost:8080/v1/models | python3 -m json.tool | grep '"id"'
Ensure the client uses exactly the same string as the key in models:.
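For example, with the llama3 alias used in the config snippets above, a request must carry that exact string in the model field:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'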
HTTP 503: Upstream Not Ready
{"error": "upstream not ready: health check failed"}Cause: llama-swap started the model server but it didn't respond to /health within the timeout.
Steps:
- Increase healthCheckTimeout
- Check server logs (add logFile: to the model)
- Run the server command manually to check for errors
Requests Hang Indefinitely
Cause: A swap is in progress; a large model is taking a long time to load.
This is expected for cold starts of large models. The client is waiting for the model to become healthy.
Mitigation:
- Pre-warm models: run curl http://localhost:8080/upstream/llama3/load before sending inference requests (see the example below)
- Set a request timeout on the client side
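A deploy script can do the pre-warm and then confirm the model is resident; both endpoints appear elsewhere in this guide:
# pre-warm the model so the first user request is not a cold start
curl -s http://localhost:8080/upstream/llama3/load > /dev/null
# confirm it is now listed as running
curl http://localhost:8080/running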
Wrong Response (Seems Like a Different Model)
Symptom: You requested deepseek-coder but the response seems like a general chat model.
Cause: Model swap didn't happen; a cached process from a previous session responded.
Fix: Force-unload all models and re-request:
curl -X DELETE http://localhost:8080/upstream/old-model-name
GPU and Memory Issues
OOM When Starting Second Model in a Group
Symptom: Second model in a group fails to load with CUDA OOM.
Fix: Reduce --n-gpu-layers for group members so both fit in VRAM:
groups:
  "chat-pipeline":
    swap: true
    members: ["llama3", "nomic-embed"]
models:
  "llama3":
    # Reduced from 99 to 28 layers (partial GPU offload)
    cmd: "llama-server -m /models/llama-3.1-8b.gguf --port {PORT} -ngl 28"
  "nomic-embed":
    cmd: "llama-server -m /models/nomic-embed.gguf --port {PORT} --embedding -ngl 16"
Slow Model Switching
Symptom: Model swaps take much longer than expected.
Causes and fixes:
| Cause | Fix |
|---|---|
| Model on spinning HDD | Move models to SSD/NVMe |
| Very large model (70B+) | Use smaller quantization (Q3_K_M instead of Q5_K_M) |
| --mlock enabled | The OS had to page the previous model out; consider more RAM |
| GPU driver initialization | First swap is always slower due to CUDA context creation |
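To check the first cause, confirm the models directory is not on a rotational disk; the commands assume a Linux host with GNU coreutils and util-linux:
# find the device backing /models, then check its ROTA flag (1 = spinning HDD)
df --output=source /models | tail -1
lsblk -o NAME,ROTA,MOUNTPOINT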
llama-server Crashes Immediately (Segfault)
Cause: Usually a GPU driver/compatibility issue.
Debug:
# Run server manually with the exact command from config
llama-server -m /models/model.gguf --port 9999 -ngl 99
# Check exit code
echo "Exit: $?"If it segfaults, see the llama.cpp Troubleshooting guide for GPU-specific issues.
Connectivity Issues
OpenWebUI Can't Connect
Symptom: OpenWebUI shows "connection refused" or no models listed.
Fix checklist:
- llama-swap must listen on 0.0.0.0, not 127.0.0.1, if OpenWebUI is in Docker: llama-swap --config config.yaml --listen 0.0.0.0:8080
- Use the host's LAN IP in OpenWebUI settings, not localhost
- If both are in Docker Compose, use the container service name as the hostname (see the check below)
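To verify the network path from OpenWebUI's side, run the check from inside its container; the container name and LAN IP below are placeholders for your own values:
# from inside the OpenWebUI container, the model list should come back as JSON
docker exec open-webui curl -s http://192.168.1.10:8080/v1/models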
API Key Errors
Symptom: Client complains about invalid API key.
Note: llama-swap does not validate API keys. If you see auth errors, they may come from:
- A --api-key flag in your llama-server cmd: make sure you include the correct header in client requests (see the example below)
- The client library requiring a non-empty key: use any string like "not-needed"
Log Analysis
Enable debug logging and capture output:
llama-swap --config config.yaml --log-level debug 2>&1 | tee llama-swap.log
Key log lines to look for:
| Log Pattern | Meaning |
|---|---|
| starting model "llama3" | Swap initiated |
| model "llama3" is ready | Health check passed, ready to serve |
| stopping model "llama3" | Model being unloaded |
| health check failed | llama-server unhealthy |
| proxying request to :XXXXX | Request routed to backend |
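When a swap misbehaves, grepping the captured log for these patterns gives a quick timeline for one model; the model name matches the earlier examples:
# pull the lifecycle of one model out of the debug log
grep -E 'llama3|health check|proxying' llama-swap.log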
See Also
- Installation: PATH and binary location issues
- Configuration: YAML structure and the {PORT} placeholder
- Model Management: groups and VRAM planning
- llama.cpp Troubleshooting: fixing the underlying llama-server