Troubleshooting
This guide covers the most common issues encountered when running llama-swap, including startup failures, model load errors, proxy issues, and performance problems.
Diagnostic First Steps
# 1. Run with debug logging
llama-swap --config config.yaml --log-level debug
# 2. Check llama-swap health
curl http://localhost:8080/health
# 3. Check which models are running
curl http://localhost:8080/running
# 4. Verify your config parses correctly
python3 -c "import yaml; yaml.safe_load(open('config.yaml'))" && echo "YAML OK"
# 5. Test llama-server directly (outside llama-swap)
llama-server -m /path/to/model.gguf --port 8081 &
curl http://localhost:8081/health
Startup Issues
llama-swap Exits Immediately
Symptom: Binary exits right away, nothing on port 8080.
Check: Did it print an error?
llama-swap --config config.yaml 2>&1 | head -20
Common causes:
| Error Message | Cause | Fix |
|---|---|---|
| failed to parse config | YAML syntax error | Validate YAML (see below) |
| no such file or directory: config.yaml | Config file not found | Use --config /absolute/path/config.yaml |
| address already in use :8080 | Port conflict | Use --listen :8081 or stop the conflicting process |
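A quick way to rule out the last two causes at once is to point llama-swap at an absolute config path and a non-default port; the /etc/llama-swap path here is only an example:
# absolute config path plus an alternate listen port
llama-swap --config /etc/llama-swap/config.yaml --listen :8081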
YAML Parse Error
Error: failed to parse config: yaml: line 5: could not find expected ':'
Fix: Validate your YAML:
python3 -c "
import yaml, sys
try:
    yaml.safe_load(open('config.yaml'))
    print('YAML OK')
except yaml.YAMLError as e:
    print(f'ERROR: {e}')
    sys.exit(1)
"
Common YAML mistakes:
- Tab characters instead of spaces (YAML requires spaces)
- Missing quotes around model names with special characters
- Wrong indentation on the cmd: line
- Unescaped { or } in command strings (braces must be plain; only {PORT} is special)
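The first and last of these can be caught from the shell before reloading llama-swap; this is a rough check, not a full YAML validation:
# list lines that use tab characters (YAML indentation must be spaces)
grep -n "$(printf '\t')" config.yaml
# list lines with braces other than the {PORT} placeholder (only {PORT} is substituted)
grep -nE '\{[^}]*\}' config.yaml | grep -v '{PORT}'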
Port Already in Use
Error: listen tcp :8080: bind: address already in use
# Find what is using port 8080
lsof -i :8080
ss -tlnp | grep 8080
# Kill it or choose a different port
llama-swap --config config.yaml --listen :8081
Model Load Failures
Health Check Timeout
Error: upstream health check timed out after 30s for model "llama3"
Cause: llama-server did not become healthy within healthCheckTimeout seconds.
Fix options:
Increase timeout for large models:
healthCheckTimeout: 120  # 2 minutes for 70B models
Test llama-server directly to confirm it starts at all:
llama-server -m /models/model.gguf --port 8099 -ngl 99
Check logs (add logFile to the model config):
models:
  "llama3":
    cmd: "llama-server -m /models/model.gguf --port {PORT} -ngl 99"
    logFile: "/tmp/llama3-server.log"
Then:
tail -f /tmp/llama3-server.log
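If you are unsure what value healthCheckTimeout needs, time a manual start; this sketch reuses the model path and test port from the example above:
# measure how long llama-server takes to pass its own /health check
start=$(date +%s)
llama-server -m /models/model.gguf --port 8099 -ngl 99 &
srv=$!
until curl -sf http://localhost:8099/health > /dev/null; do sleep 1; done
echo "Ready after $(( $(date +%s) - start ))s - set healthCheckTimeout above this"
kill "$srv"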
llama-server Not Found
Error: exec: "llama-server": executable file not found in $PATH
Fix: Use the absolute path in cmd:
models:
  "llama3":
    cmd: "/home/user/llama.cpp/build/bin/llama-server -m /models/model.gguf --port {PORT}"
Or add the directory to PATH in the systemd service environment:
Environment=PATH=/home/user/llama.cpp/build/bin:/usr/local/bin:/usr/bin:/bin
Model File Not Found
# In llama-server logs:
error: failed to open model file '/models/model.gguf'
Fix: Verify the path is correct and the file is accessible by the user running llama-swap:
ls -la /models/model.gguf
# Run as the llama-swap service user:
sudo -u llama ls /models/model.gguf
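If the file exists but the service user cannot read it, adjusting group ownership is usually enough; the group name here assumes the service runs as a user called llama, as in the check above, and the parent directory must also be traversable by that user:
# grant the service user's group read access to the model file (group name is an assumption)
sudo chgrp llama /models/model.gguf
sudo chmod g+r /models/model.gguf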
{PORT} Missing from Command
Symptom: llama-server fails to start; you may see a port conflict, or the health check may fail because the server started on the wrong address.
Fix: Ensure {PORT} (all caps, with curly braces) appears exactly once in every cmd:
# Wrong
cmd: "llama-server -m model.gguf -ngl 99"
# Correct
cmd: "llama-server -m model.gguf --port {PORT} -ngl 99"Proxy / API Errors
HTTP 404: Model Not Found
{"error": "model 'gpt4' not found in config"}Cause: The client requested a model name that doesn't exist in your config.yaml.
Fix: Check the model alias in your config:
curl http://localhost:8080/v1/models | python3 -m json.tool | grep '"id"'
Ensure the client uses exactly the same string as the key in models:.
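For example, with the llama3 alias used in the config snippets above, a request must carry that exact string in the model field:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'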
HTTP 503: Upstream Not Ready
{"error": "upstream not ready: health check failed"}Cause: llama-swap started the model server but it didn't respond to /health within the timeout.
Steps:
- Increase healthCheckTimeout
- Check server logs (add logFile: to the model)
- Run the server command manually to check for errors
Requests Hang Indefinitely
Cause: A swap is in progress; a large model is taking a long time to load.
This is expected for cold starts of large models. The client is waiting for the model to become healthy.
Mitigation:
- Pre-warm models: run curl http://localhost:8080/upstream/llama3/load before sending inference requests (see the example below)
- Set a request timeout on the client side
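A deploy script can do the pre-warm and then confirm the model is resident; both endpoints appear elsewhere in this guide:
# pre-warm the model so the first user request is not a cold start
curl -s http://localhost:8080/upstream/llama3/load > /dev/null
# confirm it is now listed as running
curl http://localhost:8080/running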
Wrong Response (Seems Like a Different Model)
Symptom: You requested deepseek-coder but the response seems like a general chat model.
Cause: Model swap didn't happen; a cached process from a previous session responded.
Fix: Force-unload all models and re-request:
curl -X DELETE http://localhost:8080/upstream/old-model-name
GPU and Memory Issues
OOM When Starting Second Model in a Group
Symptom: Second model in a group fails to load with CUDA OOM.
Fix: Reduce --n-gpu-layers for group members so both fit in VRAM:
groups:
  "chat-pipeline":
    swap: true
    members: ["llama3", "nomic-embed"]
models:
  "llama3":
    # Reduced from 99 to 28 layers (partial GPU offload)
    cmd: "llama-server -m /models/llama-3.1-8b.gguf --port {PORT} -ngl 28"
  "nomic-embed":
    cmd: "llama-server -m /models/nomic-embed.gguf --port {PORT} --embedding -ngl 16"
Slow Model Switching
Symptom: Model swaps take much longer than expected.
Causes and fixes:
| Cause | Fix |
|---|---|
| Model on spinning HDD | Move models to SSD/NVMe |
| Very large model (70B+) | Use smaller quantization (Q3_K_M instead of Q5_K_M) |
| --mlock enabled | The OS had to page the previous model out; consider more RAM |
| GPU driver initialization | First swap is always slower due to CUDA context creation |
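To check the first cause, confirm the models directory is not on a rotational disk; the commands assume a Linux host with GNU coreutils and util-linux:
# find the device backing /models, then check its ROTA flag (1 = spinning HDD)
df --output=source /models | tail -1
lsblk -o NAME,ROTA,MOUNTPOINT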
llama-server Crashes Immediately (Segfault)
Cause: Usually a GPU driver/compatibility issue.
Debug:
# Run server manually with the exact command from config
llama-server -m /models/model.gguf --port 9999 -ngl 99
# Check exit code
echo "Exit: $?"If it segfaults, see the llama.cpp Troubleshooting guide for GPU-specific issues.
Connectivity Issues
OpenWebUI Can't Connect
Symptom: OpenWebUI shows "connection refused" or no models listed.
Fix checklist:
- llama-swap must listen on 0.0.0.0, not 127.0.0.1, if OpenWebUI is in Docker: llama-swap --config config.yaml --listen 0.0.0.0:8080
- Use the host's LAN IP in OpenWebUI settings, not localhost
- If both are in Docker Compose, use the container service name as the hostname (see the check below)
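To verify the network path from OpenWebUI's side, run the check from inside its container; the container name and LAN IP below are placeholders for your own values:
# from inside the OpenWebUI container, the model list should come back as JSON
docker exec open-webui curl -s http://192.168.1.10:8080/v1/models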
API Key Errors
Symptom: Client complains about invalid API key.
Note: llama-swap does not validate API keys. If you see auth errors, they may come from:
- A --api-key flag in your llama-server cmd: make sure you include the correct header in client requests (see the example below)
- The client library requiring a non-empty key: use any string like "not-needed"
Log Analysis
Enable debug logging and capture output:
llama-swap --config config.yaml --log-level debug 2>&1 | tee llama-swap.log
Key log lines to look for:
| Log Pattern | Meaning |
|---|---|
| starting model "llama3" | Swap initiated |
| model "llama3" is ready | Health check passed, ready to serve |
| stopping model "llama3" | Model being unloaded |
| health check failed | llama-server unhealthy |
| proxying request to :XXXXX | Request routed to backend |
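When a swap misbehaves, grepping the captured log for these patterns gives a quick timeline for one model; the model name matches the earlier examples:
# pull the lifecycle of one model out of the debug log
grep -E 'llama3|health check|proxying' llama-swap.log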
See Also
- Installation: PATH and binary location issues
- Configuration: YAML structure and the {PORT} placeholder
- Model Management: groups and VRAM planning
- llama.cpp Troubleshooting: fixing the underlying llama-server