API Reference
llama-swap exposes an OpenAI-compatible REST API alongside additional management endpoints. All inference endpoints proxy directly to the underlying llama-server instance.
Base URL
Default: http://localhost:8080
All OpenAI-compatible endpoints are under /v1/. llama-swap management endpoints are at the root.
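Most OpenAI SDKs can be pointed at llama-swap by overriding the base URL; for example, the official Python client reads these environment variables (see Integration for client-specific setup):
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="none"   # many clients require a non-empty key even when the server does not check it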
OpenAI-Compatible Endpoints
GET /v1/models
List all models defined in config.yaml.
curl http://localhost:8080/v1/models
Response:
{
"object": "list",
"data": [
{"id": "llama3", "object": "model", "created": 1714000000, "owned_by": "llama-swap"},
{"id": "deepseek-coder", "object": "model", "created": 1714000000, "owned_by": "llama-swap"},
{"id": "nomic-embed", "object": "model", "created": 1714000000, "owned_by": "llama-swap"}
]
}
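To script against this list, the model IDs can be extracted with jq (assuming jq is installed):
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'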
POST /v1/chat/completions
OpenAI-compatible chat completions. Triggers a model swap if the requested model is not loaded.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what a mutex is."}
],
"temperature": 0.7,
"max_tokens": 512
}'
Streaming:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Count to 5"}],
"stream": true
}'
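When testing streaming from a terminal, curl's --no-buffer (-N) flag prints the SSE chunks as they arrive instead of in bursts; the stream ends with a data: [DONE] line in the usual OpenAI-compatible format:
curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"llama3","messages":[{"role":"user","content":"Count to 5"}],"stream":true}'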
POST /v1/completions
OpenAI-compatible text completions.
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-coder",
"prompt": "def binary_search(arr, target):",
"max_tokens": 256,
"temperature": 0.1,
"stop": ["\n\n"]
}'
POST /v1/embeddings
Generate embedding vectors. The target model must have been started with --embedding.
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed",
"input": "The quick brown fox jumps over the lazy dog"
}'
Batch embeddings:
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed",
"input": [
"First sentence to embed",
"Second sentence to embed",
"Third sentence to embed"
]
}'
Response:
{
"object": "list",
"data": [
{"object": "embedding", "index": 0, "embedding": [0.012, -0.034, ...]},
{"object": "embedding", "index": 1, "embedding": [0.056, -0.078, ...]},
{"object": "embedding", "index": 2, "embedding": [0.090, -0.123, ...]}
],
"model": "nomic-embed",
"usage": {"prompt_tokens": 18, "total_tokens": 18}
}
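A quick sanity check on the returned vectors, for example to confirm the embedding dimension, can be done with jq:
curl -s http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model":"nomic-embed","input":"dimension check"}' \
| jq '.data[0].embedding | length'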
Management Endpoints
GET /health
Returns the overall health of llama-swap and currently running model servers.
curl http://localhost:8080/health
Response when a model is loaded:
{
"status": "ok",
"models": {
"llama3": {"status": "running", "pid": 12345},
"nomic-embed": {"status": "running", "pid": 12346}
}
}
Response when no model is loaded:
{"status": "ok", "models": {}}GET /running
GET /running
Returns the list of currently running model names.
curl http://localhost:8080/running
Response:
{"running": ["llama3", "nomic-embed"]}Empty:
{"running": []}DELETE /upstream/
DELETE /upstream/{model}
Force-unload a running model. Sends SIGTERM to the underlying llama-server process and frees its resources.
curl -X DELETE http://localhost:8080/upstream/llama3
Response:
{"status": "ok", "message": "llama3 stopped"}If the model is not running:
{"status": "error", "message": "llama3 is not running"}GET /upstream/{model}/load
Force-load a model without sending an inference request. Useful for pre-warming a model before expected usage.
curl http://localhost:8080/upstream/llama3/load
Response after successful startup:
{"status": "ok", "message": "llama3 is ready"}GET /swagger
GET /swagger
Interactive Swagger UI documenting all endpoints. Accessible in a browser at:
http://localhost:8080/swagger
GET /metrics (if enabled)
Prometheus-compatible metrics endpoint (not available in all builds).
curl http://localhost:8080/metrics
Native llama-server Passthrough
llama-swap transparently proxies all requests to the underlying llama-server. This means native llama-server endpoints also work:
# Native completion endpoint (llama-server specific)
curl http://localhost:8080/completion \
-H "Content-Type: application/json" \
-d '{"model":"llama3","prompt":"What is Rust?","n_predict":200,"grammar":""}'
# Check slot status
curl http://localhost:8080/slots
The model field in the JSON body (or the request URL, depending on the endpoint) is used to route to the correct server.
Authentication
If llama-server was started with --api-key, include the key in requests:
# If your config uses: --api-key mysecretkey
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer mysecretkey" \
-H "Content-Type: application/json" \
-d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}]}'llama-swap itself does not implement authentication — auth is handled by the upstream llama-server.
Error Responses
| HTTP Status | Cause | Example |
|---|---|---|
| 404 | Model not found in config | {"error":"model 'foo' not found"} |
| 503 | Model failed to start or health check timed out | {"error":"upstream not ready"} |
| 500 | llama-server returned an error | Forwarded from upstream |
Handling Model-Not-Found
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"nonexistent","messages":[{"role":"user","content":"test"}]}'
# Response:
# HTTP 404
# {"error": "model 'nonexistent' not found in config"}Handling Cold-Start Timeout
If healthCheckTimeout is exceeded (model takes too long to start):
HTTP 503
{"error": "upstream health check timed out after 30s"}Fix: Increase healthCheckTimeout in config for large models.
Request Flow Diagram
curl → llama-swap
│
├─ Parse "model" from request body
├─ Check if model is running
│ │
│ NOT RUNNING
│ │
│ ┌────┴──────────────────┐
│ │ Swap operation │
│ │ 1. Stop old model(s) │
│ │ 2. Start new model │
│ │ 3. Poll /health │
│ └────┬──────────────────┘
│ │
└─ Proxy request to upstream llama-server
│
Response ──→ curl
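The cost of a swap is easy to observe from the shell: the first request to a cold model pays the startup time, while an immediate repeat does not. A small illustration:
# first call triggers the swap (slow), second hits the already-running server
for i in 1 2; do
  time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"hi"}],"max_tokens":8}' > /dev/null
done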
See Also
- Integration — using the API with OpenAI clients, OpenWebUI, LangChain
- Model Management — force-loading, TTL, and group lifecycle
- Configuration — healthCheckTimeout and other timing params
- Troubleshooting — 503, 404, and model startup errors