Integration Guide
2026-03-21
llama-swap presents a standard OpenAI-compatible API, which means any tool or library that works with OpenAI's API works with llama-swap — just change the base_url to point to your llama-swap instance.
OpenAI Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # required by the client library, ignored by llama-swap
)

# Chat
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain merge sort"}],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)

# Stream
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a Rust hello world"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Embeddings
embedding = client.embeddings.create(
    model="nomic-embed",
    input="Query text to embed"
)
vector = embedding.data[0].embedding
Switching Models in One Application
With llama-swap, the same client can seamlessly use different models:
def chat(model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512
    ).choices[0].message.content

# These run with automatic model swapping
code = chat("deepseek-coder", "Write a Python binary search function")
summary = chat("llama3", f"Explain this code: {code}")
vector = client.embeddings.create(model="nomic-embed", input=summary).data[0].embedding
OpenWebUI
OpenWebUI is a popular self-hosted chat interface. Connect it to llama-swap to expose all your local models through a single UI.
Setup
- Start llama-swap on http://localhost:8080
- In OpenWebUI: Settings → Connections → OpenAI API
- Set:
  - API Base URL: http://localhost:8080/v1
  - API Key: not-needed (any non-empty value)
- Save and refresh; all models from your llama-swap config appear in the model dropdown
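If the dropdown stays empty, a quick way to see what llama-swap is advertising is to query its /v1/models endpoint yourself. A minimal sketch with the OpenAI Python SDK (the port assumes the default setup above; the model IDs come from your config.yaml):

# List the model IDs llama-swap exposes; these are what OpenWebUI should show.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)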
Docker Compose Example
version: "3.8"
services:
llama-swap:
image: ghcr.io/mostlygeek/llama-swap:latest
volumes:
- ./config.yaml:/config.yaml
- /path/to/models:/models
ports:
- "8080:8080"
command: ["--config", "/config.yaml"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
openwebui:
image: ghcr.io/open-webui/open-webui:main
environment:
OPENAI_API_BASE_URL: "http://llama-swap:8080/v1"
OPENAI_API_KEY: "not-needed"
ports:
- "3000:8080"
depends_on:
- llama-swapLangChain
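Note that OPENAI_API_BASE_URL points at http://llama-swap:8080/v1 rather than localhost: both containers share the Compose network, so OpenWebUI reaches llama-swap by its service name.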
LangChain
pip install langchain-openai langchain-community faiss-cpu
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Chat model
chat = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="llama3",
    temperature=0.7
)

# Embeddings model
embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="nomic-embed"
)

# Simple chain
chain = ChatPromptTemplate.from_template("Explain {topic} briefly") | chat | StrOutputParser()
print(chain.invoke({"topic": "async/await"}))

# RAG pipeline — embeddings auto-switch to nomic-embed, chat to llama3
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough

texts = ["Python is interpreted", "Rust uses a borrow checker", "Go has goroutines"]
vectorstore = FAISS.from_texts(texts, embeddings)  # uses nomic-embed
retriever = vectorstore.as_retriever()
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_template("Context: {context}\nQuestion: {question}")
    | chat  # uses llama3
    | StrOutputParser()
)
print(rag_chain.invoke("What language has goroutines?"))
llama-index
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader

Settings.llm = LlamaOpenAI(
    model="llama3",
    api_base="http://localhost:8080/v1",
    api_key="not-needed"
)
Settings.embed_model = OpenAIEmbedding(
    model="nomic-embed",
    api_base="http://localhost:8080/v1",
    api_key="not-needed"
)

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)  # uses nomic-embed automatically
engine = index.as_query_engine()
print(engine.query("What is the main topic?"))  # uses llama3 automatically
Continue.dev (VS Code AI Extension)
Add llama-swap as a provider in ~/.continue/config.json:
{
  "models": [
    {
      "title": "Llama 3.1 8B",
      "provider": "openai",
      "model": "llama3",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    },
    {
      "title": "DeepSeek Coder",
      "provider": "openai",
      "model": "deepseek-coder",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "openai",
    "model": "deepseek-coder",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  },
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  }
}
Aider (AI Pair Programming)
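Aider routes model access through litellm, so the openai/ prefix in the command below selects the generic OpenAI-compatible provider; the name after the slash (deepseek-coder here) must match a model in your llama-swap config.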
pip install aider-chat
aider \
  --openai-api-base http://localhost:8080/v1 \
  --openai-api-key not-needed \
  --model openai/deepseek-coder
curl Examples
# Test model routing: chat with llama3
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'

# Switch to coding model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-coder","messages":[{"role":"user","content":"Write fizzbuzz in Python"}]}'

# Embed
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":"Hello world"}'
Ollama API Compatibility
llama-swap does not natively expose the Ollama API (/api/chat, /api/generate). If your tool requires the Ollama API format, use litellm as a translation layer:
pip install litellm
litellm --model openai/llama3 --api_base http://localhost:8080/v1
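If the application you're integrating already uses the litellm Python SDK rather than raw Ollama HTTP calls, it can talk to llama-swap directly without a proxy. A rough sketch (the openai/ prefix selects litellm's OpenAI-compatible provider; the model name comes from the examples above):

# litellm translates its unified completion() call into an OpenAI-format request,
# which llama-swap serves natively.
from litellm import completion

response = completion(
    model="openai/llama3",
    api_base="http://localhost:8080/v1",
    api_key="not-needed",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)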
Environment Variable Configuration
For applications that read API config from environment variables:
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
# Many tools (openai CLI, LangChain, etc.) auto-read these
openai chat.completions.create -m llama3 -q "Hello"
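One caveat: the OpenAI Python SDK v1+ reads OPENAI_BASE_URL rather than OPENAI_API_BASE (the variable LangChain and many older tools read), so exporting both is the safest default. A minimal sketch of a client configured purely from the environment:

import os

# Illustrative defaults; normally these come from the exports above.
# OPENAI_BASE_URL is the name the OpenAI SDK v1+ reads from the environment.
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:8080/v1")
os.environ.setdefault("OPENAI_API_KEY", "not-needed")

from openai import OpenAI

client = OpenAI()  # no arguments: base_url and api_key come from the environment
reply = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=20,
)
print(reply.choices[0].message.content)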
See Also
- API Reference — all endpoints and request/response formats
- Configuration — setting up models for use with these integrations
- Model Management — keeping embedding models persistent for low-latency RAG
- llama.cpp Python Bindings — alternative to llama-swap for single-process Python apps