Integration Guide
2026-03-21
llama-swap presents a standard OpenAI-compatible API, which means any tool or library that works with OpenAI's API works with llama-swap — just change the base_url to point to your llama-swap instance.
OpenAI Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # required by the client library, ignored by llama-swap
)

# Chat
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Explain merge sort"}],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)

# Stream
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a Rust hello world"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Embeddings
embedding = client.embeddings.create(
    model="nomic-embed",
    input="Query text to embed"
)
vector = embedding.data[0].embedding
Switching Models in One Application
With llama-swap, the same client can seamlessly use different models:
def chat(model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512
    ).choices[0].message.content

# These run with automatic model swapping
code = chat("deepseek-coder", "Write a Python binary search function")
summary = chat("llama3", f"Explain this code: {code}")
vector = client.embeddings.create(model="nomic-embed", input=summary).data[0].embedding
OpenWebUI
OpenWebUI is a popular self-hosted chat interface. Connect it to llama-swap to expose all your local models through a single UI.
Setup
- Start llama-swap on http://localhost:8080
- In OpenWebUI: Settings → Connections → OpenAI API
- Set:
  - API Base URL: http://localhost:8080/v1
  - API Key: not-needed (any non-empty value)
- Save and refresh; all models from your llama-swap config appear in the model dropdown
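If the dropdown stays empty, a quick way to see what llama-swap is advertising is to query its /v1/models endpoint yourself. A minimal sketch with the OpenAI Python SDK (the port assumes the default setup above; the model IDs come from your config.yaml):

# List the model IDs llama-swap exposes; these are what OpenWebUI should show.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)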
Docker Compose Example
version: "3.8"
services:
llama-swap:
image: ghcr.io/mostlygeek/llama-swap:latest
volumes:
- ./config.yaml:/config.yaml
- /path/to/models:/models
ports:
- "8080:8080"
command: ["--config", "/config.yaml"]
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
openwebui:
image: ghcr.io/open-webui/open-webui:main
environment:
OPENAI_API_BASE_URL: "http://llama-swap:8080/v1"
OPENAI_API_KEY: "not-needed"
ports:
- "3000:8080"
depends_on:
- llama-swapLangChain
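Note that OPENAI_API_BASE_URL points at http://llama-swap:8080/v1 rather than localhost: both containers share the Compose network, so OpenWebUI reaches llama-swap by its service name.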
LangChain
pip install langchain-openai langchain-community faiss-cpu
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Chat model
chat = ChatOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="llama3",
    temperature=0.7
)

# Embeddings model
embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
    model="nomic-embed"
)

# Simple chain
chain = ChatPromptTemplate.from_template("Explain {topic} briefly") | chat | StrOutputParser()
print(chain.invoke({"topic": "async/await"}))

# RAG pipeline — embeddings auto-switch to nomic-embed, chat to llama3
from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough

texts = ["Python is interpreted", "Rust uses a borrow checker", "Go has goroutines"]
vectorstore = FAISS.from_texts(texts, embeddings)  # uses nomic-embed
retriever = vectorstore.as_retriever()
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_template("Context: {context}\nQuestion: {question}")
    | chat  # uses llama3
    | StrOutputParser()
)
print(rag_chain.invoke("What language has goroutines?"))
llama-index
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader

Settings.llm = LlamaOpenAI(
    model="llama3",
    api_base="http://localhost:8080/v1",
    api_key="not-needed"
)
Settings.embed_model = OpenAIEmbedding(
    model="nomic-embed",
    api_base="http://localhost:8080/v1",
    api_key="not-needed"
)

docs = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs)  # uses nomic-embed automatically
engine = index.as_query_engine()
print(engine.query("What is the main topic?"))  # uses llama3 automatically
Continue.dev (VS Code AI Extension)
Add llama-swap as a provider in ~/.continue/config.json:
{
  "models": [
    {
      "title": "Llama 3.1 8B",
      "provider": "openai",
      "model": "llama3",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    },
    {
      "title": "DeepSeek Coder",
      "provider": "openai",
      "model": "deepseek-coder",
      "apiBase": "http://localhost:8080/v1",
      "apiKey": "not-needed"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder",
    "provider": "openai",
    "model": "deepseek-coder",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  },
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  }
}
Aider (AI Pair Programming)
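Aider routes model access through litellm, so the openai/ prefix in the command below selects the generic OpenAI-compatible provider; the name after the slash (deepseek-coder here) must match a model in your llama-swap config.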
pip install aider-chat
aider \
  --openai-api-base http://localhost:8080/v1 \
  --openai-api-key not-needed \
  --model openai/deepseek-coder
curl Examples
# Test model routing: chat with llama3
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'

# Switch to coding model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-coder","messages":[{"role":"user","content":"Write fizzbuzz in Python"}]}'

# Embed
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"nomic-embed","input":"Hello world"}'
Ollama API Compatibility
llama-swap does not natively expose the Ollama API (/api/chat, /api/generate). If your tool requires the Ollama API format, use litellm as a translation layer:
pip install litellm
litellm --model openai/llama3 --api_base http://localhost:8080/v1
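If the application you're integrating already uses the litellm Python SDK rather than raw Ollama HTTP calls, it can talk to llama-swap directly without a proxy. A rough sketch (the openai/ prefix selects litellm's OpenAI-compatible provider; the model name comes from the examples above):

# litellm translates its unified completion() call into an OpenAI-format request,
# which llama-swap serves natively.
from litellm import completion

response = completion(
    model="openai/llama3",
    api_base="http://localhost:8080/v1",
    api_key="not-needed",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)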
Environment Variable Configuration
For applications that read API config from environment variables:
export OPENAI_API_BASE="http://localhost:8080/v1"
export OPENAI_API_KEY="not-needed"
# Many tools (openai CLI, LangChain, etc.) auto-read these
openai chat.completions.create -m llama3 -q "Hello"
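One caveat: the OpenAI Python SDK v1+ reads OPENAI_BASE_URL rather than OPENAI_API_BASE (the variable LangChain and many older tools read), so exporting both is the safest default. A minimal sketch of a client configured purely from the environment:

import os

# Illustrative defaults; normally these come from the exports above.
# OPENAI_BASE_URL is the name the OpenAI SDK v1+ reads from the environment.
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:8080/v1")
os.environ.setdefault("OPENAI_API_KEY", "not-needed")

from openai import OpenAI

client = OpenAI()  # no arguments: base_url and api_key come from the environment
reply = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=20,
)
print(reply.choices[0].message.content)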
See Also
- API Reference — all endpoints and request/response formats
- Configuration — setting up models for use with these integrations
- Model Management — keeping embedding models persistent for low-latency RAG
- llama.cpp Python Bindings — alternative to llama-swap for single-process Python apps