Native, Apple Silicon–only local LLM server. Similar to Ollama, but built on Apple's MLX for maximum performance on M‑series chips. SwiftUI app + SwiftNIO server with OpenAI‑compatible endpoints.
Created by Dinoki Labs (dinoki.ai), makers of a fully native desktop AI assistant and companion.
- Native MLX runtime: Optimized for Apple Silicon using MLX/MLXLLM
- Apple Silicon only: Designed and tested for M‑series Macs
- OpenAI API compatible: `/v1/models` and `/v1/chat/completions` (stream and non‑stream)
- Function/Tool calling: OpenAI‑style `tools` + `tool_choice`, with `tool_calls` parsing and streaming deltas
- Chat templates: Delegates templating to MLX `ChatSession` for model‑native formatting
- Session reuse (KV cache): Faster multi‑turn chats via `session_id`
- Fast token streaming: Server‑Sent Events for low‑latency output
- Model manager UI: Browse, download, and manage MLX models from `mlx-community`
- System resource monitor: Real-time CPU and RAM usage visualization
- Self‑contained: SwiftUI app with an embedded SwiftNIO HTTP server
- macOS 15.5+
- Apple Silicon (M1 or newer)
- Xcode 16.4+ (to build from source)
osaurus/
├── Core/
│ ├── AppDelegate.swift
│ └── osaurusApp.swift
├── Controllers/
│ ├── ServerController.swift # NIO server lifecycle
│ └── ModelManager.swift # Model discovery & downloads (Hugging Face)
├── Models/
│ ├── MLXModel.swift
│ ├── OpenAIAPI.swift # OpenAI‑compatible DTOs
│ ├── ServerConfiguration.swift
│ └── ServerHealth.swift
├── Networking/
│ ├── HTTPHandler.swift # Request parsing & routing entry
│ ├── Router.swift # Routes → handlers
│ └── AsyncHTTPHandler.swift # SSE streaming for chat completions
├── Services/
│ ├── MLXService.swift # MLX loading, session caching, generation
│ ├── SearchService.swift
│ └── SystemMonitorService.swift # Real-time CPU and RAM monitoring
├── Theme/
│ └── Theme.swift
├── Views/
│ ├── Components/SimpleComponents.swift
│ ├── ContentView.swift # Start/stop server, quick controls
│ └── ModelDownloadView.swift # Browse/download/manage models
└── Assets.xcassets/
- Native MLX text generation with model session caching
- Model manager with curated suggestions (Llama, Qwen, Gemma, Mistral, etc.)
- Download sizes estimated via Hugging Face metadata
- Streaming and non‑streaming chat completions
- OpenAI‑compatible function calling with robust parser for model outputs (handles code fences/formatting noise)
- Chat templating handled by MLX `ChatSession` using the model's configuration
- Session reuse across turns via `session_id` (reuses KV cache when possible)
- Auto‑detects stop sequences and BOS token from tokenizer configs
- Health endpoint and simple status UI
- Real-time system resource monitoring
The following are 20-run averages from our batch benchmark suite. See raw results for details and variance.
| Server | Model | TTFT avg (ms) | Total avg (ms) | Chars/s avg | Success |
|---|---|---|---|---|---|
| Osaurus | llama-3.2-3b-instruct-4bit | 191 | 1461 | 521 | 100% |
| Ollama | llama3.2 | 59 | 1667 | 439 | 100% |
| LM Studio | llama-3.2-3b-instruct | 56 | 1205 | 605 | 100% |
- Metrics: TTFT = time-to-first-token, Total = time to final token, Chars/s = streaming throughput.
- Data sources: `results/osaurus-vs-ollama-lmstudio-batch.summary.json`, `results/osaurus-vs-ollama-lmstudio-batch.results.csv`.
- How to reproduce: `scripts/run_bench.sh` calls `scripts/benchmark_models.py` to run prompts across servers and write results.
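To poke at those numbers programmatically, a minimal sketch (the file path comes from above; the JSON's field layout is an assumption):

```python
import json

# Inspect the batch summary referenced above; the exact field layout
# is an assumption here -- adjust to whatever the file actually contains.
with open("results/osaurus-vs-ollama-lmstudio-batch.summary.json") as f:
    summary = json.load(f)

print(json.dumps(summary, indent=2))
```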
- `GET /` → Plain text status
- `GET /health` → JSON health info
- `GET /models` and `GET /v1/models` → OpenAI‑compatible models list
- `POST /chat/completions` and `POST /v1/chat/completions` → OpenAI‑compatible chat completions
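A quick liveness check from Python, using only the standard library (default port assumed; the health payload's exact fields aren't specified here):

```python
import json
import urllib.request

# Query the health endpoint (default port assumed; payload fields may vary).
with urllib.request.urlopen("http://127.0.0.1:8080/health") as resp:
    print(json.load(resp))
```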
Download the latest signed build from the Releases page.
- Open `osaurus.xcodeproj` in Xcode 16.4+
- Build and run the `osaurus` target
- In the UI, configure the port via the gear icon (default `8080`) and press Start
- Open the model manager to download a model (e.g., "Llama 3.2 3B Instruct 4bit")
Models are stored by default at `~/Documents/MLXModels`. Override the location with the `OSU_MODELS_DIR` environment variable.
Base URL: `http://127.0.0.1:8080` (or your chosen port)
List models:
curl -s http://127.0.0.1:8080/v1/models | jq

Non‑streaming chat completion:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Write a haiku about dinosaurs"}],
"max_tokens": 200
}'

Streaming chat completion (SSE):
curl -N http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Summarize Jurassic Park in one paragraph"}],
"stream": true
}'

Tip: Model names are lower‑cased with hyphens (derived from the friendly name), for example: Llama 3.2 3B Instruct 4bit → `llama-3.2-3b-instruct-4bit`.
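That naming rule is easy to mirror client‑side. A hypothetical helper (not part of Osaurus) that reproduces it:

```python
# Hypothetical helper (not part of Osaurus) mirroring the naming rule:
# lower-case the friendly name and replace spaces with hyphens.
def model_slug(friendly_name: str) -> str:
    return friendly_name.lower().replace(" ", "-")

assert model_slug("Llama 3.2 3B Instruct 4bit") == "llama-3.2-3b-instruct-4bit"
```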
Osaurus supports OpenAI‑style function calling. Send `tools` and an optional `tool_choice` in your request. The model is instructed to reply with an exact JSON object containing `tool_calls`, and the server parses it, tolerating common formatting noise such as code fences.
Define tools and let the model decide (`tool_choice: "auto"`):
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role":"system","content":"You can call functions to answer queries succinctly."},
{"role":"user","content":"What\'s the weather in SF?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather by city name",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}
],
"tool_choice": "auto"
}'

A non‑streaming response will include `message.tool_calls` and `finish_reason: "tool_calls"`. Streaming responses emit OpenAI‑style deltas for `tool_calls` (id, type, function name, and chunked arguments), finishing with `finish_reason: "tool_calls"` and `[DONE]`.
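As a client‑side sketch, here is one way to accumulate those streamed tool‑call deltas with the OpenAI Python SDK; the tool definition matches the curl example above, and the accumulation logic is illustrative rather than prescribed by Osaurus:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "What's the weather in SF?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather by city name",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
    stream=True,
)

# Tool-call deltas arrive keyed by index; argument JSON arrives in chunks.
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        entry = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
        entry["id"] = tc.id or entry["id"]
        if tc.function:
            entry["name"] = tc.function.name or entry["name"]
            entry["arguments"] += tc.function.arguments or ""

print(calls)  # e.g. {0: {"id": "call_1", "name": "get_weather", "arguments": "{\"city\":\"SF\"}"}}
```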
After you execute a tool, continue the conversation by sending a `tool`‑role message with `tool_call_id`:
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role":"user","content":"What\'s the weather in SF?"},
{"role":"assistant","content":"","tool_calls":[{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"SF\"}"}}]},
{"role":"tool","tool_call_id":"call_1","content":"{\"tempC\":18,\"conditions\":\"Foggy\"}"}
]
}'

Notes:
- Only `type: "function"` tools are supported.
- Arguments must be a JSON‑escaped string in the assistant response; Osaurus also tolerates a nested `parameters` object and will normalize it.
- The parser accepts minor formatting noise such as code fences and `assistant:` prefixes.
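To make that tolerance concrete, here is a minimal normalization sketch. It is not Osaurus's actual parser, just an illustration of the kind of noisy reply that gets cleaned up:

```python
import json
import re

# Hypothetical raw model reply: the tool_calls JSON wrapped in a code fence
# with an "assistant:" prefix -- exactly the kind of noise the parser tolerates.
raw = (
    "assistant: ```json\n"
    '{"tool_calls": [{"id": "call_1", "type": "function", "function": '
    '{"name": "get_weather", "arguments": "{\\"city\\": \\"SF\\"}"}}]}\n'
    "```"
)

# Minimal normalization sketch (not Osaurus's actual parser):
# strip the role prefix and code fences, then parse the JSON payload.
cleaned = re.sub(r"^assistant:\s*", "", raw.strip())
cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
payload = json.loads(cleaned)
print(payload["tool_calls"][0]["function"]["name"])  # -> get_weather
```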
Osaurus relies on MLX `ChatSession` to apply the appropriate chat template for each model. System messages are passed as instructions; user content is fed via `respond`/`streamResponse`. This keeps prompts aligned with model‑native formatting and avoids double‑templating.
For faster multi‑turn conversations, you can reuse a chat session's KV cache by providing `session_id` in your request. When possible (and when the session is not concurrently in use), Osaurus reuses the session for the same model to reduce latency and redundant prefill computation.
curl -s http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"session_id": "my-session-1",
"messages": [
{"role":"user","content":"Tell me a fact about stegosaurs"}
]
}'

Notes:
- Sessions are opportunistically reused for a short window and only when not actively used by another request.
- Keep `session_id` stable per ongoing conversation and per model.
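Because `session_id` is an Osaurus extension rather than a standard OpenAI field, clients on the OpenAI Python SDK can pass it via `extra_body`; a sketch reusing the names from the curl example:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# session_id is an Osaurus extension, so it rides along via extra_body.
resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a fact about stegosaurs"}],
    extra_body={"session_id": "my-session-1"},
)
print(resp.choices[0].message.content)
```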
Point your client at Osaurus and use any placeholder API key.
Python example:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")
resp = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[{"role": "user", "content": "Hello there!"}],
)
print(resp.choices[0].message.content)

Python with tools (non‑stream):
import json
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather by city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}
]
resp = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[{"role": "user", "content": "Weather in SF?"}],
tools=tools,
tool_choice="auto",
)
tool_calls = resp.choices[0].message.tool_calls or []
for call in tool_calls:
args = json.loads(call.function.arguments)
result = {"tempC": 18, "conditions": "Foggy"} # your tool result
followup = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[
{"role": "user", "content": "Weather in SF?"},
{"role": "assistant", "content": "", "tool_calls": tool_calls},
{"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
],
)
print(followup.choices[0].message.content)

- Curated suggestions include Llama, Qwen, Gemma, Mistral, Phi, DeepSeek, etc. (4‑bit variants for speed)
- Discovery pulls from Hugging Face `mlx-community` and computes size estimates
- Required files are fetched automatically (tokenizer/config/weights)
- Change the models directory with `OSU_MODELS_DIR`
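To see what's on disk, a small sketch that assumes the default location and honors `OSU_MODELS_DIR` when set:

```python
import os
from pathlib import Path

# Default location, overridable via OSU_MODELS_DIR (see above).
models_dir = Path(os.environ.get("OSU_MODELS_DIR", Path.home() / "Documents" / "MLXModels"))

# Report each downloaded model directory and its approximate size.
for model in sorted(p for p in models_dir.iterdir() if p.is_dir()):
    size_gb = sum(f.stat().st_size for f in model.rglob("*") if f.is_file()) / 1e9
    print(f"{model.name}: {size_gb:.2f} GB")
```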
- Apple Silicon only (requires MLX); Intel Macs are not supported
- Localhost only, no authentication; put behind a proxy if exposing externally
- `/transcribe` endpoints are placeholders pending Whisper integration
- SwiftNIO (HTTP server)
- SwiftUI/AppKit (UI)
- MLX‑Swift, MLXLLM (runtime and chat session)
- wizardeur — first PR creator
- Join us on Discord
- Read the Contributing Guide and our Code of Conduct
- See our Security Policy for reporting vulnerabilities
- Get help in Support
- Pick up a good first issue or help wanted
If you find Osaurus useful, please ⭐ the repo and share it!