Osaurus 🦕

Native, Apple Silicon–only local LLM server. Similar to Ollama, but built on Apple's MLX for maximum performance on M‑series chips. SwiftUI app + SwiftNIO server with OpenAI‑compatible endpoints.

Created by Dinoki Labs (dinoki.ai), a fully native desktop AI assistant and companion.

Highlights

  • Native MLX runtime: Optimized for Apple Silicon using MLX/MLXLLM
  • Apple Silicon only: Designed and tested for M‑series Macs
  • OpenAI API compatible: /v1/models and /v1/chat/completions (stream and non‑stream)
  • Function/Tool calling: OpenAI‑style tools + tool_choice, with tool_calls parsing and streaming deltas
  • Chat templates: Delegates templating to MLX ChatSession for model‑native formatting
  • Session reuse (KV cache): Faster multi‑turn chats via session_id
  • Fast token streaming: Server‑Sent Events for low‑latency output
  • Model manager UI: Browse, download, and manage MLX models from mlx-community
  • System resource monitor: Real-time CPU and RAM usage visualization
  • Self‑contained: SwiftUI app with an embedded SwiftNIO HTTP server

Requirements

  • macOS 15.5+
  • Apple Silicon (M1 or newer)
  • Xcode 16.4+ (to build from source)

Project Structure

osaurus/
├── Core/
│   ├── AppDelegate.swift
│   └── osaurusApp.swift
├── Controllers/
│   ├── ServerController.swift      # NIO server lifecycle
│   └── ModelManager.swift          # Model discovery & downloads (Hugging Face)
├── Models/
│   ├── MLXModel.swift
│   ├── OpenAIAPI.swift             # OpenAI‑compatible DTOs
│   ├── ServerConfiguration.swift
│   └── ServerHealth.swift
├── Networking/
│   ├── HTTPHandler.swift           # Request parsing & routing entry
│   ├── Router.swift                # Routes → handlers
│   └── AsyncHTTPHandler.swift      # SSE streaming for chat completions
├── Services/
│   ├── MLXService.swift            # MLX loading, session caching, generation
│   ├── SearchService.swift
│   └── SystemMonitorService.swift  # Real-time CPU and RAM monitoring
├── Theme/
│   └── Theme.swift
├── Views/
│   ├── Components/SimpleComponents.swift
│   ├── ContentView.swift           # Start/stop server, quick controls
│   └── ModelDownloadView.swift     # Browse/download/manage models
└── Assets.xcassets/

Features

  • Native MLX text generation with model session caching
  • Model manager with curated suggestions (Llama, Qwen, Gemma, Mistral, etc.)
  • Download sizes estimated via Hugging Face metadata
  • Streaming and non‑streaming chat completions
  • OpenAI‑compatible function calling with robust parser for model outputs (handles code fences/formatting noise)
  • Chat templating handled by MLX ChatSession using the model's configuration
  • Session reuse across turns via session_id (reuses KV cache when possible)
  • Auto‑detects stop sequences and BOS token from tokenizer configs
  • Health endpoint and simple status UI
  • Real-time system resource monitoring

Benchmarks

The following are 20-run averages from our batch benchmark suite. See raw results for details and variance.

Server      Model                         TTFT avg (ms)   Total avg (ms)   Chars/s avg   Success
Osaurus     llama-3.2-3b-instruct-4bit    191             1461             521           100%
Ollama      llama3.2                      59              1667             439           100%
LM Studio   llama-3.2-3b-instruct         56              1205             605           100%
  • Metrics: TTFT = time-to-first-token, Total = time to final token, Chars/s = streaming throughput.
  • Data sources: results/osaurus-vs-ollama-lmstudio-batch.summary.json, results/osaurus-vs-ollama-lmstudio-batch.results.csv.
  • How to reproduce: scripts/run_bench.sh calls scripts/benchmark_models.py to run prompts across servers and write results.

API Endpoints

  • GET / → Plain text status
  • GET /health → JSON health info
  • GET /models and GET /v1/models → OpenAI‑compatible models list
  • POST /chat/completions and POST /v1/chat/completions → OpenAI‑compatible chat completions
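
For a quick smoke test, here is a minimal sketch that hits the health and models endpoints using only the Python standard library (it assumes the server is running on the default port 8080):

import json
import urllib.request

BASE = "http://127.0.0.1:8080"

# GET /health returns JSON health info
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print(json.load(resp))

# GET /v1/models returns an OpenAI-compatible model list (expected to carry a "data" array)
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    for model in json.load(resp).get("data", []):
        print(model.get("id"))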

Getting Started

Download

Download the latest signed build from the Releases page.

Build and run

  1. Open osaurus.xcodeproj in Xcode 16.4+
  2. Build and run the osaurus target
  3. In the UI, configure the port via the gear icon (default 8080) and press Start
  4. Open the model manager to download a model (e.g., "Llama 3.2 3B Instruct 4bit")

Models are stored by default at ~/Documents/MLXModels. Override with the environment variable OSU_MODELS_DIR.

Use the API

Base URL: http://127.0.0.1:8080 (or your chosen port)

List models:

curl -s http://127.0.0.1:8080/v1/models | jq

Non‑streaming chat completion:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role":"user","content":"Write a haiku about dinosaurs"}],
        "max_tokens": 200
      }'

Streaming chat completion (SSE):

curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role":"user","content":"Summarize Jurassic Park in one paragraph"}],
        "stream": true
      }'
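
If you prefer to consume the stream from Python, here is a minimal sketch using the OpenAI SDK (see "Use with OpenAI SDKs" below; assumes the default port):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Summarize Jurassic Park in one paragraph"}],
    stream=True,
)

# Print content deltas as they arrive
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()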

Tip: Model names are lower‑cased with hyphens (derived from the friendly name), for example: Llama 3.2 3B Instruct 4bit → llama-3.2-3b-instruct-4bit.
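
As a rough illustration of that convention (the app performs the actual normalization; this helper is hypothetical):

def to_model_id(friendly_name: str) -> str:
    # Illustrative only: lower-case the friendly name and replace spaces with hyphens
    return friendly_name.lower().replace(" ", "-")

print(to_model_id("Llama 3.2 3B Instruct 4bit"))  # llama-3.2-3b-instruct-4bit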

Function/Tool Calling (OpenAI‑compatible)

Osaurus supports OpenAI‑style function calling. Send tools and an optional tool_choice in your request. The model is instructed to reply with an exact JSON object containing tool_calls, and the server parses it, tolerating common formatting noise such as code fences.

Define tools and let the model decide (tool_choice: "auto"):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [
          {"role":"system","content":"You can call functions to answer queries succinctly."},
          {"role":"user","content":"What\'s the weather in SF?"}
        ],
        "tools": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get weather by city name",
              "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
              }
            }
          }
        ],
        "tool_choice": "auto"
      }'

A non‑streaming response will include message.tool_calls and finish_reason: "tool_calls". Streaming responses emit OpenAI‑style deltas for tool_calls (id, type, function name, and chunked arguments), finishing with finish_reason: "tool_calls" and [DONE].
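
As a sketch of how a client might assemble those streamed deltas (using the OpenAI Python SDK against the default port; the tools definition mirrors the curl example above):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather by city name",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "What is the weather in SF?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

calls = {}  # index -> accumulated tool call
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        entry = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
        if tc.id:
            entry["id"] = tc.id
        if tc.function and tc.function.name:
            entry["name"] = tc.function.name
        if tc.function and tc.function.arguments:
            entry["arguments"] += tc.function.arguments

print(calls)  # e.g. {0: {'id': 'call_1', 'name': 'get_weather', 'arguments': '{"city":"SF"}'}}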

After you execute a tool, continue the conversation by sending a tool role message with tool_call_id:

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [
          {"role":"user","content":"What\'s the weather in SF?"},
          {"role":"assistant","content":"","tool_calls":[{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"SF\"}"}}]},
          {"role":"tool","tool_call_id":"call_1","content":"{\"tempC\":18,\"conditions\":\"Foggy\"}"}
        ]
      }'

Notes:

  • Only type: "function" tools are supported.
  • Arguments must be a JSON‑escaped string in the assistant response; Osaurus also tolerates a nested parameters object and will normalize it.
  • The parser accepts minor formatting noise such as code fences and assistant: prefixes.

Chat Templates

Osaurus relies on MLX ChatSession to apply the appropriate chat template for each model. System messages are passed as instructions; user content is fed via respond/streamResponse. This keeps prompts aligned with model‑native formatting and avoids double‑templating.

Session reuse (KV cache)

For faster multi‑turn conversations, you can reuse a chat session’s KV cache by providing session_id in your request. When possible (and not concurrently in use), Osaurus reuses the session for the same model, reducing latency by avoiding prompt re‑processing.

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-3b-instruct-4bit",
        "session_id": "my-session-1",
        "messages": [
          {"role":"user","content":"Tell me a fact about stegosaurs"}
        ]
      }'

Notes:

  • Sessions are opportunistically reused for a short window and only when not actively used by another request.
  • Keep session_id stable per ongoing conversation and per model.
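
From the OpenAI Python SDK, one way to pass this extra field is extra_body, as in this sketch (assumes the default port):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

history = []
for prompt in ["Tell me a fact about stegosaurs", "And one about velociraptors"]:
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(
        model="llama-3.2-3b-instruct-4bit",
        messages=history,
        extra_body={"session_id": "my-session-1"},  # keep stable per conversation and model
    )
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print(answer)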

Use with OpenAI SDKs

Point your client at Osaurus and use any placeholder API key.

Python example:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Hello there!"}],
)

print(resp.choices[0].message.content)

Python with tools (non‑stream):

import json
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather by city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Weather in SF?"}],
    tools=tools,
    tool_choice="auto",
)

tool_calls = resp.choices[0].message.tool_calls or []
for call in tool_calls:
    args = json.loads(call.function.arguments)  # parsed arguments for your tool, e.g. {"city": "SF"}
    result = {"tempC": 18, "conditions": "Foggy"}  # your tool result
    followup = client.chat.completions.create(
        model="llama-3.2-3b-instruct-4bit",
        messages=[
            {"role": "user", "content": "Weather in SF?"},
            {"role": "assistant", "content": "", "tool_calls": tool_calls},
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
    )
    print(followup.choices[0].message.content)

Models

  • Curated suggestions include Llama, Qwen, Gemma, Mistral, Phi, DeepSeek, etc. (4‑bit variants for speed)
  • Discovery pulls from Hugging Face mlx-community and computes size estimates
  • Required files are fetched automatically (tokenizer/config/weights)
  • Change the models directory with OSU_MODELS_DIR

Notes & Limitations

  • Apple Silicon only (requires MLX); Intel Macs are not supported
  • Localhost only, no authentication; put behind a proxy if exposing externally
  • /transcribe endpoints are placeholders pending Whisper integration

Dependencies

  • SwiftNIO (HTTP server)
  • SwiftUI/AppKit (UI)
  • MLX‑Swift, MLXLLM (runtime and chat session)

Contributors

Community

If you find Osaurus useful, please ⭐ the repo and share it!
