Living record · v1.3 · built from the live Mac Studio filesystem on 2026-06-19. Every number measured, not reported. · Client-facing proof & pitch →
Articulate's on-prem inference layer: open models running on the Mac Studio at $0 per token, for work that can't leave the box (PDPL / client-sensitive) and for bulk mechanical volume where cloud per-token adds up. This page tracks what's installed, how fast it really is, who routes to it, and what's next.
Articulate's on-premises inference layer. Open-weight models (via Ollama, fronted by the LiteLLM "Agency" gateway) running on hardware we control, at zero marginal cost, for two purposes:
The Mac Studio is the reference rig and showroom — what we demo — not the client production target.
Not "$0 inference" as an end in itself. Local models earn their place on two axes:
The metric we optimise: reliable, client-shippable output per unit of Anthony's attention. The stack must multiply judgement, not consume it — every model, route and test below is justified against that, or it's decoration.
Everything below runs on one machine — which is also the production media box, so heavy runs wait for the stack to be quiet.
| Attribute | Value |
|---|---|
| Machine | Mac Studio · M4 Max · 36 GB unified memory · macOS 26.3 |
| Reach | Tailnet 100.82.41.69 · LAN 192.168.1.246 · ssh anthonybooth |
| Internal disk | 460 GiB SSD · ~259 GiB free (fast, but small) |
| External disk | Crucial X9 Pro /Volumes/2TB · ~1.6 TiB free (roomy, ~2 GB/s) |
| Runtimes | Ollama (API 127.0.0.1:11434) · LiteLLM gateway "The Agency" on :4000 |
| Also hosts | Plex / *arr / Transmission · dashboard collector · ntfy · villa wifi monitor |
Ollama store at ~/.ollama/models (~28 GB). Benchmark = eval tokens/sec on a 120-word generation, run 2026-06-19.
| Model | Params | Size | Pulled | Eval tok/s | Load | Role |
|---|---|---|---|---|---|---|
| gemma4:26b | 26B | 17 GB | 19 Jun | 82.5 | 9.6s | Heaviest; fastest gen. Not yet wired into gateway. |
| qwen3:8b | 8B | 5.2 GB | ~5 Jun | 63.7 | 5.8s | The triager. Wired in LiteLLM. |
| qwen3-vl:8b | 8B (VL) | 6.1 GB | ~5 Jun | 64.8 | 5.8s | Vision / screenshots. Wired. |
| nomic-embed-text | — | 274 MB | ~5 Jun | — | — | Embeddings → ~/vector-store (4 collections). Wired. |
All four fit comfortably in 36 GB. gemma4:26b at ~17 GB is the heaviest and runs fine — and, notably, generates faster than the 8B models (no thinking-token overhead).
Eval rate — generation throughput. Higher is better. One battery, short prompt; treat as a baseline, not a leaderboard.
qwen3:8b load 5.79s prompt-eval 360 t/s eval 63.7 t/s (350 tok) qwen3-vl:8b load 5.80s prompt-eval 301 t/s eval 64.8 t/s (400 tok) gemma4:26b load 9.61s prompt-eval 111 t/s eval 82.5 t/s (868 tok) Prompt: "Write a concise 120-word paragraph on why local AI matters for data privacy." Runner: Ollama @127.0.0.1:11434 on Mac Studio M4 Max / macOS 26.3.
Status: download failed at 17%, partial files removed. Target was unsloth/GLM-5.2-GGUF, 2-bit dynamic quant UD-IQ2_M (6 shards, ~238 GB) to /Volumes/2TB. The downloader reached 1 of 6 files (~37 GB), then exited; the directory is no longer on disk. Only the HF hub ref stub remains.
Why it's parked, not retried: a ~238 GB model cannot fit the Studio's 36 GB RAM. Even as an MoE mmap'd off SSD it would be batch-only and slow; the internal SSD (~259 GB free) is too tight to host it safely, and its speed only partly offsets the RAM gap. The data-sovereign rationale is real — but it needs a 256 GB+ box, not this one.
The pragmatic call: the capability is already reachable — glm-5 is wired in the LiteLLM gateway via OpenRouter today. Use that for now. Re-attempt the local 238 GB build only if/when hardware with the memory headroom exists.
One LiteLLM endpoint (:4000) sits in front of every model. Per-agent virtual keys give per-agent cost tracking + budgets (Prisma/Postgres backed). Config: ~/litellm-config.yaml.
| Tier | Models | When | Cost |
|---|---|---|---|
| T1 local | qwen3-8b, qwen3-vl-8b, nomic-embed · qwen3-14b configured but not pulled | Triage, embeddings, vision, PDPL-sensitive + bulk volume | $0 |
| T2 cloud-open | glm-5, kimi-k2 (via OpenRouter) | Open-weight capability without self-hosting | per-token |
| T3 paid frontier | claude-sonnet-4-6, claude-opus-4-6, gpt-5, gemini-2.5-pro | Top-tier deliverables, hard reasoning, tool-use | per-token |
| Job / agent | Model | Why | State |
|---|---|---|---|
| Hype Radar scoring | qwen3:8b → escalate | Cheap $0 triage of the feed; hard P/Q/S calls escalate up | live |
| Embeddings / vector-store | nomic-embed-text | Local semantic index, 4 collections, no data egress | live |
| Vision / screenshot reads | qwen3-vl:8b | Image understanding on-box | live |
| Local hard reasoning | gemma4:26b | Fastest local; the new "hard call" candidate | to wire |
| Client deliverables / reasoning | claude sonnet / opus | Quality per token; keep top work on frontier | config |
| Data-sovereign open | glm-5 / kimi-k2 | Open-weight, cloud-routed; local GLM parked | config |
Doctrine (toolbox canon): cheap triage → daily Sonnet → Opus for the hard stuff; local $0 absorbs the mechanical and the PDPL-sensitive volume.
Honest state: Ollama keeps no request log, and the LiteLLM spend ledger (Prisma/Postgres) is live but holds little history — the stack was assembled in the last ~2 weeks. Below is what's verifiable today; per-agent token/$ per day lands in v1.2 via a LiteLLM exporter.
| Date | Event |
|---|---|
| ~5 Jun | Pulled qwen3:8b, qwen3-vl:8b, nomic-embed-text; LiteLLM gateway stood up |
| 19 Jun | Pulled gemma4:26b; first benchmark battery run; GLM-5.2 download attempted (failed at 17%) |
v1.2 plan: set DATABASE_URL so LiteLLM per-agent budgets/spend actually record (Prisma engine is already running), then export tokens + $ per agent per day into this page. That turns "usage by day" from acquisition events into real call volume.
Measured 2026-06-19 on the Studio. Synthetic data only — no real client records.
gemma4:26b summarised a UAE motor-insurance claim and flagged all five PII items for redaction — Emirates ID, mobile, vehicle plate, IBAN, policyholder name — entirely on the box. No cloud call, no API key, runs with the network off.
| Leg | What was shown | Measured |
|---|---|---|
| Sovereignty | Inference via 127.0.0.1:11434, no API key, Ollama runs offline | 0 external calls by design |
| Capability | Claim summary (3 bullets) + complete PII redaction list | 5/5 identifiers caught |
| Performance | gemma4:26b (25.8B, Q4_K_M, 262k ctx) | 81.3 tok/s eval · ~0.3s warm load · $0 |
Caveat (measured, not hidden): gemma4 emits its reasoning in the raw stream, so a client pipeline suppresses or post-formats that. Substance was correct. → the client-facing version of this proof.
Three suites. Replaces opinion about "how good is local" with a number.
| Suite | What it does | Cadence |
|---|---|---|
| Speed | Fixed 5-prompt battery per model → tok/s, load, prompt-eval. Catches drift after macOS/Ollama updates. | Weekly, automated (launchd) |
| Quality | 15–20 real Articulate task cards across redaction/triage, summarisation, extraction, classification, brand-voice, RAG — scored local vs cloud against a rubric. The "is local good enough for this job?" answer. | On every model add/swap |
| Sovereignty | Assert the inference path is local-only: 127.0.0.1, no key, answers offline. | Attached to every proof |
Enabler: the LiteLLM spend ledger (Postgres :5432 + prisma engine — both verified running) is gated but on. Turning on DATABASE_URL capture makes per-agent tokens/$ per day real, which feeds the quality suite's cost column.
| Decision | Why |
|---|---|
| GLM-5.2 local parked | 238GB can't fit 36GB RAM; download died at 17%. Use glm-5 via the gateway; revisit local only with a 256GB+ box. |
| Goal set: sovereignty + cost floor, not "$0 for its own sake" | Local models earn their keep as a sellable PDPL differentiator and a way to reserve frontier spend for where it moves the needle. |
| Studio = reference rig, not client production | It's the showroom you demo, then deploy on the client's own infra. |
| gemma4:26b → promote to local "hard call" slot | Benchmarked fastest (82 tok/s) and passed the redaction proof. Not yet wired into the gateway. |
glm-5 cloud meanwhile.Flag for review. The GLM downloader (/tmp/glm_dl.py) had a completion hook that, on finish, reads credentials.md, regex-extracts a Telegram bot token, and POSTs a "download complete" message to a hard-coded chat ID (8727847893). It never fired (the download died). Confirm that chat ID is yours, and prefer reading a single named secret over grepping the whole creds file.
Working notifier: self-hosted ntfy on the Studio (topic studio-hype, port 8090) carries the media-pulse monitor.
Ollama store ~/.ollama/models (~28 GB) Ollama binary /Applications/Ollama.app/Contents/Resources/ollama Ollama API http://127.0.0.1:11434 LiteLLM gateway ~/litellm-config.yaml · :4000 · venv ~/litellm-venv Vector store ~/vector-store (nomic-embed, 4 collections) GLM target /Volumes/2TB/glm-5.2/UD-IQ2_M (removed — download failed) Dashboard ~/dashboard/collect.py (launchd com.user.dashboard-collect) SSH key vault: Claude/sandbox-keys/id_ed25519 · [email protected]
Local Models v1.3 · Articulate · built 2026-06-19 from the live Studio filesystem. Numbers measured via Ollama --verbose and direct SSH probes. Not hand-estimated. · Client proof & pitch →