Articulate·Local Models Overview Definition Goal Models Benchmarks GLM-5.2 Routing Proof Tests Decisions Next Client page →

Local Models v1.3

Living record · v1.3 · built from the live Mac Studio filesystem on 2026-06-19. Every number measured, not reported. · Client-facing proof & pitch →

Articulate's on-prem inference layer: open models running on the Mac Studio at $0 per token, for work that can't leave the box (PDPL / client-sensitive) and for bulk mechanical volume where cloud per-token adds up. This page tracks what's installed, how fast it really is, who routes to it, and what's next.

Project definition

Articulate's on-premises inference layer. Open-weight models (via Ollama, fronted by the LiteLLM "Agency" gateway) running on hardware we control, at zero marginal cost, for two purposes:

The Mac Studio is the reference rig and showroom — what we demo — not the client production target.

Goal

Not "$0 inference" as an end in itself. Local models earn their place on two axes:

The metric we optimise: reliable, client-shippable output per unit of Anthony's attention. The stack must multiply judgement, not consume it — every model, route and test below is justified against that, or it's decoration.

4
Models live
~28 GB on Ollama
82.5
Top eval tok/s
gemma4:26b, measured
36 GB
Studio RAM
M4 Max · the hard ceiling
$0
Marginal cost
local inference

The host

Everything below runs on one machine — which is also the production media box, so heavy runs wait for the stack to be quiet.

AttributeValue
MachineMac Studio · M4 Max · 36 GB unified memory · macOS 26.3
ReachTailnet 100.82.41.69 · LAN 192.168.1.246 · ssh anthonybooth
Internal disk460 GiB SSD · ~259 GiB free (fast, but small)
External diskCrucial X9 Pro /Volumes/2TB · ~1.6 TiB free (roomy, ~2 GB/s)
RuntimesOllama (API 127.0.0.1:11434) · LiteLLM gateway "The Agency" on :4000
Also hostsPlex / *arr / Transmission · dashboard collector · ntfy · villa wifi monitor

Models stored locally

Ollama store at ~/.ollama/models (~28 GB). Benchmark = eval tokens/sec on a 120-word generation, run 2026-06-19.

ModelParamsSizePulledEval tok/sLoadRole
gemma4:26b26B17 GB19 Jun82.59.6sHeaviest; fastest gen. Not yet wired into gateway.
qwen3:8b8B5.2 GB~5 Jun63.75.8sThe triager. Wired in LiteLLM.
qwen3-vl:8b8B (VL)6.1 GB~5 Jun64.85.8sVision / screenshots. Wired.
nomic-embed-text274 MB~5 JunEmbeddings → ~/vector-store (4 collections). Wired.

All four fit comfortably in 36 GB. gemma4:26b at ~17 GB is the heaviest and runs fine — and, notably, generates faster than the 8B models (no thinking-token overhead).

Benchmarks (measured 2026-06-19)

Eval rate — generation throughput. Higher is better. One battery, short prompt; treat as a baseline, not a leaderboard.

gemma4:26b
82.5 tok/s
qwen3-vl:8b
64.8 tok/s
qwen3:8b
63.7 tok/s
Full timing detail
qwen3:8b     load 5.79s  prompt-eval 360 t/s  eval 63.7 t/s (350 tok)
qwen3-vl:8b  load 5.80s  prompt-eval 301 t/s  eval 64.8 t/s (400 tok)
gemma4:26b   load 9.61s  prompt-eval 111 t/s  eval 82.5 t/s (868 tok)
Prompt: "Write a concise 120-word paragraph on why local AI matters for data privacy."
Runner: Ollama @127.0.0.1:11434 on Mac Studio M4 Max / macOS 26.3.

GLM-5.2 — attempted, parked

Status: download failed at 17%, partial files removed. Target was unsloth/GLM-5.2-GGUF, 2-bit dynamic quant UD-IQ2_M (6 shards, ~238 GB) to /Volumes/2TB. The downloader reached 1 of 6 files (~37 GB), then exited; the directory is no longer on disk. Only the HF hub ref stub remains.

Why it's parked, not retried: a ~238 GB model cannot fit the Studio's 36 GB RAM. Even as an MoE mmap'd off SSD it would be batch-only and slow; the internal SSD (~259 GB free) is too tight to host it safely, and its speed only partly offsets the RAM gap. The data-sovereign rationale is real — but it needs a 256 GB+ box, not this one.

The pragmatic call: the capability is already reachable — glm-5 is wired in the LiteLLM gateway via OpenRouter today. Use that for now. Re-attempt the local 238 GB build only if/when hardware with the memory headroom exists.

Routing — "The Agency" gateway

One LiteLLM endpoint (:4000) sits in front of every model. Per-agent virtual keys give per-agent cost tracking + budgets (Prisma/Postgres backed). Config: ~/litellm-config.yaml.

TierModelsWhenCost
T1 localqwen3-8b, qwen3-vl-8b, nomic-embed · qwen3-14b configured but not pulledTriage, embeddings, vision, PDPL-sensitive + bulk volume$0
T2 cloud-openglm-5, kimi-k2 (via OpenRouter)Open-weight capability without self-hostingper-token
T3 paid frontierclaude-sonnet-4-6, claude-opus-4-6, gpt-5, gemini-2.5-proTop-tier deliverables, hard reasoning, tool-useper-token

Which agent uses which model — why & when

Job / agentModelWhyState
Hype Radar scoringqwen3:8b → escalateCheap $0 triage of the feed; hard P/Q/S calls escalate uplive
Embeddings / vector-storenomic-embed-textLocal semantic index, 4 collections, no data egresslive
Vision / screenshot readsqwen3-vl:8bImage understanding on-boxlive
Local hard reasoninggemma4:26bFastest local; the new "hard call" candidateto wire
Client deliverables / reasoningclaude sonnet / opusQuality per token; keep top work on frontierconfig
Data-sovereign openglm-5 / kimi-k2Open-weight, cloud-routed; local GLM parkedconfig

Doctrine (toolbox canon): cheap triage → daily Sonnet → Opus for the hard stuff; local $0 absorbs the mechanical and the PDPL-sensitive volume.

Usage by day

Honest state: Ollama keeps no request log, and the LiteLLM spend ledger (Prisma/Postgres) is live but holds little history — the stack was assembled in the last ~2 weeks. Below is what's verifiable today; per-agent token/$ per day lands in v1.2 via a LiteLLM exporter.

DateEvent
~5 JunPulled qwen3:8b, qwen3-vl:8b, nomic-embed-text; LiteLLM gateway stood up
19 JunPulled gemma4:26b; first benchmark battery run; GLM-5.2 download attempted (failed at 17%)

v1.2 plan: set DATABASE_URL so LiteLLM per-agent budgets/spend actually record (Prisma engine is already running), then export tokens + $ per agent per day into this page. That turns "usage by day" from acquisition events into real call volume.

Proof — a real PDPL task, run on-box

Measured 2026-06-19 on the Studio. Synthetic data only — no real client records.

gemma4:26b summarised a UAE motor-insurance claim and flagged all five PII items for redaction — Emirates ID, mobile, vehicle plate, IBAN, policyholder name — entirely on the box. No cloud call, no API key, runs with the network off.

LegWhat was shownMeasured
SovereigntyInference via 127.0.0.1:11434, no API key, Ollama runs offline0 external calls by design
CapabilityClaim summary (3 bullets) + complete PII redaction list5/5 identifiers caught
Performancegemma4:26b (25.8B, Q4_K_M, 262k ctx)81.3 tok/s eval · ~0.3s warm load · $0

Caveat (measured, not hidden): gemma4 emits its reasoning in the raw stream, so a client pipeline suppresses or post-formats that. Substance was correct. → the client-facing version of this proof.

Test regime (determined)

Three suites. Replaces opinion about "how good is local" with a number.

SuiteWhat it doesCadence
SpeedFixed 5-prompt battery per model → tok/s, load, prompt-eval. Catches drift after macOS/Ollama updates.Weekly, automated (launchd)
Quality15–20 real Articulate task cards across redaction/triage, summarisation, extraction, classification, brand-voice, RAG — scored local vs cloud against a rubric. The "is local good enough for this job?" answer.On every model add/swap
SovereigntyAssert the inference path is local-only: 127.0.0.1, no key, answers offline.Attached to every proof

Enabler: the LiteLLM spend ledger (Postgres :5432 + prisma engine — both verified running) is gated but on. Turning on DATABASE_URL capture makes per-agent tokens/$ per day real, which feeds the quality suite's cost column.

Decision log — this build

DecisionWhy
GLM-5.2 local parked238GB can't fit 36GB RAM; download died at 17%. Use glm-5 via the gateway; revisit local only with a 256GB+ box.
Goal set: sovereignty + cost floor, not "$0 for its own sake"Local models earn their keep as a sellable PDPL differentiator and a way to reserve frontier spend for where it moves the needle.
Studio = reference rig, not client productionIt's the showroom you demo, then deploy on the client's own infra.
gemma4:26b → promote to local "hard call" slotBenchmarked fastest (82 tok/s) and passed the redaction proof. Not yet wired into the gateway.

Next ones to try

Messages & notifications

Flag for review. The GLM downloader (/tmp/glm_dl.py) had a completion hook that, on finish, reads credentials.md, regex-extracts a Telegram bot token, and POSTs a "download complete" message to a hard-coded chat ID (8727847893). It never fired (the download died). Confirm that chat ID is yours, and prefer reading a single named secret over grepping the whole creds file.

Working notifier: self-hosted ntfy on the Studio (topic studio-hype, port 8090) carries the media-pulse monitor.

Paths reference

Ollama store     ~/.ollama/models            (~28 GB)
Ollama binary    /Applications/Ollama.app/Contents/Resources/ollama
Ollama API       http://127.0.0.1:11434
LiteLLM gateway  ~/litellm-config.yaml  ·  :4000  ·  venv ~/litellm-venv
Vector store     ~/vector-store              (nomic-embed, 4 collections)
GLM target       /Volumes/2TB/glm-5.2/UD-IQ2_M   (removed — download failed)
Dashboard        ~/dashboard/collect.py      (launchd com.user.dashboard-collect)
SSH key          vault: Claude/sandbox-keys/id_ed25519  ·  [email protected]

Local Models v1.3 · Articulate · built 2026-06-19 from the live Studio filesystem. Numbers measured via Ollama --verbose and direct SSH probes. Not hand-estimated. · Client proof & pitch →