Local Models v1.3 — Articulate's on-prem inference stack

Articulate's on-prem inference layer: open models running on the Mac Studio at $0 per token, for work that can't leave the box (PDPL / client-sensitive) and for bulk mechanical volume where cloud per-token adds up. This page tracks what's installed, how fast it really is, who routes to it, and what's next.

Project definition

Articulate's on-premises inference layer. Open-weight models (via Ollama, fronted by the LiteLLM "Agency" gateway) running on hardware we control, at zero marginal cost, for two purposes:

Internal capability — handles sensitive and high-volume work without sending data to a US API.
Productised offer — the data-sovereign stack we install on a client's own infrastructure, owned by them at handover.

The Mac Studio is the reference rig and showroom — what we demo — not the client production target.

Goal

Not "$0 inference" as an end in itself. Local models earn their place on two axes:

Differentiator — a sellable data-sovereignty edge for regulated UAE/GCC buyers: PDPL-clean, "your data never leaves your jurisdiction."
Cost floor — absorb the sensitive, high-volume grind at $0 so frontier spend is reserved for the few jobs where it changes the output.

The metric we optimise: reliable, client-shippable output per unit of Anthony's attention. The stack must multiply judgement, not consume it — every model, route and test below is justified against that, or it's decoration.

Models live

~28 GB on Ollama

82.5

Top eval tok/s

gemma4:26b, measured

36 GB

Studio RAM

M4 Max · the hard ceiling

Marginal cost

local inference

The host

Everything below runs on one machine — which is also the production media box, so heavy runs wait for the stack to be quiet.

Attribute	Value
Machine	Mac Studio · M4 Max · 36 GB unified memory · macOS 26.3
Reach	Tailnet `100.82.41.69` · LAN `192.168.1.246` · ssh `anthonybooth`
Internal disk	460 GiB SSD · ~259 GiB free (fast, but small)
External disk	Crucial X9 Pro `/Volumes/2TB` · ~1.6 TiB free (roomy, ~2 GB/s)
Runtimes	Ollama (API `127.0.0.1:11434`) · LiteLLM gateway "The Agency" on `:4000`
Also hosts	Plex / *arr / Transmission · dashboard collector · ntfy · villa wifi monitor

Models stored locally

Ollama store at ~/.ollama/models (~28 GB). Benchmark = eval tokens/sec on a 120-word generation, run 2026-06-19.

Model	Params	Size	Pulled	Eval tok/s	Load	Role
gemma4:26b	26B	17 GB	19 Jun	82.5	9.6s	Heaviest; fastest gen. Not yet wired into gateway.
qwen3:8b	8B	5.2 GB	~5 Jun	63.7	5.8s	The triager. Wired in LiteLLM.
qwen3-vl:8b	8B (VL)	6.1 GB	~5 Jun	64.8	5.8s	Vision / screenshots. Wired.
nomic-embed-text	—	274 MB	~5 Jun	—	—	Embeddings → `~/vector-store` (4 collections). Wired.

All four fit comfortably in 36 GB. gemma4:26b at ~17 GB is the heaviest and runs fine — and, notably, generates faster than the 8B models (no thinking-token overhead).

Benchmarks (measured 2026-06-19)

Eval rate — generation throughput. Higher is better. One battery, short prompt; treat as a baseline, not a leaderboard.

gemma4:26b

82.5 tok/s

qwen3-vl:8b

64.8 tok/s

qwen3:8b

63.7 tok/s

Full timing detail

qwen3:8b     load 5.79s  prompt-eval 360 t/s  eval 63.7 t/s (350 tok)
qwen3-vl:8b  load 5.80s  prompt-eval 301 t/s  eval 64.8 t/s (400 tok)
gemma4:26b   load 9.61s  prompt-eval 111 t/s  eval 82.5 t/s (868 tok)
Prompt: "Write a concise 120-word paragraph on why local AI matters for data privacy."
Runner: Ollama @127.0.0.1:11434 on Mac Studio M4 Max / macOS 26.3.

GLM-5.2 — attempted, parked

Status: download failed at 17%, partial files removed. Target was unsloth/GLM-5.2-GGUF, 2-bit dynamic quant UD-IQ2_M (6 shards, ~238 GB) to /Volumes/2TB. The downloader reached 1 of 6 files (~37 GB), then exited; the directory is no longer on disk. Only the HF hub ref stub remains.

Why it's parked, not retried: a ~238 GB model cannot fit the Studio's 36 GB RAM. Even as an MoE mmap'd off SSD it would be batch-only and slow; the internal SSD (~259 GB free) is too tight to host it safely, and its speed only partly offsets the RAM gap. The data-sovereign rationale is real — but it needs a 256 GB+ box, not this one.

The pragmatic call: the capability is already reachable — glm-5 is wired in the LiteLLM gateway via OpenRouter today. Use that for now. Re-attempt the local 238 GB build only if/when hardware with the memory headroom exists.

Routing — "The Agency" gateway

One LiteLLM endpoint (:4000) sits in front of every model. Per-agent virtual keys give per-agent cost tracking + budgets (Prisma/Postgres backed). Config: ~/litellm-config.yaml.

Tier	Models	When	Cost
T1 local	qwen3-8b, qwen3-vl-8b, nomic-embed · qwen3-14b configured but not pulled	Triage, embeddings, vision, PDPL-sensitive + bulk volume	$0
T2 cloud-open	glm-5, kimi-k2 (via OpenRouter)	Open-weight capability without self-hosting	per-token
T3 paid frontier	claude-sonnet-4-6, claude-opus-4-6, gpt-5, gemini-2.5-pro	Top-tier deliverables, hard reasoning, tool-use	per-token

Which agent uses which model — why & when

Job / agent	Model	Why	State
Hype Radar scoring	qwen3:8b → escalate	Cheap $0 triage of the feed; hard P/Q/S calls escalate up	live
Embeddings / vector-store	nomic-embed-text	Local semantic index, 4 collections, no data egress	live
Vision / screenshot reads	qwen3-vl:8b	Image understanding on-box	live
Local hard reasoning	gemma4:26b	Fastest local; the new "hard call" candidate	to wire
Client deliverables / reasoning	claude sonnet / opus	Quality per token; keep top work on frontier	config
Data-sovereign open	glm-5 / kimi-k2	Open-weight, cloud-routed; local GLM parked	config

Doctrine (toolbox canon): cheap triage → daily Sonnet → Opus for the hard stuff; local $0 absorbs the mechanical and the PDPL-sensitive volume.

Usage by day

Honest state: Ollama keeps no request log, and the LiteLLM spend ledger (Prisma/Postgres) is live but holds little history — the stack was assembled in the last ~2 weeks. Below is what's verifiable today; per-agent token/$ per day lands in v1.2 via a LiteLLM exporter.

Date	Event
~5 Jun	Pulled qwen3:8b, qwen3-vl:8b, nomic-embed-text; LiteLLM gateway stood up
19 Jun	Pulled gemma4:26b; first benchmark battery run; GLM-5.2 download attempted (failed at 17%)

v1.2 plan: set DATABASE_URL so LiteLLM per-agent budgets/spend actually record (Prisma engine is already running), then export tokens + $ per agent per day into this page. That turns "usage by day" from acquisition events into real call volume.

Proof — a real PDPL task, run on-box

Measured 2026-06-19 on the Studio. Synthetic data only — no real client records.

gemma4:26b summarised a UAE motor-insurance claim and flagged all five PII items for redaction — Emirates ID, mobile, vehicle plate, IBAN, policyholder name — entirely on the box. No cloud call, no API key, runs with the network off.

Leg	What was shown	Measured
Sovereignty	Inference via `127.0.0.1:11434`, no API key, Ollama runs offline	0 external calls by design
Capability	Claim summary (3 bullets) + complete PII redaction list	5/5 identifiers caught
Performance	gemma4:26b (25.8B, Q4_K_M, 262k ctx)	81.3 tok/s eval · ~0.3s warm load · $0

Caveat (measured, not hidden): gemma4 emits its reasoning in the raw stream, so a client pipeline suppresses or post-formats that. Substance was correct. → the client-facing version of this proof.

Test regime (determined)

Three suites. Replaces opinion about "how good is local" with a number.

Suite	What it does	Cadence
Speed	Fixed 5-prompt battery per model → tok/s, load, prompt-eval. Catches drift after macOS/Ollama updates.	Weekly, automated (launchd)
Quality	15–20 real Articulate task cards across redaction/triage, summarisation, extraction, classification, brand-voice, RAG — scored local vs cloud against a rubric. The "is local good enough for this job?" answer.	On every model add/swap
Sovereignty	Assert the inference path is local-only: `127.0.0.1`, no key, answers offline.	Attached to every proof

Enabler: the LiteLLM spend ledger (Postgres :5432 + prisma engine — both verified running) is gated but on. Turning on DATABASE_URL capture makes per-agent tokens/$ per day real, which feeds the quality suite's cost column.

Decision log — this build

Decision	Why
GLM-5.2 local parked	238GB can't fit 36GB RAM; download died at 17%. Use `glm-5` via the gateway; revisit local only with a 256GB+ box.
Goal set: sovereignty + cost floor, not "$0 for its own sake"	Local models earn their keep as a sellable PDPL differentiator and a way to reserve frontier spend for where it moves the needle.
Studio = reference rig, not client production	It's the showroom you demo, then deploy on the client's own infra.
gemma4:26b → promote to local "hard call" slot	Benchmarked fastest (82 tok/s) and passed the redaction proof. Not yet wired into the gateway.

Next ones to try

qwen3:14b — already in the gateway config but not installed. Fills the mid gap between 8B and gemma4:26b; fits 36 GB easily. Quick win.
Wire gemma4:26b into LiteLLM — it benchmarked fastest; promote it to the local "hard reasoning" slot.
A local coding model (e.g. a qwen-coder class) — for code tasks that shouldn't hit cloud.
MLX runtime — Apple-native inference vs Ollama; worth a head-to-head on the M4 Max for tok/s.
GLM-5.2 local — parked until a 256 GB+ box exists. Use glm-5 cloud meanwhile.
Turn on LiteLLM spend DB — unlocks real per-agent usage + budgets.

Messages & notifications

Flag for review. The GLM downloader (/tmp/glm_dl.py) had a completion hook that, on finish, reads credentials.md, regex-extracts a Telegram bot token, and POSTs a "download complete" message to a hard-coded chat ID (8727847893). It never fired (the download died). Confirm that chat ID is yours, and prefer reading a single named secret over grepping the whole creds file.

Working notifier: self-hosted ntfy on the Studio (topic studio-hype, port 8090) carries the media-pulse monitor.

Paths reference

Ollama store     ~/.ollama/models            (~28 GB)
Ollama binary    /Applications/Ollama.app/Contents/Resources/ollama
Ollama API       http://127.0.0.1:11434
LiteLLM gateway  ~/litellm-config.yaml  ·  :4000  ·  venv ~/litellm-venv
Vector store     ~/vector-store              (nomic-embed, 4 collections)
GLM target       /Volumes/2TB/glm-5.2/UD-IQ2_M   (removed — download failed)
Dashboard        ~/dashboard/collect.py      (launchd com.user.dashboard-collect)
SSH key          vault: Claude/sandbox-keys/id_ed25519  ·  [email protected]

Local Models v1.3 · Articulate · built 2026-06-19 from the live Studio filesystem. Numbers measured via Ollama --verbose and direct SSH probes. Not hand-estimated. · Client proof & pitch →