Tooling · Local Runtimes

Ollama v0.24 adds the Codex app and gets faster on Apple Silicon

May 2026's Ollama updates look like housekeeping. For a solo operator running models on a MacBook, they quietly remove the friction that makes local AI feel like hard work.

RAR Editor
Published May 2026 · 5 min read
The Quick Version
  • v0.24.0 (14 May 2026) adds `ollama launch codex-app` to open OpenAI's desktop Codex experience.
  • A reworked MLX sampler improves generation quality on Apple Silicon Macs.
  • Cached `/api/show` responses give roughly 6.7x lower median latency in tools like VS Code.
  • Gemma 4 MTP speculative decoding on Mac delivers over a 2x speed-up on the Gemma 4 31B model for coding.

Runtime release notes are where good news goes to be ignored. Nobody screenshots a changelog. But if you are a sole trader running models on a MacBook — a consultant drafting client work offline, an accountant who would rather sensitive numbers never touched a cloud — the unglamorous updates are the ones that actually change your day. Ollama’s May 2026 run, capped by v0.24.0 on 14 May, is exactly that kind of release.

What actually shipped

Ollama moved through v0.23.0 to v0.24.0 across May 2026, and the headline addition is a new launcher. The command `ollama launch codex-app` opens OpenAI’s desktop Codex experience directly from the runtime you already have installed, so you are not stitching together a separate install just to get a coding-assistant front end onto your machine.

The rest of the release is plumbing, and that is the point:

  • A reworked MLX sampler improves generation quality on Apple Silicon — the M-series chips most solo Mac users are running.
  • Cached `/api/show` responses cut median latency on integrations such as VS Code by roughly 6.7x, which is the call your editor fires constantly while it works out what a model can do.
  • Gemma 4 MTP speculative decoding on Mac delivers, on Ollama’s own testing, over a 2x speed increase on the Gemma 4 31B model for coding tasks.

6.7x lower median latency on `/api/show` calls after caching: the request editors like VS Code make repeatedly while probing a model’s capabilities.

Treat the speed figures as the maker’s reported, tested claims rather than a guarantee for your exact setup. A 31B model is also a lot to ask of a laptop; the speed-up matters most if you have the unified memory to run it comfortably in the first place. Smaller Gemma variants remain the sensible default for a single machine, and the honest test is always your own work, not a published benchmark.
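
One way to check the `/api/show` figure on your own machine is to time the call directly. The sketch below is a rough measurement harness, not Ollama's benchmark method: it assumes the server is listening on the default `http://localhost:11434`, that `curl` is installed, and that the model name you pass is one you have already pulled. The helper names `median` and `show_latency` are made up for this example.

```shell
# median: read one number per line on stdin, print the median.
median() {
  sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# show_latency MODEL [RUNS]: print the median seconds per /api/show request.
# Assumes a local Ollama server on the default port (an assumption, adjust as needed).
show_latency() {
  model=$1
  runs=${2:-20}
  i=0
  while [ "$i" -lt "$runs" ]; do
    # -w prints curl's total request time in seconds; the response body is discarded.
    curl -s -o /dev/null -w '%{time_total}\n' \
      http://localhost:11434/api/show \
      -d "{\"model\": \"$model\"}"
    i=$((i + 1))
  done | median
}
```

Run `show_latency <your-model> 20` before and after updating and compare the two medians. Absolute numbers will vary with your hardware and whatever else the machine is doing, so the ratio is the more useful figure.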

Speculative decoding, for the curious, is the trick doing the heavy lifting here: a small, fast model drafts several tokens ahead and the larger model verifies them in one pass, so you get the bigger model’s quality at closer to the smaller model’s speed. You do not have to understand the mechanism to benefit from it — Ollama turns it on for you — but it explains why the gains show up most on longer coding generations rather than one-line replies.
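
The draft-and-verify loop described above can be illustrated with a toy. This is a sketch of the general greedy idea only, not how Ollama or Gemma 4 implements it: both "models" here are hypothetical stand-ins that read from fixed word lists, and the names `TARGET`, `DRAFT`, `word`, and `speculative_decode` are invented for the example.

```shell
# Hypothetical stand-ins: fixed word lists instead of real models.
TARGET="the quick brown fox jumps over the lazy dog"  # what the large model would write
DRAFT="the quick brown cat jumps over the lazy dog"   # the small model's guess (wrong at word 4)

# word LIST N: print the Nth word (1-based) of a space-separated list.
word() { echo "$1" | cut -d' ' -f"$2"; }

# speculative_decode K: draft up to K words per pass, keep the ones the
# large model agrees with, and count how many large-model passes were needed.
speculative_decode() {
  k=$1
  set -- $TARGET
  n=$#                      # number of words to generate
  pos=1; passes=0; out=""
  while [ "$pos" -le "$n" ]; do
    passes=$((passes + 1))  # one verification pass of the large model
    i=0
    # Accept drafted words for as long as they match the large model's choice.
    while [ "$i" -lt "$k" ] && [ $((pos + i)) -le "$n" ]; do
      idx=$((pos + i))
      [ "$(word "$DRAFT" "$idx")" = "$(word "$TARGET" "$idx")" ] || break
      out="$out $(word "$TARGET" "$idx")"
      i=$((i + 1))
    done
    # The same pass also yields one word from the large model itself:
    # the correction after a mismatch, or a bonus word after a clean run.
    if [ $((pos + i)) -le "$n" ]; then
      out="$out $(word "$TARGET" $((pos + i)))"
      i=$((i + 1))
    fi
    pos=$((pos + i))
  done
  echo "${out# }"
  echo "passes=$passes"
}
```

Calling `speculative_decode 4` prints the full target sentence followed by `passes=2`, where word-by-word generation would have needed nine passes of the large model. The output is always the large model's own text; the draft only changes how many passes it takes to get there, which is why longer generations benefit most.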

Why a solo operator should care

The case for local inference has always been privacy and cost: nothing leaves your laptop, and once the hardware is bought, each query is effectively free. The catch has always been friction. Setup felt fiddly, editor integrations stuttered, and the experience never quite matched a polished cloud product.

These releases chip away at exactly that. The Codex launcher means one command instead of a setup session. The MLX sampler improvement matters because both Ollama and its rivals increasingly lean on MLX as the Apple-Silicon backend, so better sampling there lifts everything you generate. And the cached `/api/show` latency win is the kind of fix you feel rather than read about: your editor stops hanging on the small stuff.

The point of a good runtime upgrade is that you stop noticing the runtime. May’s release pushes Ollama further in that direction on the Mac most sole traders already own.

Ollama is widely treated as the default local LLM runner — the CLI-and-API tool builders reach for first. Keeping it current is low effort:

# After updating Ollama, confirm the installed version
ollama --version

# Pull a Gemma 4 build and try the new Codex launcher
# (the model tag below is an assumption; check the Ollama library for the exact name)
ollama pull gemma4
ollama launch codex-app

What this means for a small UK team

If you are a one-person practice or a small professional-services outfit running models locally, the takeaway is simple: keep your runtime current and let the maintainers do the optimising for you. You did not have to change your hardware, rewrite a workflow, or learn anything new to get a snappier editor, better output on a Mac, and a one-command path to a Codex front end.

Update Ollama, and if you have the memory headroom, test whether a Gemma 4 build plus speculative decoding genuinely speeds up your coding work — time a real task before and after rather than trusting the benchmark. For sensitive client work that should never leave the building, the gap between “local and a bit clunky” and “local and pleasant to use” is closing, one quiet release at a time.
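
For the before-and-after comparison, a small pair of helpers can turn raw numbers into something comparable. This is a sketch under stated assumptions: it relies on the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama's `/api/generate` endpoint reports in a non-streaming response, and the function names `tokens_per_second` and `speedup` are invented here.

```shell
# Fetch the raw numbers with a non-streaming request, for example:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "<your-model>", "prompt": "<a real task>", "stream": false}'
# then read eval_count and eval_duration from the JSON response.

# tokens_per_second EVAL_COUNT EVAL_DURATION_NS: generation speed in tokens/s.
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# speedup BEFORE_TPS AFTER_TPS: ratio of the two speeds, e.g. "2.25x".
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2fx\n", b / a }'
}
```

For instance, `tokens_per_second 300 6000000000` prints `50.0`, and `speedup 20 45` prints `2.25x`. Use the same prompt and model for both runs, and repeat each a few times, since a single generation can be skewed by model load time or background activity.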

Filed under Tooling · Runtimes
