
Running Local Inference on the New Gemma Models — From Departmental Hardware
How small teams are deploying quantised Gemma models on commodity GPUs to run private, offline pipelines. No cloud, no data leaving the building.
11 pieces on local inference — practical workflows, case studies and field notes.

Stanford's Hazy Research has shipped the first credible open-source framework for personal AI agents that run on your own hardware. For UK operators, local-first has stopped being a manifesto and started being a curl command.

Anthropic just dropped Claude Fable 5 into the $20 tier and MiniMax M3 matches it on agentic work. For a small team, the value question has quietly flipped.

The UK is preparing its first fully sovereign frontier AI model, with startup Cosine leading and a roster of major British firms on design. Here's why data residency and procurement confidence are the real story.

A frontier-grade model with open weights, a million-token context window and native multimodality. For small teams, it reframes what is possible without a per-seat cloud contract — if you can find the hardware.

Gemma 4 adds built-in tool calling and vision support, and Ollama now runs it fully. For a retail team, that means document, shelf and stock workflows that never send an image to the cloud.

A 27B model that reportedly tops consumer-hardware leaderboards and fits in a single 24GB card at Q4. For a sole trader or a small professional-services team, that is the sweet spot worth understanding.

Both run open models on your own hardware. The right pick has less to do with benchmarks than with who on your team will actually be using it.

AMD's software stack spent years as the awkward alternative to NVIDIA. In 2026 it is a credible cost play for a back-office team — provided you check a few things first.

Meta's Llama 4 Scout brings a ten-million-token context window into the open. For logistics and data-heavy teams, the real question is what a window that big is, and isn't, actually good for.

May 2026's runtime updates look like housekeeping. For a solo operator running models on a MacBook, they quietly remove some of the friction that makes local AI feel like hard work.