Google DeepMind released an experimental open text model called DiffusionGemma on 10 June 2026. NVIDIA says the model can run up to four times faster than a standard local model on the same single-user hardware, and the licence means no per-word bill once it is running on your own kit. For a UK small firm using AI on a single office workstation — running a chat assistant, an internal tool, or an automated workflow — that gap is the difference between an assistant that feels responsive and one that crawls.
How the model works
Almost every widely used AI model writes one word at a time. DiffusionGemma works differently: it drafts a whole block of text in one pass and then refines it, the same approach image generators use to turn noise into a picture. When a single user runs a standard model on their own machine, the model spends most of its time waiting between words; the graphics hardware is fast but sits idle in the gaps. DiffusionGemma gives the hardware a whole block of work to process in parallel, and that is the speed-up NVIDIA and Ars Technica both measured when they tested the release. The practical effect is that a local model on a single high-end card, which once felt sluggish, now feels closer to a cloud-grade service, with no per-word bill on top.
4× the speed of an equivalent word-by-word model on the same single-user hardware, per NVIDIA’s own benchmark
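To make the block-versus-word idea concrete, here is a minimal sketch that simply counts model passes for a long reply. It is a toy under stated assumptions, not a description of DiffusionGemma's internals: the block size and number of refinement passes are invented for illustration, and the sketch treats every pass as costing the same, which real hardware will not do exactly.

```python
import math


def word_by_word_passes(n_tokens: int) -> int:
    """A word-by-word model needs one pass per generated token."""
    return n_tokens


def block_passes(n_tokens: int, block_size: int, refine_steps: int) -> int:
    """A block-refining model spends a fixed number of passes per block,
    however many tokens in that block end up being used."""
    blocks = math.ceil(n_tokens / block_size)
    return blocks * refine_steps


# Hypothetical settings, chosen only for illustration.
# They are NOT taken from Google's or NVIDIA's published material.
BLOCK_SIZE = 64
REFINE_STEPS = 16

for n_tokens in (256, 1024):
    sequential = word_by_word_passes(n_tokens)
    blockwise = block_passes(n_tokens, BLOCK_SIZE, REFINE_STEPS)
    print(f"{n_tokens} tokens: {sequential} word-by-word passes, "
          f"{blockwise} block passes")
```

For replies that fill whole blocks, the ratio in this toy is just the block size divided by the number of refinement passes, so it reflects the numbers chosen above and says nothing about the benchmark figures reported by NVIDIA.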
What the sources say
- Google’s announcement (10 June 2026): positions DiffusionGemma for speed-critical local work — in-line editing, code completion, rapid iteration, and other non-linear text tasks. Standard Gemma 4 remains the recommendation where maximum output quality is the priority.
- NVIDIA: reports the model runs roughly four times faster than an equivalent word-by-word model on the same single-user setup, with optimisation across consumer cards and the DGX range.
- Ars Technica: tested the release on a high-end gaming graphics card and confirmed the speed boost. Flagged the higher error rate and the cost on short replies.
- Simon Willison: ran the free hosted demo and recorded a comparable throughput on a typical text task.
Two trade-offs the sources flag
Both Ars Technica and NVIDIA name the same two costs in their coverage.
First, short replies are expensive. The model still has to refine a full block even when the user wants a one-line answer, so a quick classification or a 20-word summary is faster on the older word-by-word approach. Second, the error rate is higher. A single badly placed word can corrupt the whole block and force a fresh attempt — a problem image generators can absorb but text cannot. Google’s own post labels DiffusionGemma experimental and recommends standard Gemma 4 where reliability is the priority.
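Both costs can be put into rough arithmetic. The sketch below is a back-of-envelope model, not a measurement: it assumes a fixed block size, a fixed number of refinement passes, a block pass that costs somewhat more than a single word-by-word pass, and a fixed chance that a block has to be redone from scratch. Every number in it is hypothetical.

```python
import math


def expected_block_cost(n_tokens: int, block_size: int, refine_steps: int,
                        pass_cost: float, retry_prob: float) -> float:
    """Expected cost of a block-refined reply, in units of one
    word-by-word pass.

    Assumes each attempt fails independently with probability
    retry_prob and is then redone in full, so the expected number
    of attempts is 1 / (1 - retry_prob).
    """
    if not 0 <= retry_prob < 1:
        raise ValueError("retry_prob must be in [0, 1)")
    blocks = math.ceil(n_tokens / block_size)
    expected_attempts = 1 / (1 - retry_prob)
    return blocks * refine_steps * pass_cost * expected_attempts


def break_even_tokens(block_size: int, refine_steps: int, pass_cost: float,
                      retry_prob: float, limit: int = 10_000):
    """Smallest reply length at which the block approach costs no more
    than word-by-word generation (which costs n_tokens passes).
    Returns None if no such length exists up to `limit`."""
    for n_tokens in range(1, limit + 1):
        cost = expected_block_cost(n_tokens, block_size, refine_steps,
                                   pass_cost, retry_prob)
        if cost <= n_tokens:
            return n_tokens
    return None


# Hypothetical values for illustration only.
settings = dict(block_size=64, refine_steps=16, pass_cost=1.5, retry_prob=0.1)

print("Cost of a 20-token reply:", expected_block_cost(20, **settings))
print("Break-even reply length:", break_even_tokens(**settings))
```

With these made-up settings a 20-token reply costs more than the 20 passes a word-by-word model would need, which is the shape of the short-reply penalty the sources describe. Raising `retry_prob` pushes the break-even point further out, which is why the error rate matters for speed as well as for quality.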
What to try, and what to weigh up
For a UK small firm, the appeal is simple: work that used to crawl on a desktop AI box gets closer to interactive, and the model fits the kind of tasks Google’s announcement calls out: in-line editing, code completion, and redrafting, where the model revises a block rather than writing it from left to right. It also pairs with the multi-step AI assistants we have covered before, where a tool-using helper calls an API, reads the result, and writes a summary across several passes.
Two paths to try it. If you have a recent high-end NVIDIA graphics card, pull the model weights from Hugging Face and run them through Hugging Face Transformers, vLLM or Unsloth — all three are supported from day one. If you would rather not commit the kit, NVIDIA is hosting the model free on its NIM API for a quick comparison test. For the wider picture on running local models on a single workstation, our Gemma 4 on your own hardware guide and the LM Studio vs Ollama comparison are still the right starting points.
Where to be careful. Run any pilot on a non-customer-facing workflow for a week. Both NVIDIA and Ars Technica flag a higher error rate than the standard Gemma on the same hardware, and a real cost on short replies. For one-line classifications, the older word-by-word models still win. For longer, block-shaped jobs where you were already running a local model, the trade is worth testing on your own data first.
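One way to keep a pilot honest is to log every reply from both setups and compare the same two numbers at the end of the week. The sketch below is a minimal tally, assuming you record the time each reply took and whether a person judged it acceptable. The `Trial` record, the setup labels and the sample rows are all hypothetical placeholders, not results.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trial:
    setup: str        # label for the model being tested, e.g. "current"
    seconds: float    # wall-clock time taken to produce the reply
    acceptable: bool  # did a reviewer accept the output as-is?


def summarise(trials: list[Trial], setup: str) -> dict:
    """Error rate and median latency for one setup."""
    rows = [t for t in trials if t.setup == setup]
    if not rows:
        raise ValueError(f"no trials recorded for setup {setup!r}")
    errors = sum(1 for t in rows if not t.acceptable)
    return {
        "trials": len(rows),
        "error_rate": errors / len(rows),
        "median_seconds": median([t.seconds for t in rows]),
    }


# Placeholder rows to show the shape of the log; replace with your own.
log = [
    Trial("current", 8.0, True),
    Trial("current", 9.0, True),
    Trial("current", 10.0, True),
    Trial("current", 11.0, False),
    Trial("candidate", 2.0, True),
    Trial("candidate", 3.0, False),
    Trial("candidate", 3.0, True),
    Trial("candidate", 4.0, False),
]

for name in ("current", "candidate"):
    print(name, summarise(log, name))
```

Running both setups on the same prompts keeps the comparison fair, and a week of real jobs will give a steadier picture than a handful of test prompts.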
Our view: for a firm paying cloud rates for a chat-shaped workload that a single office box could run, DiffusionGemma is the most interesting local release in months. Run a one-week pilot, compare the error rate against your current setup, and only swap it in once you have seen real numbers on your own data.


