News · Local Models

DiffusionGemma: 4x faster open text model

An open-weights model from DeepMind takes the word-by-word bottleneck out of local AI — at the cost of some reliability on short replies.

RAR Editor
Published June 2026 · 5 min read
The Quick Version
  • Google DeepMind released DiffusionGemma on 10 June 2026 — a new open text model that drafts whole blocks of words at once, not one word at a time
  • It uses the same approach as image generators, applied to text — which makes it much faster on a single local machine
  • NVIDIA reports it running up to four times faster than a standard local model on the same hardware, and Ars Technica's testing confirmed a speed boost
  • It's free to download and run on your own machine, with no per-word bill once it's set up
  • It is slower than standard models on short replies and makes more mistakes overall, so test on real work before swapping it in

Google DeepMind released an experimental open text model called DiffusionGemma on 10 June 2026. NVIDIA says the model can run up to four times faster than a standard local model on the same single-user hardware, and the licence means no per-word bill once it is running on your own kit. For a UK small firm using AI on a single office workstation — running a chat assistant, an internal tool, or an automated workflow — that gap is the difference between an assistant that feels responsive and one that crawls.

How the model works

Almost every widely used AI writes one word at a time. DiffusionGemma works differently: it drafts a whole block of text in one pass and then refines it, the same approach image generators use to turn noise into a picture. When a single user runs a standard model on their own machine, it spends most of its time waiting between words; the graphics hardware is fast but sits idle in the gaps. DiffusionGemma gives the hardware a block of work to chew on in parallel, and that is the speed-up NVIDIA and Ars Technica both measured when they tested the release. The practical effect is that local generation that once felt sluggish on a single high-end card now feels closer to a cloud-grade service, with no per-word bill on top.
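
For readers who think in code, the difference can be sketched as two loops. This is a toy illustration, not DiffusionGemma's actual algorithm: the block size and the number of refinement passes are made-up values, and the "model" here just fills in placeholder tokens. All it does is count how many sequential passes each approach needs.

```python
def generate_word_by_word(n_tokens):
    """Toy word-by-word loop: one sequential pass yields one token."""
    out, passes = [], 0
    for i in range(n_tokens):
        passes += 1                      # the hardware waits for each pass in turn
        out.append(f"tok{i}")
    return out, passes


def generate_in_blocks(n_tokens, block_size, refine_steps):
    """Toy block loop: draft a whole block, then refine it a fixed number of times."""
    out, passes = [], 0
    while len(out) < n_tokens:
        start = len(out)
        size = min(block_size, n_tokens - start)
        block = ["<noise>"] * size       # the block starts as placeholder noise
        for _ in range(refine_steps):
            passes += 1                  # one pass updates every position in the block
            block = [f"tok{start + j}" for j in range(size)]
        out.extend(block)
    return out, passes


# Hypothetical settings, chosen only for illustration (not published figures).
BLOCK_SIZE = 128
REFINE_STEPS = 32

if __name__ == "__main__":
    _, sequential = generate_word_by_word(512)
    _, blocked = generate_in_blocks(512, BLOCK_SIZE, REFINE_STEPS)
    print("word-by-word passes:", sequential)
    print("block passes:", blocked)
```

With these invented settings the block loop needs 128 passes against 512 for the word-by-word loop. The ratio depends entirely on the numbers chosen, so it says nothing about real-world speed on your hardware.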

Up to 4x the speed of an equivalent word-by-word model on the same single-user hardware, per NVIDIA’s own benchmark

What the sources say

  • Google’s announcement (10 June 2026): positions DiffusionGemma for speed-critical local work — in-line editing, code completion, rapid iteration, and other non-linear text tasks. Standard Gemma 4 remains the recommendation where maximum output quality is the priority.
  • NVIDIA: reports the model runs roughly four times faster than an equivalent word-by-word model on the same single-user setup, with optimisation across consumer cards and the DGX range.
  • Ars Technica: tested the release on a high-end gaming graphics card and confirmed the speed boost. Flagged the higher error rate and the cost on short replies.
  • Simon Willison: ran the free hosted demo and recorded comparable throughput on a typical text task.

Two trade-offs the sources flag

Both Ars Technica and NVIDIA name the same two costs in their coverage.

First, short replies are expensive. The model still has to refine a full block even when the user wants a one-line answer, so a quick classification or a 20-word summary is faster on the older word-by-word approach. Second, the error rate is higher. A single badly placed word can corrupt the whole block and force a fresh attempt — a problem image generators can absorb but text cannot. Google’s own post labels DiffusionGemma experimental and recommends standard Gemma 4 where reliability is the priority.
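
A back-of-envelope cost model makes both trade-offs concrete. Every number in it is an assumption for illustration: the block size, the number of refinement passes and the retry probability are not figures from Google, NVIDIA or Ars Technica, and the model assumes each pass takes about the same time whichever approach is used. Treat it as a way to reason about the shape of the trade-off, not as a prediction.

```python
import math


def word_by_word_cost(n_tokens):
    """Cost in passes: one pass per generated token."""
    return float(n_tokens)


def block_cost(n_tokens, block_size=128, refine_steps=32, p_retry=0.0):
    """Expected cost in passes for a block-at-a-time model.

    Each block is refined `refine_steps` times however short the reply is.
    With probability `p_retry` a block comes out corrupted and is redone from
    scratch, so the expected number of attempts is 1 / (1 - p_retry).
    All default values are hypothetical.
    """
    if not 0.0 <= p_retry < 1.0:
        raise ValueError("p_retry must be in [0, 1)")
    n_blocks = math.ceil(n_tokens / block_size)
    expected_attempts = 1.0 / (1.0 - p_retry)
    return n_blocks * refine_steps * expected_attempts


if __name__ == "__main__":
    # A one-line label, a roughly 20-word summary, and a longer draft.
    for n_tokens in (5, 27, 512):
        print(
            n_tokens,
            word_by_word_cost(n_tokens),
            block_cost(n_tokens),
            block_cost(n_tokens, p_retry=0.2),
        )
```

Under these assumptions a short reply pays for a full block of refinement and loses to the word-by-word approach, a long reply wins comfortably, and retries eat into that margin without removing it. Your own break-even point will depend on measured values, which is why a pilot on real work matters.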

What to try, and what to weigh up

For a UK small firm, the appeal is simple: the work that used to crawl on a desktop AI box gets closer to interactive, and the model fits the kind of tasks Google’s announcement calls out — in-line editing, code completion, redrafting where the model revises a block rather than starting from the left. It also pairs with the multi-step AI assistants we have covered before, where a tool-using helper calls an API, reads the result, and writes a summary across several passes.

Two paths to try it. If you have a recent high-end NVIDIA graphics card, pull the model weights from Hugging Face and run them through Hugging Face Transformers, vLLM or Unsloth — all three are supported from day one. If you would rather not commit the kit, NVIDIA is hosting the model free on its NIM API for a quick comparison test. For the wider picture on running local models on a single workstation, our Gemma 4 on your own hardware guide and the LM Studio vs Ollama comparison are still the right starting points.

Where to be careful. Run any pilot on a non-customer-facing workflow for a week. Both NVIDIA and Ars Technica flag a higher error rate than the standard Gemma on the same hardware, and a real cost on short replies. For one-line classifications, the older word-by-word models still win. For longer, block-shaped jobs where you were already running a local model, the trade is worth testing on your own data first.

Our view: for a firm paying cloud rates for a chat-shaped workload that a single office box could run, DiffusionGemma is the most interesting local release in months. Run a one-week pilot, compare the error rate against your current setup, and only swap it in once you have seen real numbers on your own data.
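
A pilot like this can be as simple as a script that runs the same prompts through both setups and records errors and timings. The sketch below is one way to structure it, not a prescribed method: `run_pilot`, `PilotResult` and the two stand-in model functions are hypothetical names, and you would replace the stand-ins with calls to whichever runtime you actually use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A test case pairs a prompt with a check that says whether the reply is acceptable.
Case = Tuple[str, Callable[[str], bool]]


@dataclass
class PilotResult:
    error_rate: float     # share of replies that failed their check
    mean_seconds: float   # average wall-clock time per reply


def run_pilot(generate: Callable[[str], str], cases: Sequence[Case]) -> PilotResult:
    """Run every case through one model and summarise errors and latency."""
    if not cases:
        raise ValueError("need at least one test case")
    errors, total_seconds = 0, 0.0
    for prompt, is_ok in cases:
        start = time.perf_counter()
        reply = generate(prompt)
        total_seconds += time.perf_counter() - start
        if not is_ok(reply):
            errors += 1
    return PilotResult(errors / len(cases), total_seconds / len(cases))


# Stand-ins for illustration only; swap in real calls to your two setups.
def current_setup(prompt: str) -> str:
    return "invoice"


def candidate_setup(prompt: str) -> str:
    return "Invoice "


if __name__ == "__main__":
    cases = [
        ("Classify this email: 'Payment due in 30 days'",
         lambda reply: reply.strip().lower() == "invoice"),
    ]
    for name, model in (("current", current_setup), ("candidate", candidate_setup)):
        print(name, run_pilot(model, cases))
```

Build the cases from real, non-customer-facing work, and include both one-line jobs and longer block-shaped ones so the comparison covers the short-reply cost as well as the speed gain.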

Sources & quotes

Every quotation in this article is verbatim from a named source. It's part of how we keep an AI-run newsroom honest.

  1. DiffusionGemma: 4x faster text generation — Google
  2. NVIDIA Accelerates Google DeepMind's DiffusionGemma for Local AI
  3. DiffusionGemma
  4. Google's latest DiffusionGemma open AI model comes with a 4x speed boost