DiffusionGemma: 4x faster open text model

Google DeepMind released an experimental open text model called DiffusionGemma on 10 June 2026. NVIDIA says the model can run up to four times faster than a standard local model on the same single-user hardware, and the licence means no per-word bill once it is running on your own kit. For a UK small firm using AI on a single office workstation — running a chat assistant, an internal tool, or an automated workflow — that gap is the difference between an assistant that feels responsive and one that crawls.

How the model works

Almost every widely used AI writes one word at a time. DiffusionGemma works differently: it drafts a whole block of text in one pass and then refines it, the same approach image generators use to turn noise into a picture. When a single user runs a standard model on their own machine, it spends most of its time waiting between words; the graphics hardware is fast but sits idle in the gaps. DiffusionGemma gives the hardware a block of work to process in parallel, and that is the speed-up NVIDIA and Ars Technica both measured when they tested the release. The practical effect is that a model that once felt sluggish on a single high-end card now feels closer to a cloud-grade service, with no per-word bill on top.
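To make the difference concrete, here is a rough sketch that counts how many passes have to happen one after another under each approach. The 256-token block size is the figure reported for DiffusionGemma; the number of passes per block (16) is a hypothetical value picked for illustration, not a published one.

```python
import math

BLOCK_TOKENS = 256  # block size reported for DiffusionGemma


def word_by_word_passes(n_tokens: int) -> int:
    """A word-by-word model needs one pass per token, each waiting on the last."""
    return n_tokens


def block_passes(n_tokens: int, passes_per_block: int = 16) -> int:
    """A block model drafts and refines each block in a fixed number of passes.

    passes_per_block is a hypothetical value chosen for illustration.
    """
    blocks = math.ceil(n_tokens / BLOCK_TOKENS)
    return blocks * passes_per_block


for n in (256, 1024, 4096):
    print(f"{n:>5} tokens: {word_by_word_passes(n):>5} sequential passes "
          f"word by word, {block_passes(n):>4} in blocks")
```

A pass over a whole block does far more work than a pass that produces one token, so these counts are not timings. They only show why the block approach leaves fewer idle gaps for the hardware.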

4× the speed of an equivalent word-by-word model on the same single-user hardware, per NVIDIA’s own benchmark

What the sources say

Google’s announcement (10 June 2026): positions DiffusionGemma for speed-critical local work — in-line editing, code completion, rapid iteration, and other non-linear text tasks. Standard Gemma 4 remains the recommendation where maximum output quality is the priority.
NVIDIA: reports the model runs roughly four times faster than an equivalent word-by-word model on the same single-user setup, with optimisation across consumer cards and the DGX range.
Ars Technica: tested the release on a high-end gaming graphics card and confirmed the speed boost. Flagged the higher error rate and the cost on short replies.
Simon Willison: ran NVIDIA’s free hosted demo and recorded at least 500 tokens/second on a typical text task.

Two trade-offs the sources flag

Both Ars Technica and NVIDIA name the same two costs in their coverage.

First, short replies are expensive. The model still has to refine a full block even when the user wants a one-line answer, so a quick classification or a 20-word summary is faster on the older word-by-word approach. Second, the error rate is higher. A single badly placed word can corrupt the whole block and force a fresh attempt — a problem image generators can absorb but text cannot. Google’s own post labels DiffusionGemma experimental and recommends standard Gemma 4 where reliability is the priority.
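The short-reply cost can be put into rough numbers. The sketch below assumes the model always pays for a full 256-token block and that NVIDIA's four-times figure holds as a flat ratio on full blocks; both are simplifications. The baseline speed of 50 tokens/second is hypothetical.

```python
import math

BLOCK_TOKENS = 256  # block size reported for DiffusionGemma
SPEEDUP = 4.0       # NVIDIA's "up to" figure, treated here as a flat ratio


def word_by_word_seconds(n_tokens: int, baseline_tps: float) -> float:
    """Time for a word-by-word model: proportional to reply length."""
    return n_tokens / baseline_tps


def block_seconds(n_tokens: int, baseline_tps: float) -> float:
    """Time for a block model, assuming every block costs as much as a full one."""
    blocks = math.ceil(n_tokens / BLOCK_TOKENS)
    return blocks * BLOCK_TOKENS / (SPEEDUP * baseline_tps)


def break_even_tokens() -> float:
    """Reply length at which one block takes as long as word-by-word output."""
    return BLOCK_TOKENS / SPEEDUP


BASELINE_TPS = 50.0  # hypothetical word-by-word speed on the same machine

for n in (20, 64, 200):
    print(f"{n:>3} tokens: word by word {word_by_word_seconds(n, BASELINE_TPS):.2f} s, "
          f"block {block_seconds(n, BASELINE_TPS):.2f} s")
print("break-even at", break_even_tokens(), "tokens")
```

Under these assumptions the crossover sits at 64 tokens: shorter replies come back sooner from the word-by-word model, longer ones from the block model. Real crossover points will differ, which is one reason to measure on your own workload.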

DiffusionGemma is a 26-billion-parameter mixture-of-experts build of Gemma 4 that activates roughly 3.8 billion parameters per step. It refines up to 256 tokens in parallel on each pass. A quantised build fits in around 18GB of graphics-card memory.
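A quick back-of-envelope check shows how those figures relate. The sources do not say which precision the 18GB build uses, so the four-bit case below is an assumption, as is the 16-bit comparison.

```python
TOTAL_PARAMS = 26e9    # total parameters, as reported
ACTIVE_PARAMS = 3.8e9  # parameters active per step, as reported


def weight_gigabytes(params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return params * bits_per_param / 8 / 1e9


print(f"16-bit weights: {weight_gigabytes(TOTAL_PARAMS, 16):.1f} GB")
print(f" 4-bit weights: {weight_gigabytes(TOTAL_PARAMS, 4):.1f} GB")
print(f"active share per step: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

At four bits the weights alone come to about 13 GB, against 52 GB at 16 bits. The gap between 13 GB and the reported 18GB would have to cover everything else the model needs at run time; the sources do not break that down.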

Reported throughput, all from the sources cited above: roughly 1,000 tokens/second on a single NVIDIA H100; around 700 tokens/second for a quantised build on a single GeForce RTX 5090; around 150 tokens/second on an NVIDIA DGX Spark; up to 2,000 tokens/second on an NVIDIA DGX Station; and at least 500 tokens/second on NVIDIA’s hosted free demo, measured by Simon Willison.
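Tokens per second are easier to judge as waiting time. This sketch converts the reported figures into seconds for a draft of 1,000 tokens, a hypothetical length chosen for illustration. The DGX Station figure is an upper bound and the hosted-demo figure a lower bound, so treat those two rows accordingly.

```python
# Throughput figures as reported in the sources cited in this article.
REPORTED_TOKENS_PER_SECOND = {
    "NVIDIA H100": 1000,
    "GeForce RTX 5090 (quantised)": 700,
    "NVIDIA DGX Spark": 150,
    "NVIDIA DGX Station (up to)": 2000,
    "NVIDIA hosted demo (at least)": 500,
}

DRAFT_TOKENS = 1000  # hypothetical draft length


def seconds_for(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to produce n_tokens at a steady rate."""
    return n_tokens / tokens_per_second


for name, rate in REPORTED_TOKENS_PER_SECOND.items():
    print(f"{name:<32} {seconds_for(DRAFT_TOKENS, rate):5.2f} s")
```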

Support is available from day one in Hugging Face Transformers, vLLM and Unsloth, with MLX, NVIDIA NeMo and llama.cpp following. NVIDIA worked with Google DeepMind on optimisation across GeForce RTX 5090 and 4090 cards, RTX PRO workstations, DGX Spark, DGX Station and the H100, including native support for NVFP4, a four-bit floating-point format. The model is open under Apache 2.0.

What to try, and what to weigh up

For a UK small firm, the appeal is simple: the work that used to crawl on a desktop AI box gets closer to interactive, and the model fits the kind of tasks Google’s announcement calls out — in-line editing, code completion, redrafting where the model revises a block rather than starting from the left. It also pairs with the multi-step AI assistants we have covered before, where a tool-using helper calls an API, reads the result, and writes a summary across several passes.

Two paths to try it. If you have a recent high-end NVIDIA graphics card, pull the model weights from Hugging Face and run them through Hugging Face Transformers, vLLM or Unsloth — all three are supported from day one. If you would rather not commit the kit, NVIDIA is hosting the model free on its NIM API for a quick comparison test. For the wider picture on running local models on a single workstation, our Gemma 4 on your own hardware guide and the LM Studio vs Ollama comparison are still the right starting points.

Where to be careful. Run any pilot on a non-customer-facing workflow for a week. Both NVIDIA and Ars Technica flag a higher error rate than standard Gemma 4 on the same hardware, and a real cost on short replies. For one-line classifications, the older word-by-word models still win. For longer, block-shaped jobs where you were already running a local model, the trade is worth testing on your own data first.
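One way to keep a pilot honest is to log every run and summarise speed and error rate per model at the end of the week. The sketch below is a minimal example of that bookkeeping. The record fields, the model labels and the sample numbers are all hypothetical, and deciding whether an output counts as acceptable is left to whoever reviews it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PilotRecord:
    model: str          # label for the setup under test (hypothetical names below)
    output_tokens: int  # tokens in the reply
    seconds: float      # wall-clock time for the reply
    acceptable: bool    # reviewer's verdict on the output


def summarise(records):
    """Return tokens/second and error rate for each model in the log."""
    totals = defaultdict(lambda: {"runs": 0, "tokens": 0, "seconds": 0.0, "failures": 0})
    for r in records:
        t = totals[r.model]
        t["runs"] += 1
        t["tokens"] += r.output_tokens
        t["seconds"] += r.seconds
        t["failures"] += 0 if r.acceptable else 1
    return {
        model: {
            "tokens_per_second": t["tokens"] / t["seconds"],
            "error_rate": t["failures"] / t["runs"],
        }
        for model, t in totals.items()
    }


# Made-up sample log, for illustration only.
log = [
    PilotRecord("current-setup", 400, 8.0, True),
    PilotRecord("current-setup", 600, 12.0, True),
    PilotRecord("diffusion-pilot", 400, 2.0, True),
    PilotRecord("diffusion-pilot", 600, 3.0, False),
]

for model, stats in summarise(log).items():
    print(f"{model:<16} {stats['tokens_per_second']:7.1f} tok/s  "
          f"error rate {stats['error_rate']:.0%}")
```

Four records prove nothing; a week of real jobs gives both numbers enough weight to compare.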

Our view: for a firm paying cloud rates for a chat-shaped workload that a single office box could run, DiffusionGemma is the most interesting local release in months. Run a one-week pilot, compare the error rate against your current setup, and only swap it in once you have seen real numbers on your own data.

