Gemma 4 outpaces Qwen 3.6 on code review

Gemma 4 finishes the code review first

A controlled benchmark on the Kaitchup substack and a self-hoster’s field report both reach the same verdict: Google’s Gemma 4 31B beats Alibaba’s Qwen 3.6 27B on agentic code work, and finishes faster. The surprising variable is Multi-Token Prediction (MTP), a technique that drafts several tokens at once to speed up generation. Gemma 4’s MTP implementation is doing real work; Qwen 3.6’s is producing weaker output on coding tasks.

Kaitchup ran both models through identical accuracy, latency and memory tests. Qwen 3.6 dominated hard maths (AIME-style problems, where its CoDeC contamination score came in above 62, rare in this size class) and world knowledge (MMLU Pro). Gemma 4 31B held a lead on instruction following (IFBench), graduate-level reasoning (GPQA Diamond) and raw latency. A larger model running faster than a smaller dense one is the headline that took off on X.

What the benchmarks actually show

Kaitchup’s numbers, cross-checked against Artificial Analysis on at least one metric, paint a more nuanced picture:

Hard maths (AIME): Qwen 3.6 ahead of both Qwen 3.5 and Gemma 4. CoDeC score above 62.
World knowledge (MMLU Pro): Qwen 3.6 ahead.
Single-turn coding (LiveCodeBench): Qwen 3.6 ahead of Qwen 3.5 but behind Gemma 4 on pass@1; tied at pass@4.
Instruction following (IFBench): Gemma 4 ahead by a wide margin.
Graduate reasoning (GPQA Diamond): Gemma 4 ahead — a surprise, since Alibaba’s own numbers claim a 2.3-point improvement for Qwen 3.6. Kaitchup suspects different evaluation setups; Artificial Analysis found the same.
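
The pass@1 and pass@4 figures on LiveCodeBench can diverge for a simple reason: pass@k asks whether at least one of k samples is correct. The sketch below uses the widely used unbiased pass@k estimator; whether Kaitchup computed its numbers exactly this way is an assumption, and the sample counts in the usage note are hypothetical, not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total generations sampled, c: how many of them passed the tests,
    k: the budget being scored. Returns the probability that at least one
    of k generations drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With four hypothetical samples per problem, a model that passes three of them scores pass_at_k(4, 3, 1) = 0.75, while one that passes only once scores pass_at_k(4, 1, 1) = 0.25. Both score 1.0 at k = 4, which is how one model can lead on pass@1 and still tie on pass@4.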

Qwen 3.6 is sharper on raw knowledge and maths; Gemma 4’s combination of a mixture-of-experts (MoE) architecture — where only some parameters fire per token — plus MTP is calmer and faster on the agent workflow that matters in practice.

The MTP surprise in the field

Qwen 3.6 27B is great but I have found Gemma 4 31B much more reliable. It doesn’t overthink, uses the right tools only when needed, and can run faster thanks to its superior MTP design. A larger model running faster than a smaller one, that’s crazy!!

— Behnam (@OrganicGPT), X, 6 June 2026

Benchmarks don’t always survive contact with real code. One self-hoster running Qwen 3.6 27B Q8_K_XL (an 8-bit quantisation tuned for quality) on four RTX 5070 Ti cards through llama.cpp and the OpenCode CLI reported that, given the simple prompt “Do a code review of this branch.”, the non-MTP variant produced more findings, in more detail, than the MTP variant in roughly eight out of ten runs.

MTP is a latency play, not always a quality play. For code review and other reasoning-heavy agentic tasks, drafting multiple tokens at once can hurt as much as it helps. The post quoted above credits Gemma 4’s MTP design for its speed and, separately, praises the model for not overthinking simple steps and for invoking tools only when they’re needed.
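
To see why MTP is a latency lever whose payoff depends on the design, here is a minimal sketch of the standard speculative-decoding estimate. It assumes the runtime uses MTP as a draft-and-verify scheme and that each drafted token is accepted independently with the same probability; both are simplifications, and the acceptance rates in the usage note are hypothetical rather than measured for either model. It models speed only and says nothing about the quality gap in the field report.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    accept_rate: probability that a single drafted token is accepted
                 (assumed identical and independent for every position).
    draft_len:   number of tokens drafted ahead before each verification.

    The verifier always contributes one token itself, so the result lies
    between 1 (every draft rejected) and draft_len + 1 (every draft accepted).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)
```

Drafting three tokens with a hypothetical 80% acceptance rate yields about 2.95 tokens per pass; at 50% it drops to 1.875. A better-tuned drafter therefore buys more speed from the same draft length, which is one plausible reading of why MTP quality matters as much as MTP support.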

For UK teams self-hosting on modest hardware, MTP support varies by engine: llama.cpp doesn’t yet support MTP for Gemma 4 31B, so if you want the speed-up you’ll need vLLM (an inference engine optimised for serving models at scale) or another runtime.

The two models head-to-head

	Qwen 3.6 27B	Gemma 4 31B
Architecture	Dense (all params active)	MoE (only some active per token)
MTP support	Yes, but quality varies by engine	Yes, well-tuned; not yet in llama.cpp
Best at	AIME maths, MMLU Pro	Instruction following, GPQA Diamond
Quantisation tolerance	Higher — INT8 and hybrid INT4–BF16 hold up	Lower — full BF16 (16-bit precision) preferred
Recommended engine	llama.cpp or vLLM	vLLM, or MLX on Apple Silicon
Approx VRAM (full BF16)	around 54 GB	around 62 GB
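
The VRAM row follows from simple arithmetic: parameter count times bytes per weight. The sketch below covers weights only and uses decimal gigabytes; KV cache, activations and runtime overhead are left out, which is an assumption worth remembering when sizing a real card.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    params_billions: parameter count in billions (27 for a 27B model).
    bits_per_weight: 16 for BF16, 8 for INT8, roughly 4 to 5 for Q4/Q5 quants.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("both arguments must be positive")
    # billions of params * bits / 8 bits-per-byte = billions of bytes = GB
    return params_billions * bits_per_weight / 8.0
```

At 16 bits, weight_memory_gb(27, 16) gives 54.0 and weight_memory_gb(31, 16) gives 62.0, matching the table. At a flat 4 bits the same models need 13.5 GB and 15.5 GB for weights; real Q4 and Q5 files use slightly more bits per weight on average, and context adds to that.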

Why this matters for inference cost: Gemma 4 31B, despite being larger, uses fewer active parameters per token thanks to MoE, so end-to-end cost to hit a target accuracy is often lower than Qwen 3.6 27B on coding tasks.

How to try it this afternoon

You don’t need a four-GPU rig. A single 24 GB card runs both models in Q4 or Q5 quantisation (4-bit or 5-bit — quality is good enough for code review, and the models fit in roughly 18–22 GB of VRAM).

Pull both with Ollama (ollama pull qwen3.6:27b and ollama pull gemma4:31b), or browse the Qwen and Gemma repos on Hugging Face for a specific quant. We compared Ollama and LM Studio in “LM Studio vs Ollama in 2026” if you want the trade-offs first.
Install OpenCode CLI (npm i -g opencode-ai), a small open-source coding agent that talks to local endpoints via Ollama.
Point both at the same prompt on a small repo: “Do a code review of this branch and list findings with file:line references.” Save each output separately.
Time them. Wall-clock seconds and total tokens consumed. MoE-vs-dense and MTP differences show up clearly at the token level.
Turn MTP on and off in vLLM to reproduce the field report. With Qwen 3.6, expect the non-MTP run to be more thorough; with Gemma 4, MTP is the speed lever and quality stays flat.
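
For the timing step, a minimal sketch that skips OpenCode and calls Ollama's local REST endpoint directly can make the numbers easier to compare. It assumes Ollama is running on its default port and that the model tags from the first step are already pulled. The helper names generate and summarise are illustrative only, and a bare prompt sent this way carries no repository context, so paste the diff into the prompt if you want a meaningful review.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if you run it elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarise(reply: dict) -> dict:
    """Turn Ollama's timing fields (reported in nanoseconds) into a summary."""
    eval_seconds = reply["eval_duration"] / 1e9
    output_tokens = reply["eval_count"]
    return {
        # total_duration includes model load time on a cold start
        "wall_s": reply["total_duration"] / 1e9,
        "output_tokens": output_tokens,
        "total_tokens": reply.get("prompt_eval_count", 0) + output_tokens,
        "tokens_per_s": output_tokens / eval_seconds if eval_seconds > 0 else 0.0,
    }
```

Calling summarise(generate("qwen3.6:27b", prompt)) and then the same with "gemma4:31b" gives two comparable dictionaries. Run each model once as a warm-up first, so load time does not skew the wall-clock figure.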

What to weigh up:

Gemma 4 31B wins if your daily workload is agent-style coding, code review, or anything where “stop thinking and call the tool” matters more than raw knowledge.
Qwen 3.6 27B wins if you want one model for maths, summarisation and reasoning-heavy Q&A without swapping weights — and you’re quantising hard.
If you’re tight on VRAM, the Qwen3.6-35B-A3B MoE we covered in “Qwen3.6-35B-A3B is the local coding agent” stays under 24 GB.

