Fine-Tune Gemma 4 with Unsloth

Unsloth, a library for training AI models more efficiently, published a fine-tuning guide this month for Gemma 4, Google’s family of open-weights models. The guide covers all five variants in the line-up.

According to the Unsloth documentation, the library trains Gemma 4 roughly 1.5x faster and uses around 60% less memory than standard setups, and Unsloth claims this comes with no loss of accuracy. The guide also documents several Gemma 4-specific training bugs that Unsloth says it has fixed at the library level.

What Unsloth released alongside it

The team also launched Unsloth Studio, a no-code local app for loading, training and exporting models that runs on macOS, Windows and Linux. Free cloud-hosted notebooks cover the smaller variants; the notebooks for the larger variants target high-memory hardware.

What the Ideas2IT team did with it

The engineering team at Ideas2IT walked the same path on a free cloud notebook and documented every step. Their target was a small HR policy assistant, fine-tuned on synthetic conversations generated from the firm’s own policy documents. They used Unsloth’s data tool to turn raw PDFs into a structured training set, fine-tuned the smallest Gemma 4 variant, and exported the result.

Their published result: a small synthetic dataset, trained for a short run on the free tier, produced a model that stopped inventing policy details. The fine-tuned model pulled in the specific policy terms and eligibility conditions from the training set, where the base model produced only generic HR language. The full pipeline, from raw PDFs to a deployable model, fits inside a single notebook.

20 minutes of training on a free cloud notebook was enough to turn the smallest Gemma 4 variant into a model that stopped inventing policy details.

The five Gemma 4 variants Unsloth supports, and the hardware each demands at fine-tuning time:

E2B — 2B parameters, multimodal. LoRA fits on 8–10GB VRAM; inference in 2–6GB. Audio support is exclusive to E2B and E4B.
E4B — 4B, multimodal. LoRA needs ~17GB VRAM. Unsloth recommends this over E2B for most cases because the quantisation accuracy hit is small.
12B — text and vision, mid-sized sweet spot.
26B-A4B — mixture-of-experts. LoRA needs more than 40GB VRAM.
31B — QLoRA on 22GB. Quality target, A100 territory in practice.
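As a rough way to read the list above against a given GPU, the sketch below encodes the quoted fine-tuning figures and returns the variants that should fit. The function name, the table layout and the choice to use the upper end of the 8–10GB range for E2B are assumptions made for illustration; the 12B variant is left out because no memory figure is quoted for it, and the card sizes in the usage loop are nominal capacities, not usable memory.

```python
# Fine-tuning memory figures as quoted in the list above (GB of VRAM).
# "strict" marks the 26B-A4B entry, which is quoted as "more than 40GB".
# The 12B variant is omitted because no figure is quoted for it.
REQUIREMENTS = {
    "E2B": {"method": "LoRA", "min_gb": 10, "strict": False},     # upper end of 8-10GB
    "E4B": {"method": "LoRA", "min_gb": 17, "strict": False},     # quoted as ~17GB
    "26B-A4B": {"method": "LoRA", "min_gb": 40, "strict": True},  # more than 40GB
    "31B": {"method": "QLoRA", "min_gb": 22, "strict": False},    # QLoRA on 22GB
}


def variants_that_fit(vram_gb: float) -> list[str]:
    """Return the variants whose quoted fine-tuning figure fits in vram_gb."""
    fits = []
    for name, req in REQUIREMENTS.items():
        if req["strict"]:
            ok = vram_gb > req["min_gb"]
        else:
            ok = vram_gb >= req["min_gb"]
        if ok:
            fits.append(name)
    return fits


if __name__ == "__main__":
    # Nominal card capacities; real usable memory is somewhat lower.
    for label, gb in [("T4", 16), ("L4", 24), ("A100 80GB", 80)]:
        print(f"{label} ({gb}GB): {variants_that_fit(gb)}")
```

The figures are thresholds from the guide as reported here, so treat a marginal fit (for example E4B on a card with exactly 17GB) as something to test, not a guarantee.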

Context windows run to 128K tokens for the small variants and 256K for the 26B and 31B. Gemma 4 supports 140 languages.

The Ideas2IT recipe that fits on a free cloud notebook: LoRA rank 16, alpha equal to rank, 8-bit AdamW optimiser, learning rate 2e-4, 100 training steps, effective batch size 4 via gradient accumulation. Training time: 15–20 minutes on a T4, 8–10 on an L4, around 5 on an A100. A 100-conversation synthetic dataset was the published minimum for a clear effect.
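The arithmetic behind that recipe can be sketched in a few lines. The field names below mirror common Hugging Face and PEFT parameter names (`r`, `lora_alpha`, `optim`, `gradient_accumulation_steps`), but that mapping is an assumption, as is the split of the effective batch size into a per-device batch of 1 with 4 accumulation steps; the article only gives the product. The sketch also assumes a single GPU and one conversation per training example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hyperparameters quoted for the Ideas2IT run (split of batch size assumed)."""
    r: int = 16                           # LoRA rank
    lora_alpha: int = 16                  # alpha equal to rank
    optim: str = "adamw_8bit"             # 8-bit AdamW
    learning_rate: float = 2e-4
    max_steps: int = 100
    per_device_train_batch_size: int = 1  # assumption: only the product (4) is quoted
    gradient_accumulation_steps: int = 4  # assumption: only the product (4) is quoted

    @property
    def effective_batch_size(self) -> int:
        # Examples contributing to each optimiser step on a single GPU.
        return self.per_device_train_batch_size * self.gradient_accumulation_steps

    def examples_seen(self) -> int:
        # Total training examples consumed over the whole run.
        return self.max_steps * self.effective_batch_size

    def passes_over(self, dataset_size: int) -> float:
        # How many times the run cycles through a dataset of this size.
        return self.examples_seen() / dataset_size


if __name__ == "__main__":
    recipe = Recipe()
    print("effective batch size:", recipe.effective_batch_size)
    print("examples seen:", recipe.examples_seen())
    print("passes over 100 conversations:", recipe.passes_over(100))
```

Under these assumptions, 100 steps at an effective batch size of 4 consume 400 examples, so a 100-conversation dataset is seen about four times in the run.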

Loss quirks: E2B and E4B start at 13–15 because of the multimodal architecture, not a configuration error. The 26B and 31B start at 1–3. If you see losses in the hundreds, gradient accumulation is being miscounted — Unsloth ships a fix. Use the gemma-4-thinking chat template on the larger models, the standard gemma-4 template on the small ones, and mix at least 75% reasoning-style examples to preserve thinking behaviour.
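The starting-loss ranges above lend themselves to a simple sanity check at the top of a training log. The helper below is a hypothetical illustration, not part of Unsloth: the function name, the return strings and the cut-off of 100 for "losses in the hundreds" are all assumptions, and the 12B variant has no entry because no range is quoted for it.

```python
# Starting-loss ranges as quoted above; 12B is absent because none is quoted.
EXPECTED_START = {
    "E2B": (13.0, 15.0),
    "E4B": (13.0, 15.0),
    "26B-A4B": (1.0, 3.0),
    "31B": (1.0, 3.0),
}


def diagnose_start_loss(variant: str, loss: float) -> str:
    """Classify the first logged training loss for a Gemma 4 variant."""
    if loss >= 100:
        # "Losses in the hundreds" point at miscounted gradient accumulation.
        return "check gradient accumulation"
    expected = EXPECTED_START.get(variant)
    if expected is None:
        return "no range quoted"
    low, high = expected
    if low <= loss <= high:
        return "expected"
    return "outside quoted range"


if __name__ == "__main__":
    print(diagnose_start_loss("E2B", 14.2))
    print(diagnose_start_loss("31B", 14.2))
    print(diagnose_start_loss("E4B", 350.0))
```

A result of "outside quoted range" is only a prompt to look closer; the quoted ranges are approximate and a value slightly outside them is not necessarily a fault.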

Export options: a LoRA adapter (around 100–200MB), a merged model, or a portable file for local runtimes such as Ollama or llama.cpp. The data tool turns PDFs and manuals into structured training data — Ideas2IT used 1,200-token chunks with 200-token overlap.
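The chunking scheme mentioned above (1,200-token chunks with a 200-token overlap) amounts to a sliding window with a stride of 1,000 tokens. The sketch below shows that technique on a plain list of tokens; it is not the Unsloth data tool's code, the function name is hypothetical, and the whitespace split in the usage example is a stand-in for a real tokenizer.

```python
def chunk_tokens(tokens: list, chunk_size: int = 1200, overlap: int = 200) -> list[list]:
    """Split tokens into overlapping windows of at most chunk_size tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # 1,000 tokens with the defaults
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # This window already reaches the end; avoid a redundant tail chunk.
            break
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace-separated words stand in for model tokens.
    text = "policy " * 2500
    chunks = chunk_tokens(text.split())
    print([len(c) for c in chunks])
```

Each chunk repeats the last 200 tokens of the previous one, so a policy clause that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.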

What to do with this

For a small UK firm wondering whether fine-tuning is worth an afternoon, three questions frame the test:

Is the task narrow and repetitive? A fine-tuned small model is at its best on a fixed format: customer replies, contract clauses, internal Q&A, classification, structured summarisation. It is not a replacement for a frontier model on open-ended reasoning.
Is the format as important as the content? If your house style, product names or policy terms are the part that keeps breaking under prompting, a fine-tune is the lever.
Can a human review the training set? A synthetic dataset generated unsupervised will bake in the generator’s confusions. Five minutes of human review before the run is worth the effort.

The hardware bar is now low enough that a sole trader can test the thesis on a free cloud notebook, no workstation required. The data-residency argument is the one to put in front of a sceptical partner or DPO: once trained, the model runs locally on kit the firm already owns, and the source documents never have to leave the building during inference. If the experiment works, you have a model that sounds like your firm. If it does not, you have lost an afternoon.

