Featured · Deep Dive · Local Inference

Running Local Inference on the New Gemma Models — From Departmental Hardware

How small teams are deploying quantised Gemma 3 models on commodity GPUs to run private, offline inference pipelines. No cloud required, no data leaving the building.

R
RAR Editor
Published June 2026 · 8 min read
The Quick Version
  • Gemma 3's 4B and 12B quantised variants run on consumer 8–16GB GPUs — no GPU farm required.
  • A ~£1,800 reconditioned workstation matches roughly four months of mid-tier API spend.
  • The stack stays minimal: Ollama, a FastAPI wrapper and n8n — no Docker expertise needed.
  • Early result: a 40% cut in admin overhead, with the hardware paying back in 6–8 months.

The privacy calculus for small teams has shifted. Running a 7B-parameter model on a consumer GPU no longer requires a GPU farm — it requires an afternoon, a USB stick, and a willingness to ignore the SaaS sales funnel.

Why Local Inference Changes the Equation

For teams handling sensitive client data — accountants, legal professionals, HR administrators — the cloud inference paradigm carries a systemic risk that enterprise data processing agreements rarely fully resolve. Local inference eliminates that risk category entirely.

Gemma 3, Google’s latest open-weights model family, arrives at a pivotal moment. The 4B and 12B quantised variants run efficiently on consumer RTX-class GPUs with 8–16GB of VRAM, making departmental deployment viable without a capital approval process.

Hardware Baseline

For this deployment we worked with an accounts team running a reconditioned workstation: an RTX 3090 (24GB VRAM), 64GB of DDR4, and a 2TB NVMe. Total hardware cost was roughly £1,800 — equivalent to about four months of a mid-tier LLM API subscription at moderate usage volume.

  • GPU — RTX 3090, 24GB VRAM. Runs the 12B model with comfortable headroom.
  • Memory — 64GB DDR4. Enough for inference plus the surrounding workflow tooling.
  • Storage — 2TB NVMe. Holds the model weights and a local document cache.
  • Total cost — ~£1,800, reconditioned. Roughly four months of mid-tier API spend.

Model Selection

The team evaluated three quantisation levels across Gemma 3 variants. The Q4_K_M quantisation of the 12B model emerged as the practical default: strong instruction-following on document tasks, a ~6GB VRAM footprint at rest, and sub-two-second first-token latency on a cold start.

40%reduction in admin overhead across invoice processing and document classification within the first operational month.

The Deployment Stack

The integration stack is deliberately minimal: Ollama as the inference server, a lightweight FastAPI wrapper for structured output, and n8n for workflow orchestration. No Docker knowledge required beyond a basic install; the entire stack runs as persistent background services that start with the machine.

  • Ollama. The local inference server, exposing an OpenAI-compatible API on port 11434.
  • FastAPI wrapper. A thin layer that enforces structured JSON output for downstream steps.
  • n8n. Orchestrates the end-to-end workflow, from email trigger to spreadsheet write.
# Install Ollama and pull the model
curl -fsSL https://ollama.com/install.sh | sh
ollama pull gemma3:12b-instruct-q4_K_M

# Verify the deployment
ollama run gemma3:12b-instruct-q4_K_M \
  "Summarise this invoice in JSON format: ..."

Connecting to n8n

With Ollama exposing a local OpenAI-compatible API on port 11434, integration with n8n requires nothing more than configuring an HTTP Request node to point at http://localhost:11434/v1/chat/completions, with the model name specified in the request body.

The workflow receives an email trigger, calls the local inference endpoint with a structured extraction prompt, and writes the parsed output to a shared spreadsheet — no human in the loop unless confidence falls below a defined threshold.

Limitations and Caveats

Local inference is not a universal replacement for cloud APIs. For tasks requiring real-time web retrieval, multi-modal reasoning at scale, or seamless multi-user collaborative access, cloud inference remains the pragmatic choice. The value proposition is narrower but sharp: sensitive document processing, repeatable structured extraction, and workflows where data residency is non-negotiable.

The hardware pays back in roughly six to eight months at moderate usage. After that, inference cost is effectively zero.

Next Steps

The complete blueprint for this deployment — including the Ollama install script, the FastAPI structured-output wrapper, and the n8n workflow JSON — is available in the Blueprints section. It is designed to be deployed in under 90 minutes on compatible hardware.

Filed under Workflows · Case Studies

Continue Reading