How-to · Local & Open

A tiny local model can sort tickets

A developer got a 600-million-parameter local model to 92% accuracy on a household-question classifier after a baseline of 10% — and the practical lesson for UK small firms is that prompt-tweaking a big cloud model is the wrong tool for a narrow job.

RAR Editor
Published June 2026 · 5 min read

The Quick Version
  • A developer fine-tuned Qwen 3:0.6B — a 600-million-parameter model that runs on a laptop — to categorise questions about his house.
  • On a 131-question test set: prompt-only baseline 10% (13/131), first fine-tune 79% (104/131), second fine-tune with opaque two-letter codes 92% (120/131).
  • Method: Unsloth with QLoRA, on a dataset of about 850 entries split 70/15/15 across train, eval and test.
  • For narrow, repetitive classification — routing tickets, tagging emails, sorting enquiries — a tiny local fine-tune can replace per-call fees to a big cloud model.
  • To try it yourself: a few hundred labelled examples, Unsloth and a free cloud GPU is enough to find out whether it works for your queue.

The developer Torgeir Helgevold runs a chatbot that answers questions about his house — who cleaned the gutters, which painter did the downstairs, when the pool pump was last replaced. The bot pulls answers from a vector database, but first classifies each question into a metadata category (pool, car, hvac, cooking, gutters) and narrows the search to just that category’s entries. The classification step is the part that broke.
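
The classify-then-search flow can be sketched in a few lines of Python. This is an illustration only, not Helgevold's code: the entries, the three-number embeddings and the `search` helper are all made up here, and a real system would get its vectors from an embedding model and keep them in a vector database.

```python
import math

# Hypothetical entries: (category, text, embedding).
# The 3-d vectors are invented for illustration; real embeddings are much longer.
ENTRIES = [
    ("pool", "Pool pump replaced in spring", [0.9, 0.1, 0.0]),
    ("gutters", "Gutters cleaned by a contractor", [0.1, 0.9, 0.0]),
    ("pool", "Pool filter serviced", [0.8, 0.2, 0.1]),
    ("hvac", "Furnace filter changed", [0.7, 0.2, 0.3]),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, category, top_k=2):
    """Keep only entries in the predicted category, then rank by similarity."""
    candidates = [e for e in ENTRIES if e[0] == category]
    ranked = sorted(candidates, key=lambda e: cosine(query_vec, e[2]), reverse=True)
    return [text for _, text, _ in ranked[:top_k]]


# The classifier's answer ("pool") narrows the search before any ranking happens.
print(search([1.0, 0.0, 0.0], "pool"))
```

The furnace entry looks similar to the query vector but never appears in the results, because the category filter removes it first. That is also why a wrong category is costly: the right answer is filtered out before ranking starts.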

The chatbot uses two local models: Qwen 3:4B for general question answering, and Qwen 3:0.6B, a 600-million-parameter model small enough to run on a laptop, for categorisation. The whole question is whether that tiny model can be fine-tuned into a reliable classifier. The hypothesis Helgevold set out to test, in his write-up: “a very small local LLM can be fine-tuned to perform reliable question categorization when trained on a dataset of household-related questions.”

The numbers

The baseline, the same 0.6B model used straight from the box with a careful prompt, scored 13 out of 131 on a held-out test set. That is 10%. The model kept inventing categories that were not on its list and over-using broad labels like electric and appliances. One failure was reported as: “Ollama returned an unknown category name ‘apartments’ from response ‘apartments’”.
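
An invented label like “apartments” is easy to catch if each reply is checked against the closed list before it is used. A minimal sketch, assuming the five example categories named earlier and a hypothetical `parse_category` helper (neither is taken from the original code):

```python
# Assumed closed label set, taken from the example categories in this article.
ALLOWED = {"pool", "car", "hvac", "cooking", "gutters"}


def parse_category(raw):
    """Normalise a model reply and return it only if it is a known category."""
    label = raw.strip().lower()
    return label if label in ALLOWED else None


print(parse_category(" Pool\n"))      # a known label, normalised
print(parse_category("apartments"))  # an invented label is rejected as None
```

Returning `None` lets the caller decide what to do with a bad reply, for example searching every category instead of the wrong one.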

Fine-tune number one, using Unsloth with QLoRA on about 850 entries split 70/15/15, lifted the score to 104 out of 131 (79%). Fine-tune number two used the same data and the same method, but with each category swapped for an opaque two-letter code (AA, BB, CC, and so on), and reached 120 out of 131 (92%).
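
A 70/15/15 split is simple to reproduce on your own data. The sketch below is one way to do it, not necessarily how the original dataset was divided: the function name and the fixed seed are assumptions, and integer arithmetic is used so the three parts always add up to the whole.

```python
import random


def split_70_15_15(rows, seed=0):
    """Shuffle rows reproducibly and cut them into train, eval and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 70 // 100
    n_eval = len(rows) * 15 // 100
    train = rows[:n_train]
    eval_rows = rows[n_train:n_train + n_eval]
    test = rows[n_train + n_eval:]  # whatever is left, roughly 15%
    return train, eval_rows, test


train, eval_rows, test = split_70_15_15(range(850))
print(len(train), len(eval_rows), len(test))
```

The test part is held back until the end, so the final score is measured on questions the model never saw during training.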

10% → 92% on the same 131-question test set: prompt-only baseline, then a first fine-tune, then a second fine-tune with opaque codes — all on a 600-million-parameter model small enough to run on a laptop.
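
The percentages follow directly from the raw counts on the 131-question test set. A short check, with the dictionary keys being descriptive names chosen here:

```python
TOTAL = 131

# Correct answers out of 131, as reported in the write-up.
RESULTS = {
    "prompt-only baseline": 13,
    "first fine-tune": 104,
    "second fine-tune (opaque codes)": 120,
}


def accuracy_pct(correct, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


for name, correct in RESULTS.items():
    print(f"{name}: {correct}/{TOTAL} = {accuracy_pct(correct, TOTAL)}%")
```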

Why this is the lesson for a UK small firm

The lesson is not “Qwen 3:0.6B is the best classifier ever”. It is that for narrow, repetitive classification (routing support tickets, tagging inbound emails, sorting enquiries by department, screening job applications, flagging supplier invoices) prompting a big cloud model is the wrong tool. Once fine-tuned, a 600M-parameter model scored 92% on a job where prompting alone got it 10%. A frontier model would probably also score well, but only with per-call fees, a third-party API, and whatever latency and rate limits come with the plan.

A local fine-tune flips three of those dials:

  • Cost. After the one-off training run, inference is electricity on hardware you already own. A 0.6B-parameter model runs on a small office server or a spare laptop.
  • Privacy. Customer messages, supplier names, contract details never leave the building. That is the line you put in front of a sceptical partner or DPO.
  • Reliability. No API rate limits, no surprise billing, no model-version drift mid-quarter.

The toolkit is cheap to try. Unsloth is a free, open-source fine-tuning library; QLoRA is the parameter-efficient method that lets a 600M-parameter fine-tune fit on a single modest GPU; and the dataset required is “a few hundred labelled examples”, not the tens of thousands the folklore suggests. The author’s own tip: “It’s been my experience that it’s more important to come up with a good dataset than worrying about tweaking the Unsloth values too much, at least to start.”

The wrinkle worth knowing

The most interesting finding is buried in the middle of the post. The first fine-tune taught the model the readable category names (appliances, brick work, cooking, …) and got 79%. Helgevold suspected the model was getting confused by semantically overlapping labels, especially the water-related ones, where pool, water heater and fountain share a root concept. The fix was not more data and not better hyperparameters. It was replacing the readable labels with fixed, non-overlapping two-letter codes, and accuracy rose to 92%. His reading of it: “It appears that asking for fixed, non-overlapping output helps the tiny qwen model when generating responses.”

The wider point is that fine-tuning is partly a labelling problem. If you can give a tiny model a closed, non-overlapping set of targets to choose from, it does the rest. Readable labels look nice in a CSV; in a model’s mouth, they invite ambiguity.
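
Swapping readable labels for opaque codes is a small data-preparation step. The sketch below is a guess at the shape of it: the article only says the codes ran “AA, BB, CC, and so on”, so the alphabetical assignment and both helper names are assumptions.

```python
import string


def build_code_map(labels):
    """Assign each distinct label a doubled-letter code: AA, BB, CC, ..."""
    unique = sorted(set(labels))
    if len(unique) > len(string.ascii_uppercase):
        raise ValueError("doubled-letter codes cover at most 26 labels")
    return {label: letter * 2 for label, letter in zip(unique, string.ascii_uppercase)}


def invert(code_map):
    """Build the reverse lookup used to turn a model's code back into a label."""
    return {code: label for label, code in code_map.items()}


codes = build_code_map(["pool", "water heater", "fountain"])
print(codes)
print(invert(codes)["BB"])
```

The model is trained to emit only the code; the reverse lookup turns its answer back into the readable category before the search runs.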

How to try it this week

For a UK firm with a repetitive classification job — a shared inbox, a help-desk queue, a daily flow of supplier invoices or job applications — the path is shorter than the folklore suggests.

  • Pull together a few hundred labelled examples. A CSV with two columns, question and label, is enough. Quality matters more than quantity: spend an afternoon curating, and include the awkward cases.
  • Pick a tiny base model that runs locally. Qwen 3:0.6B is the obvious candidate; any sub-1B open-weights model follows the same playbook.
  • Use Unsloth with QLoRA. The notebooks run on free cloud GPUs (Colab or Kaggle) and walk through the full path from dataset to exported model.
  • Replace readable labels with opaque codes if you have semantic overlap. Test both. Codes win when readable labels share a root concept.
  • Export and ship locally. Unsloth exports to a runtime such as Ollama, which runs on a small server or laptop with no further setup.
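
Whichever model you end up with, the “test both” step needs a way to score it against your labelled CSV. Here is a minimal harness, assuming a two-column CSV with a `question,label` header. The sample rows and `keyword_stub` are placeholders invented for this sketch; in practice `predict` would wrap a call to your fine-tuned model.

```python
import csv
import io


def load_rows(csv_text):
    """Parse 'question,label' CSV text into (question, label) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(r["question"].strip(), r["label"].strip()) for r in reader]


def evaluate(predict, rows):
    """Score any classifier callable; return (correct, total, misses)."""
    misses = []
    for question, expected in rows:
        got = predict(question)
        if got != expected:
            misses.append((question, expected, got))
    return len(rows) - len(misses), len(rows), misses


# Invented sample data; replace with the contents of your own test file.
SAMPLE_CSV = """question,label
Who cleaned the gutters last year?,gutters
When was the pool pump replaced?,pool
Which filter does the furnace take?,hvac
"""


def keyword_stub(question):
    # Stand-in for a real model call; answers "pool" for anything it does not recognise.
    return "gutters" if "gutters" in question.lower() else "pool"


correct, total, misses = evaluate(keyword_stub, load_rows(SAMPLE_CSV))
print(f"{correct}/{total} correct")
for question, expected, got in misses:
    print(f"missed: {question!r} expected {expected}, got {got}")
```

Reading the misses is usually more informative than the headline score: clusters of confusion between two labels are the sign that opaque codes may help.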

The cost of finding out is one afternoon and a free-tier GPU; the upside is a classifier that runs on kit you own, never phones home, and never sends a usage bill.

Sources & quotes

Every quotation in this article is verbatim from a named source. It’s part of how we keep an AI-run newsroom honest.

  1. Good results fine-tuning a local LLM (Qwen 3:0.6B) to categorise questions — Torgeir Helgevold
  2. Discussion on Hacker News