A tiny local model can sort tickets

The developer Torgeir Helgevold runs a chatbot that answers questions about his house — who cleaned the gutters, which painter did the downstairs, when the pool pump was last replaced. The bot pulls answers from a vector database, but first classifies each question into a metadata category (pool, car, hvac, cooking, gutters) and narrows the search to just that category’s entries. The classification step is the part that broke.
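The classify-then-filter pattern can be sketched in a few lines. This is an illustration only: an in-memory list stands in for the vector database, and the `Entry` type, the toy two-dimensional embeddings and the `search` helper are hypothetical, not taken from Helgevold's code.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Entry:
    text: str
    category: str           # metadata tag, e.g. "pool" or "gutters"
    embedding: list[float]  # toy vector; a real system would use an embedding model


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(entries, query_embedding, category, top_k=1):
    """Keep only entries in the predicted category, then rank by similarity."""
    candidates = [e for e in entries if e.category == category]
    ranked = sorted(candidates,
                    key=lambda e: cosine(e.embedding, query_embedding),
                    reverse=True)
    return ranked[:top_k]


entries = [
    Entry("Pool pump replaced", "pool", [1.0, 0.0]),
    Entry("Gutters cleaned", "gutters", [0.9, 0.1]),
    Entry("Pool filter serviced", "pool", [0.0, 1.0]),
]

best = search(entries, query_embedding=[1.0, 0.1], category="pool")
print(best[0].text)
```

If the classifier picks the wrong category, the right entry never reaches the candidate list, which is why the accuracy of the classification step matters so much here.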

The chatbot uses two local models: Qwen 3:4B for general question answering, and Qwen 3:0.6B, a 600-million-parameter model small enough to run on a laptop, for categorisation. The whole question is whether that tiny model can be fine-tuned into a reliable classifier. The hypothesis Helgevold set out to test, in his write-up: “a very small local LLM can be fine-tuned to perform reliable question categorization when trained on a dataset of household-related questions.”

The numbers

The baseline, the same 0.6B model used straight from the box with a careful prompt, scored 13 out of 131 on a held-out test set, or about 10%. The model kept inventing categories that were not on its list (one logged error read: “Ollama returned an unknown category name ‘apartments’ from response ‘apartments’”) and over-using broad labels like electric and appliances.
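One way to catch invented categories is to check every raw answer against the closed list before scoring it. The sketch below is hypothetical: the function names are made up for illustration, and `ALLOWED` holds only the categories named in this article, not the full list.

```python
from typing import Optional

# Partial list: only categories named in the article, not the full set.
ALLOWED = {"pool", "car", "hvac", "cooking", "gutters", "electric", "appliances"}


def parse_category(raw: str) -> Optional[str]:
    """Return the category if the model's answer is on the list, else None."""
    answer = raw.strip().strip('"').lower()
    return answer if answer in ALLOWED else None


def score(predictions, expected):
    """Count exact matches; an off-list answer can never count as correct."""
    correct = sum(parse_category(p) == e for p, e in zip(predictions, expected))
    return correct, len(expected)


print(parse_category("apartments"))  # off-list answer is rejected
print(score(["pool", "apartments", "HVAC "], ["pool", "hvac", "hvac"]))
```

Rejecting off-list answers outright, rather than guessing the nearest label, keeps the score honest about how often the model strays.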

Fine-tune number one, using Unsloth with QLoRA on about 850 entries split 70/15/15, lifted the score to 104 out of 131 (79%). Fine-tune number two (the same data and the same method, but with each category swapped for an opaque two-letter code: AA, BB, CC, and so on) reached 120 out of 131 (92%).
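A 70/15/15 split can be sketched as a seeded shuffle followed by two cuts. The helper name, the seed and the placeholder rows below are assumptions for illustration; the article does not say how the split was actually produced, and 850 stands in for "about 850".

```python
import random


def split_dataset(rows, train_pct=70, val_pct=15, seed=42):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100  # integer maths avoids float rounding
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])         # remainder becomes the test set


rows = [(f"question {i}", "pool") for i in range(850)]  # placeholder rows
train_rows, val_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(val_rows), len(test_rows))
```

Fixing the seed means both fine-tunes can be compared on an identical test set, which is what makes the 79% and 92% figures comparable.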

10% → 92% on the same 131-question test set: prompt-only baseline, then a first fine-tune, then a second fine-tune with opaque codes — all on a 600-million-parameter model small enough to run on a laptop.
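The percentages follow directly from the raw counts reported above. A small sketch to reproduce them (the dictionary keys are descriptive labels chosen here, not names from the original post):

```python
TEST_SET_SIZE = 131

# Correct answers out of 131, as reported in the article.
results = {
    "prompt-only baseline": 13,
    "fine-tune, readable labels": 104,
    "fine-tune, opaque codes": 120,
}


def percent(correct, total=TEST_SET_SIZE):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


for name, correct in results.items():
    print(f"{name}: {correct}/{TEST_SET_SIZE} = {percent(correct)}%")
```

Put another way, the number of wrong answers fell from 118 to 27 to 11.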

Why this is the lesson for a UK small firm

The lesson is not “Qwen 3:0.6B is the best classifier ever”. It is that for narrow, repetitive classification (routing support tickets, tagging inbound emails, sorting enquiries by department, screening job applications, flagging supplier invoices) prompting a big cloud model is the wrong tool. The fine-tuned 600M-parameter model scored 92% on a task where the same model, prompt-only, managed 10%. A frontier model would also score well, but only with per-call fees, a dependency on a third-party API, and whatever latency and rate limits come with the plan.

A local fine-tune flips three of those dials:

Cost. After the one-off training run, inference is electricity on hardware you already own. A 0.6B-parameter model runs on a small office server or a spare laptop.
Privacy. Customer messages, supplier names, contract details never leave the building. That is the line you put in front of a sceptical partner or DPO.
Reliability. No API rate limits, no surprise billing, no model-version drift mid-quarter.

The toolkit is cheap to try. Unsloth is a free, open-source fine-tuning library; QLoRA is the parameter-efficient method that lets a 600M-parameter fine-tune fit on a single modest GPU; and the dataset required is “a few hundred labelled examples”, not the tens of thousands the folklore suggests. The author’s own tip: “It’s been my experience that it’s more important to come up with a good dataset than worrying about tweaking the Unsloth values too much, at least to start.”

The wrinkle worth knowing

The most interesting finding is buried in the middle of the post. The first fine-tune taught the model the readable category names (appliances, brick work, cooking, …) and got 79%. Helgevold suspected the model was getting confused by semantically overlapping labels, water-related ones especially, where pool, water heater and fountain share a root concept. The fix was not more data and not better hyperparameters. It was replacing the readable labels with fixed, non-overlapping two-letter codes. Accuracy jumped to 92%. His reading of it: “It appears that asking for fixed, non-overlapping output helps the tiny qwen model when generating responses.”

Both fine-tunes used Unsloth with QLoRA on the same Qwen 3:0.6B base model, the same around-850-entry dataset, and the same 70/15/15 split. The only change was the label format the model was trained to emit:

First fine-tune: readable category strings — hvac, gutters, pool, water heater — drawn from a fixed list of 18 household categories.
Second fine-tune: opaque uppercase two-letter codes (AA, BB, CC …), one per category, with the model trained to emit exactly one code per question.
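The label swap itself is a small preprocessing step. The sketch below assumes doubled-letter codes assigned in list order, which matches the AA, BB, CC pattern but is otherwise a guess: the article does not say which code maps to which category, and `CATEGORIES` here is a partial, illustrative list rather than the full 18.

```python
import string

# Partial, illustrative list; the article's full set has 18 categories.
CATEGORIES = ["hvac", "gutters", "pool", "water heater"]


def make_codes(categories):
    """Assign AA, BB, CC, ... in list order (supports up to 26 categories)."""
    if len(categories) > len(string.ascii_uppercase):
        raise ValueError("more categories than doubled-letter codes")
    return {cat: letter * 2
            for cat, letter in zip(categories, string.ascii_uppercase)}


LABEL_TO_CODE = make_codes(CATEGORIES)
CODE_TO_LABEL = {code: label for label, code in LABEL_TO_CODE.items()}


def encode_row(question, label):
    """Rewrite a training row so the target is a code, not a readable label."""
    return question, LABEL_TO_CODE[label]


def decode_answer(raw):
    """Accept exactly one known code; anything else comes back as None."""
    return CODE_TO_LABEL.get(raw.strip())


print(encode_row("Who serviced the furnace?", "hvac"))
print(decode_answer("CC\n"), decode_answer("ac"))
```

Decoding through a fixed lookup table means a near-miss answer is rejected outright instead of being mistaken for a valid category.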

Held-out test score: 79% with readable strings, 92% with opaque codes. Fixed, non-overlapping output formats sidestep the model’s tendency to fragment a known label into a near-miss (ac for hvac) and remove its inclination to “help” by inventing plausible-sounding category names.

QLoRA, in one line, keeps the base model frozen in 4-bit quantised form and trains a small set of low-rank adapter weights on top of it rather than updating the full parameter set, which is what makes a 600M-parameter fine-tune fit on a single modest GPU. Unsloth wraps QLoRA with sensible defaults and a one-line export path to local runtimes.
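To see why low-rank adapters are cheap, consider a single weight matrix of shape d_out × d_in. A rank-r adapter replaces a full update with two thin factors holding r × (d_out + d_in) trainable values instead of d_out × d_in. The layer size and rank below are hypothetical round numbers chosen for illustration, not Qwen's actual dimensions or the settings used in the post.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Values in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    """Values in the full weight matrix that a conventional fine-tune would update."""
    return d_out * d_in


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 1024, 1024, 16
adapter = lora_trainable_params(d_out, d_in, rank)
full = full_params(d_out, d_in)
print(f"adapter: {adapter:,}  full: {full:,}  ratio: {adapter / full:.1%}")
```

The saving grows with layer size, because the adapter count scales with the sum of the two dimensions while the full matrix scales with their product.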

The wider point is that fine-tuning is partly a labelling problem. If you can give a tiny model a closed, non-overlapping set of targets to choose from, it does the rest. Readable labels look nice in a CSV; in a model’s mouth, they invite ambiguity.

How to try it this week

For a UK firm with a repetitive classification job — a shared inbox, a help-desk queue, a daily flow of supplier invoices or job applications — the path is shorter than the folklore suggests.

Pull together a few hundred labelled examples. A CSV of question, label is enough. Quality matters more than quantity: spend an afternoon curating. Include the awkward cases.
Pick a tiny base model that runs locally. Qwen 3:0.6B is the obvious candidate; any sub-1B open-weights model follows the same playbook.
Use Unsloth with QLoRA. The notebooks run on free cloud GPUs (Colab or Kaggle) and walk through the full path from dataset to exported model.
Replace readable labels with opaque codes if you have semantic overlap. Test both. Codes win when readable labels share a root concept.
Export and ship locally. Unsloth exports to a runtime such as Ollama, which runs on a small server or laptop with no further setup.
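The first and last steps can be sketched with the standard library alone. The example assumes a CSV with `question` and `label` columns and a locally running Ollama server on its default port; the model name `my-classifier` is a placeholder, and the helper names are hypothetical. Only the request is built here; `classify` is defined but not called, since it needs a live server.

```python
import csv
import io
import json
import urllib.request
from collections import Counter


def load_examples(csv_text):
    """Parse 'question,label' rows and skip rows with an empty question."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(r["question"].strip(), r["label"].strip())
            for r in reader if r["question"].strip()]


def build_request(question, model="my-classifier", host="http://localhost:11434"):
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": question, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def classify(question, **kwargs):
    """Send the question to the local server and return its text reply."""
    with urllib.request.urlopen(build_request(question, **kwargs)) as resp:
        return json.loads(resp.read())["response"].strip()


sample = ("question,label\n"
          "Who cleaned the gutters?,gutters\n"
          "When was the pool pump replaced?,pool\n")
examples = load_examples(sample)
print(Counter(label for _, label in examples))  # quick check of label balance
```

Counting labels before training is a cheap way to spot categories with too few examples, which is where an afternoon of curation tends to pay off.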

The cost of finding out is one afternoon and a free-tier GPU; the upside is a classifier that runs on kit you own, never phones home, and never sends a usage bill.
