How We Built an Agent-Run News Site in 24 Hours — a Full Technical Case Study

This site was scaffolded, staffed and put on an autonomous publishing schedule in the 24 hours after our founder pointed Claude at an empty folder. That story — the speed run — is below, unchanged, because it is true. But a build-log frozen on day one would be the opposite of what we stand for. So this piece is now a living one: updated to describe how the newsroom actually runs today, after we did the thing we should have done from the start and moved it onto a real agent.

The honest correction first. For the first fortnight, “Hermes” was a flattering name for a pile of our own Node scripts that called a model and committed files. It worked, but the agent platform we kept saying we ran on was sitting idle beside it. We have since fixed that: the newsroom now genuinely runs on the NousResearch Hermes agent — nicknamed Mini — which operates the repository itself, through skills, behind gates. Here is how.

Three tiers: who actually does what

The reusable idea, and the one we got right on day one: don’t use one model for everything, and don’t let any model publish unsupervised.

The Governor — Claude, via Claude Code. The expensive, capable model does the work that punishes mistakes: architecture, writing the skills and the editorial rulebook, reviewing output, and fixing failures. It supervises; it does not run the daily desk.
The Publisher — “Mini”, the Hermes agent running MiniMax-M3. A frontier-grade open-weight model on our own server does the high-volume work: drafting, running the gates, staging. It is the agent now — not a script — with its own tools, memory and scheduling.
The human. Commissions pieces, approves anything that goes live, and owns the standards. Minutes a day, on the one decision that is never automated.

Who runs what. The Governor (Claude) sets the guardrails and skills and reviews output; Mini (the Hermes agent) drafts and stages on the VPS; the Editor approves over Slack. Git is the audit trail; Vercel ships to readers.

The two agents never talk directly. They share a git repository — Mini commits, Vercel rebuilds on push, and every action either agent takes is a commit a human can read, diff and revert. Git as the bridge means the whole operation has an audit trail by construction.

How a story actually flows now

A piece moves through a fixed pipeline, and the interesting part is that two of the steps can stop it dead, and a third hands it to a human.

The loop. Candidates → draft → two hard gates → staged preview → the human approval gate → live. Anything the gates reject, or the editor changes, feeds back into the skills and the rulebook so the next draft is better.
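To make the shape of that loop concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code in our repository: the `Draft` class, the `validate_gate` threshold and the function names are all hypothetical. What it tries to show is the two properties that matter: a gate that fails or crashes stops the piece (fail closed), and only an explicit "publish" reply on a staged draft can make it live.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Draft:
    slug: str
    body: str
    status: str = "draft"  # draft -> staging -> live, or draft -> rejected
    notes: List[str] = field(default_factory=list)


# A gate returns a list of problems; an empty list means the draft passes.
Gate = Callable[[Draft], List[str]]


def validate_gate(draft: Draft, min_words: int = 200) -> List[str]:
    """Hypothetical structural check; the 200-word floor is an assumption."""
    words = len(draft.body.split())
    return [] if words >= min_words else [f"too short: {words} words"]


def run_gates(draft: Draft, gates: List[Gate]) -> Draft:
    """Run gates in order and stop at the first failure (fail closed)."""
    for gate in gates:
        try:
            problems = gate(draft)
        except Exception as exc:
            # A crashing gate counts as a failure, never as a pass.
            name = getattr(gate, "__name__", "gate")
            problems = [f"{name} crashed: {exc}"]
        if problems:
            draft.status = "rejected"
            draft.notes.extend(problems)
            return draft
    draft.status = "staging"  # preview only, never live
    return draft


def apply_review(draft: Draft, reply: str) -> Draft:
    """Only an explicit 'publish' on a staged draft flips it live."""
    if draft.status != "staging":
        return draft  # rejected or unstaged drafts cannot be published
    if reply.strip().lower() == "publish":
        draft.status = "live"
    else:
        draft.notes.append(f"editor: {reply.strip()}")
    return draft
```

In this sketch, the editor's change requests land in `notes`, which stands in for the feedback that goes back into the skills and the rulebook.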

Walking it through, as it ran for the most recent piece:

Candidates. A deterministic ranker pulls and scores stories from two dozen trusted feeds. No model judgement here — just freshness, source quality and “have we covered this”.
Draft. Mini drafts the chosen story against the editorial rulebook (EDITORIAL.md), which it reads fresh every run. This is the constrained step — it works on a specific, sourced brief, which is the only mode a cheap model is reliable in.
Validate gate. A script checks structure, taxonomy, word count and that every cited link resolves. Fail closed.
Fact-check gate. A sub-agent fetches each cited source and confirms every quote and statistic actually appears in it. On the last run it caught a real quote attributed to the wrong source and blocked the piece until it was fixed. That is the gate earning its place.
Staged. The approved-by-machines draft is committed as status: staging, which puts it on a preview site but never on the live one.
The human gate. Mini posts a review card to Slack with the preview link. A person reads it and replies publish — or asks for changes. Only that reply flips the piece live.
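The candidate step is the easiest one to picture in code, because it involves no model at all. The sketch below is a guess at what such a ranker could look like, not our actual scoring: the exponential freshness decay, the 24-hour half-life, the per-feed quality weights and every name in it are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Candidate:
    url: str
    title: str
    feed: str
    published: datetime  # timezone-aware


def score(c: Candidate, now: datetime, feed_quality: Dict[str, float],
          covered_urls: Set[str], half_life_hours: float = 24.0) -> float:
    """Freshness times source quality; zero if already covered."""
    if c.url in covered_urls:
        return 0.0
    age_hours = max((now - c.published).total_seconds() / 3600.0, 0.0)
    freshness = 0.5 ** (age_hours / half_life_hours)  # halves every half-life
    quality = feed_quality.get(c.feed, 0.0)  # unknown feeds score zero
    return freshness * quality


def rank(cands: Iterable[Candidate], now: datetime,
         feed_quality: Dict[str, float], covered_urls: Set[str],
         top_n: int = 5) -> List[Candidate]:
    """Return the top_n candidates, highest score first."""
    scored = [(score(c, now, feed_quality, covered_urls), c) for c in cands]
    kept = [(s, c) for s, c in scored if s > 0.0]
    # Break ties on URL so the same inputs always give the same order.
    kept.sort(key=lambda sc: (-sc[0], sc[1].url))
    return [c for _, c in kept[:top_n]]
```

Because `now` is passed in rather than read from the clock, the same inputs always produce the same shortlist, which is what makes the step auditable.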

The honest limits

We will not pretend this is hands-off magic, because it isn’t, and the failure that taught us most is recent: asked to go and research the web for story ideas with no constraints, MiniMax-M3 spun for eight minutes, called no tools, and produced nothing. Open-ended autonomy on a cheap model is unreliable. So the architecture deliberately does not depend on it. The model is given a ranked shortlist, not a blank page; a specific brief, not a vague goal; and two gates plus a human stand between it and a reader. Autonomy here is earned one constrained step at a time, not assumed.

The cheap model is a fast, tireless drafter that occasionally makes things up. The whole design exists to make that safe: deterministic inputs, hard gates, a human on the button.

The five things that broke — and why that’s the good news

Transparency clause: it did not all work first time. The early failures, each now a permanent guardrail:

The writer invented a statistic. A plausible price comparison that appeared in no source. The fact-check gate now catches that class of error every run — proven again last week.
A formatting quirk broke the build. A fancy metadata structure the schema rejected. Fix: stricter validation before publish, plus verify-and-rollback — a bad deploy removes itself within minutes.
YouTube blocked the server. Datacentre IPs hit a bot wall. Fix: gated sources are fetched from outside and the text shipped in, with the original URL kept as the citation of record.
The model thought itself to death. A big synthesis consumed the entire output budget and returned empty. Fix: an adaptive budget — and, now, the hard rule that the model is never asked to free-range.
It invented its own category names. Near-miss taxonomy labels. Fix: explicit allow-lists in the writing contract plus a mechanical normaliser.
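The last fix in that list, an allow-list plus a mechanical normaliser, is small enough to sketch. This is an illustrative version, not our implementation: the two category names are simply the ones this page is filed under, and the fuzzy-match cutoff of 0.8 is an assumption. It uses `difflib.get_close_matches` from the Python standard library to map near-miss labels onto an allowed one, and returns `None` when nothing is close enough so the caller can fail the draft rather than guess.

```python
import difflib
import re
from typing import List, Optional

# Illustrative allow-list; a real one would hold the full taxonomy.
ALLOWED_CATEGORIES = ["Case Studies", "Behind the Scenes"]


def _key(label: str) -> str:
    """Lower-case and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.casefold()).strip()


def normalise_category(label: str,
                       allowed: Optional[List[str]] = None,
                       cutoff: float = 0.8) -> Optional[str]:
    """Map a model-written label to an allowed category, or None."""
    allowed = ALLOWED_CATEGORIES if allowed is None else allowed
    by_key = {_key(a): a for a in allowed}
    key = _key(label)
    if key in by_key:  # exact match after normalisation
        return by_key[key]
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None
```

With this sketch, a label such as `case_studies` or `Behind the Scene` resolves to the canonical name, while an unrelated label such as `Sports` comes back as `None` and can be treated as a validation failure.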

Every one of those became a rule. That list is the real argument for the supervisor model: a cheap workhorse, plus hard gates, plus an expensive reviewer, caught each failure before a reader saw it.

What it costs, and what to steal

The stack is deliberately boring: a €10/month VPS, pay-per-token MiniMax usage (pennies per piece), free hosting and analytics tiers, and the founder’s existing Claude subscription for the Governor. Tens of pounds a month, all in. For a costed, build-it-yourself version of this exact stack, see A business assistant for under £50 a month.

The pattern transfers to almost any repetitive knowledge workflow in a small firm:

Split the roles — a frontier model to architect and review, a cheap one for volume. Paying premium rates for grunt work is the most common agentic-AI budgeting mistake.
Put gates between the agent and the world — schema checks, source checks, rate caps, in code, not in a prompt. Prompts are requests; gates are rules.
Make every action a commit — audit, diff and one-command rollback for free.
Keep a human on the ship-it button for anything reputational, and feed every correction back into the rulebook.
Don’t trust open-ended autonomy from a cheap model — constrain the inputs and let the gates, not the goodwill, keep it honest.
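As an example of a gate written in code rather than in a prompt, here is a minimal sketch of a source check. It is deliberately simpler than the fact-check sub-agent described above and is not our implementation: it assumes the cited pages have already been fetched into a dictionary of URL to text, and it only tests whether each quoted string appears verbatim (after normalising whitespace and quote marks) in the page it is attributed to. All names are hypothetical.

```python
import re
import unicodedata
from typing import Dict, List, Tuple

_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"',
                         "\u2018": "'", "\u2019": "'"})


def _norm(text: str) -> str:
    """Normalise unicode forms, curly quotes, whitespace and case."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip().casefold()


def check_claims(claims: List[Tuple[str, str]],
                 sources: Dict[str, str]) -> List[str]:
    """claims are (quoted text, cited URL). Empty result means pass."""
    pages = {url: _norm(text) for url, text in sources.items()}
    problems: List[str] = []
    for quote, url in claims:
        needle = _norm(quote)
        if url not in pages:
            # Fail closed: an unfetched source is a failure, not a skip.
            problems.append(f"source not fetched: {url}")
        elif needle not in pages[url]:
            elsewhere = sorted(u for u, t in pages.items()
                               if u != url and needle in t)
            if elsewhere:
                problems.append(
                    f"quote attributed to {url} but found in {elsewhere[0]}")
            else:
                problems.append(f"quote not found in {url}: {quote!r}")
    return problems
```

The "found in another source" branch is there because a misattributed quote is a different failure from an invented one, and the fix for each is different.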

This whole account still covers a young operation. The running experiment is whether quality holds at this cadence for months — now that the agent, not the scaffolding, is the one doing the work. We will keep this page updated, failures included.

Sources & quotes

Every quotation in this article is verbatim from a named source. It's part of how we keep an AI-run newsroom honest.

Filed under Case Studies · Behind the Scenes
