Frontier AI lost a finance test

84.7%: a Qwen3-235B fine-tuned on Bridgewater investor judgement, versus 78.2% for the best frontier model tested, at about one-fourteenth the runtime cost.

Bridgewater’s AIA Labs and Thinking Machines Lab, the AI startup founded by former OpenAI chief technology officer Mira Murati, say a Qwen3-235B model they fine-tuned on the hedge fund’s internal investor judgement now beats frontier AI at the work analysts do all day. The result is theirs — both firms have a commercial interest in the headline — but the underlying pattern keeps replicating wherever the right answer lives in private judgement rather than public text.

In their joint report published on Friday, the team recorded 84.7% accuracy on six triage tasks drawn from an investor’s actual routine, against 78.2% for the strongest frontier system they tested. The work wasn’t summarisation or report writing. It was the constant stream of small, repeated judgement calls that surface in a real inbox. Is this article relevant to the executive? Does the central bank’s language signal the next rate move? Investors make these calls almost without thinking, and they can barely explain how.

The frontier ceiling

Off-the-shelf frontier models were tried first. A basic prompt (flag anything that might affect portfolio construction) left Gemini, Claude and GPT variants hovering around 50% accuracy. Carefully written expert instructions, plus a three-tier rating scheme that distinguished genuinely relevant items from merely topical ones, lifted accuracy into the mid-70s. That still fell short of the 80% threshold the team had set for a deployment they would trust.

Per the report, GPT 5.4 costs around 43% more than GPT 5.2 to run but is only marginally more accurate. Newer, larger frontier models are improving in the abstract; on the messy tasks that decide a workday, the gains are thin.

The fine-tune that broke through

The fix wasn’t a bigger model. It was the labels. The team took the open-weight Qwen3-235B (a publicly downloadable model whose learned settings anyone can retrain) and fine-tuned it on examples labelled by senior Bridgewater investors — the same people whose judgement the AI was meant to replace on the easy 80% of cases.

Building those labels at scale was the hard part. Cheap outside contractors labelled the documents first, but a large share of those labels were wrong. Rather than pay senior investors to review the whole dataset, the team trained a first model on the noisy labels, ran it back over the same documents, and kept only the cases where its labels disagreed with the contractors’ labels. Those disputed cases, the likeliest errors, went up the chain for correction.

The same trick is replicable elsewhere: pay for the 100 most contested cases, not the 10,000 obvious ones. The fine-tuned model learned directly from the cleaner set and, per the report, ran at about one-fourteenth the cost of the best frontier system tested.

The base model is Qwen3-235B, an open-weight release from Alibaba’s Qwen team. The fine-tuning ran on Tinker, Thinking Machines Lab’s managed retraining platform — a paid service that exposes its training stack to outside customers.

The task set covered six investor workflows: relevance to a specific executive, hawk-or-dove central-bank signals (a “hawk” pushes rates up; a “dove” pushes them down), topic tagging, novelty, sentiment direction and a one-line summary. Documents included news, broker research, central-bank minutes and a sample of internal emails.

The contested-label pipeline ran in three passes: contractors labelled the full set, a first model trained on those noisy labels scored it again, and only the disagreements were routed up to senior investors for the final label. The remaining documents kept the contractor label. The full methodology and results are published in the joint report. Secondary coverage is at the Decoder, and the ChipOS brief reframes the result as an operating-system question.
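The three passes can be sketched in a few lines of Python. This is an illustration of the idea, not the team's code: the `Doc` record, the label names and the keyword-based `toy_student` are all hypothetical stand-ins, and a real student would be a model trained on the contractor labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Doc:
    doc_id: str
    text: str
    contractor_label: str  # pass 1: cheap, noisy label


def find_disputed(docs: List[Doc], student_predict: Callable[[str], str]) -> List[str]:
    """Pass 2: ids of documents where the student disagrees with the contractor."""
    return [d.doc_id for d in docs if student_predict(d.text) != d.contractor_label]


def merge_labels(docs: List[Doc], senior_labels: Dict[str, str]) -> Dict[str, str]:
    """Pass 3: senior label where one was supplied, contractor label otherwise."""
    return {d.doc_id: senior_labels.get(d.doc_id, d.contractor_label) for d in docs}


def toy_student(text: str) -> str:
    # Hypothetical stand-in for a model trained on the noisy labels.
    return "relevant" if "rate" in text.lower() else "not_relevant"


docs = [
    Doc("a", "Central bank signals rate hike", "relevant"),
    Doc("b", "Office party on Friday", "relevant"),  # contractor error
    Doc("c", "Cafeteria menu update", "not_relevant"),
]

disputed = find_disputed(docs, toy_student)  # only these go to senior investors
senior_labels = {"b": "not_relevant"}        # seniors review the disputed ids only
final = merge_labels(docs, senior_labels)
print(disputed, final)
```

In this toy run only one of the three documents reaches a senior reviewer. The scheme cannot catch errors where the student and the contractor are wrong in the same way, which is one reason the disputed set is a list of likely errors rather than all of them.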

What it means for the field

This isn’t a system you can buy. Bridgewater’s investors, its labelled documents, and the Tinker fine-tuning service they used are theirs. But the finding travels — three frames to carry into your next AI conversation.

The frontier ceiling is a data ceiling. Public models are trained on public text, so where the answer lives in private expertise, the ceiling falls fast. CFOs have been hitting the same wall since 2024: it surfaced early in the AI in Finance newsletter, when finance teams quietly walked away from demos, underwhelmed by their own numbers.
Your firm’s labels are an asset you can capture. A few hundred well-chosen examples inside your business — this email matters, that one doesn’t — are the seed of the same kind of fine-tune at a smaller scale, on a model you can run on infrastructure you already control.
Watch the cost-per-decision, not the scoreboard. Frontier tiers keep climbing in capability; the metric that determines your ROI is whether the AI was right often enough to be worth the spend, not whether a newer tier exists.
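To make the third frame concrete, here is a rough calculation of cost per correct decision using the report's headline figures. It rests on assumptions: the frontier model's cost is normalised to 1.0 per decision because the report gives only a ratio, "about one-fourteenth" is treated as exactly 1/14, and the `cost_per_correct` helper is hypothetical.

```python
def cost_per_correct(cost_per_decision: float, accuracy: float) -> float:
    """Expected spend for each correct decision."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_decision / accuracy


# Assumption: frontier cost normalised to 1.0 per decision.
frontier = cost_per_correct(1.0, 0.782)
fine_tuned = cost_per_correct(1.0 / 14, 0.847)

print(f"frontier:   {frontier:.3f}")
print(f"fine-tuned: {fine_tuned:.3f}")
print(f"ratio:      {frontier / fine_tuned:.1f}x")
```

On these numbers the frontier system costs roughly 15 times as much per correct call. The sketch ignores the cost of wrong decisions, which in practice may matter more than the runtime bill.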

One firm’s inbox, not a benchmark

Two limits, named. The accuracy numbers come from the teams who built the system, not independent testing — the order of magnitude is plausible because the mechanism keeps replicating in other fields, but the exact 84.7% is theirs to defend. And six tasks drawn from one firm’s inbox is a slice of finance, not a benchmark: the lesson about private data generalises, the specific score does not.

The headline is a hedge-fund model. The lesson is generic, and it argues against the assumption that a bigger subscription tier will solve the AI work you most need done.
