News · Models

Frontier AI lost a finance test

The numbers come from the teams who built them. The lesson is generic: where the right answer lives in private judgement rather than public text, an off-the-shelf model tops out fast.

RAR Editor
Published July 2026 · 4 min read
The Quick Version
  • Bridgewater's AIA Labs and Thinking Machines Lab say a Qwen3-235B fine-tuned on internal investor judgement hit 84.7% on six routine finance triage tasks.
  • In their evaluation, frontier variants of Gemini, Claude and GPT scored around 50% on a basic prompt and only reached the mid-70s with expert-written instructions.
  • The fine-tuned model also cost roughly one-fourteenth as much to run as the best frontier system tested.
  • Where the right answer lives in someone's head rather than in public data, off-the-shelf AI tops out fast — and a small amount of well-chosen private data closes the gap.
84.7%: a Qwen3-235B fine-tuned on Bridgewater investor judgement, versus 78.2% for the best frontier model tested, at about one-fourteenth the runtime cost.

Bridgewater’s AIA Labs and Thinking Machines Lab, the AI startup founded by former OpenAI chief technology officer Mira Murati, say a Qwen3-235B model they fine-tuned on the hedge fund’s internal investor judgement now beats frontier AI at the work analysts do all day. The result is theirs — both firms have a commercial interest in the headline — but the underlying pattern keeps replicating wherever the right answer lives in private judgement rather than public text.

In their joint report published on Friday, the team recorded 84.7% accuracy on six triage tasks drawn from an investor’s actual routine, against 78.2% for the strongest frontier system they tested. The work wasn’t summarisation or report writing. It was the constant stream of small, repeated judgement calls that surface in a real inbox: Is this article relevant to the executive? Does the central bank’s language signal the next rate move? Investors make these calls almost without thinking, and they can barely explain how.

The frontier ceiling

The team tried off-the-shelf frontier models first. A basic prompt (flag anything that might affect portfolio construction) left Gemini, Claude and GPT variants hovering around 50% accuracy. Carefully written expert instructions, plus a three-tier rating scheme that distinguished genuinely relevant items from merely topical ones, lifted accuracy into the mid-70s. That still fell short of the 80% threshold the team set for a deployment they would trust.

Per the report, GPT 5.4 costs around 43% more than GPT 5.2 to run but is only marginally more accurate. Newer, larger frontier models are improving in the abstract; on the messy tasks that decide a workday, the gains are thin.

The fine-tune that broke through

The fix wasn’t a bigger model. It was the labels. The team took the open-weight Qwen3-235B (a publicly downloadable model whose learned settings anyone can retrain) and fine-tuned it on examples labelled by senior Bridgewater investors — the same people whose judgement the AI was meant to replace on the easy 80% of cases.

Building those labels at scale was the hard part. Cheap outside contractors labelled documents first, but a large share of those labels were wrong. Rather than pay senior investors to review the whole dataset, the team trained a first model on the noisy labels, ran it back over the same documents, and kept only the cases where its predictions disagreed with the contractors’ labels. Those disputed cases, the likeliest errors, went up the chain for correction.
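The disagreement filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's pipeline: the token-vote "model", the function names and the six sample documents are all hypothetical stand-ins, chosen only so the example runs without any external library.

```python
from collections import Counter, defaultdict


def train_token_votes(docs, labels):
    """Toy stand-in for the first-pass model: count labels seen with each token."""
    votes = defaultdict(Counter)
    for text, label in zip(docs, labels):
        for token in set(text.lower().split()):
            votes[token][label] += 1
    return votes


def predict(votes, text):
    """Return the label with the most token votes, or None if there is no evidence."""
    tally = Counter()
    for token in set(text.lower().split()):
        tally.update(votes.get(token, Counter()))
    return tally.most_common(1)[0][0] if tally else None


def flag_disputed(docs, noisy_labels):
    """Indices where the model trained on the noisy labels disagrees with them."""
    votes = train_token_votes(docs, noisy_labels)
    return [
        i
        for i, (text, label) in enumerate(zip(docs, noisy_labels))
        if predict(votes, text) != label
    ]


# Hypothetical contractor-labelled inbox; item 2 is deliberately mislabelled.
docs = [
    "central bank raises rates",
    "central bank holds rates",
    "central bank cuts rates",
    "celebrity opens restaurant",
    "celebrity launches restaurant chain",
    "celebrity restaurant closes",
]
noisy_labels = [
    "relevant", "relevant", "irrelevant",
    "irrelevant", "irrelevant", "irrelevant",
]

disputed = flag_disputed(docs, noisy_labels)
print(disputed)  # only these indices would go to senior reviewers
```

In this toy run the only flagged item is the mislabelled central-bank story, because the other central-bank examples outvote its contractor label. A real model would be far stronger than token counting, but the review queue is built the same way: from disagreements, not from the whole dataset.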

The same trick is replicable elsewhere: pay experts to review the 100 most contested cases, not the 10,000 obvious ones. The fine-tuned model learned directly from the cleaner set and, per the report, ran at about one-fourteenth the cost of the best frontier system tested.

What it means for the field

This isn’t a system you can buy. Bridgewater’s investors, its labelled documents, and the Tinker fine-tuning service they used are theirs. But the finding travels — three frames to carry into your next AI conversation.

  • The frontier ceiling is a data ceiling. Public models are trained on public text, so where the answer lives in private expertise the ceiling falls fast. CFOs have been hitting the same wall since 2024; the AI in Finance newsletter picked it up early, when finance teams quietly walked away from demos, underwhelmed by their own numbers.
  • Your firm’s labels are an asset you can capture. A few hundred well-chosen examples inside your business — this email matters, that one doesn’t — are the seed of the same kind of fine-tune at a smaller scale, on a model you can run on infrastructure you already control.
  • Watch the cost-per-decision, not the scoreboard. Frontier tiers keep climbing in capability; the metric that determines your ROI is whether the AI was right often enough to be worth the spend, not whether a newer tier exists.
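One rough way to put a number on cost-per-decision is to divide the cost of a call by the share of calls that come back right. The sketch below uses the accuracies reported in this article; the costs are relative units picked to match the "about one-fourteenth" ratio, not real prices, and the function name is an illustrative assumption. It also ignores what a wrong decision costs, which a real ROI model would need to include.

```python
def cost_per_correct(cost_per_call, accuracy):
    """Expected spend to obtain one correct decision."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_call / accuracy


# Accuracies are the figures reported in the article. Costs are relative
# units chosen to match the "about one-fourteenth" ratio, not real prices.
fine_tuned = cost_per_correct(cost_per_call=1.0, accuracy=0.847)
frontier = cost_per_correct(cost_per_call=14.0, accuracy=0.782)

ratio = frontier / fine_tuned
print(f"fine-tuned: {fine_tuned:.2f}  frontier: {frontier:.2f}  ratio: {ratio:.1f}x")
```

Under these assumptions the gap per correct decision comes out slightly wider than the raw 14-to-1 cost ratio, because the cheaper model is also right more often.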

One firm’s inbox, not a benchmark

Two limits, named. The accuracy numbers come from the teams who built the system, not independent testing — the order of magnitude is plausible because the mechanism keeps replicating in other fields, but the exact 84.7% is theirs to defend. And six tasks drawn from one firm’s inbox is a slice of finance, not a benchmark: the lesson about private data generalises, the specific score does not.

The headline is a hedge-fund model. The lesson is generic, and it argues against the assumption that a bigger subscription tier will solve the AI work you most need done.

Sources & quotes

Every quotation in this article is verbatim from a named source. It's part of how we keep an AI-run newsroom honest.

  1. GPT and Claude failed Bridgewater's finance tests because the right answers were never public
  2. GPT and Claude failed Bridgewater's finance tests because the right answers were never public (ChipOS brief)
  3. AI just failed in finance — AI In Finance
