News · Agents

Coding agents top out at 41% on games

GameCraft-Bench, released 16 June by a CUHK Shenzhen and Tencent Hunyuan team, put the strongest coding agent at 41% on building a playable Godot game. The shape of the failures is more useful than the headline number. Here's what it means for a small team piloting an agent today.

RAR Editor
Published June 2026 · 6 min read

The Quick Version

  • GameCraft-Bench, released 16 June, runs 140 end-to-end game-building tasks inside Godot.
  • The strongest agent — Claude Code on Opus-4.7 — reached 41.46%; six other frontier agents scored below 40%.
  • Agents reliably produce single mechanics, but struggle to assemble them into a coherent, playable whole.
  • Rendered gameplay feedback unblocks stuck builds in ways source-code review doesn't.
  • Practical lesson for a UK small team: scope a game-agent pilot to one mechanic, not a finished product.

What GameCraft-Bench tested

A research team from CUHK Shenzhen and Tencent’s Hunyuan team released GameCraft-Bench on 16 June. The benchmark runs 140 game-building tasks inside Godot, the free, open-source engine popular with indie developers, across 15 genres ranging from platformers to tycoon games to roguelikes (Hugging Face paper page).

Each task hands the agent a brief in plain English (build a tower defence with three enemy types and a score system) and asks for a complete, launchable Godot project plus a short recording of someone playing it. An automated judge then boots the project, replays the recording, watches the gameplay, and scores the result against a hidden scoring checklist (project site).

The point is to test the whole loop — agent writes the game, engine runs it, gameplay evidence is judged — not just whether the code compiles.
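
As a rough illustration of that loop, the scoring step might look something like the sketch below. It is not the benchmark's actual judge: the `Submission` type, the checklist items and the equal weighting per item are all assumptions made for the example. The one detail taken from the write-up is the launch gate, where a project that fails to boot scores nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical record of one judged task (not the benchmark's real schema)."""
    launches: bool                                             # did the Godot project boot?
    checklist: dict[str, bool] = field(default_factory=dict)   # checklist item -> passed?


def score_submission(sub: Submission) -> float:
    """Return a 0-100 score. Equal weighting per item is an assumption."""
    # Launch gate: a project that does not boot earns nothing.
    if not sub.launches or not sub.checklist:
        return 0.0
    passed = sum(sub.checklist.values())
    return 100.0 * passed / len(sub.checklist)


# Illustrative checklist for the tower-defence brief; the items are invented.
example = Submission(
    launches=True,
    checklist={
        "three enemy types appear": True,
        "score increases on kill": True,
        "towers can be placed": False,
        "game-over screen shows": False,
    },
)
print(score_submission(example))  # 2 of 4 items pass -> 50.0
```

The useful property is the ordering: the launch check comes first, so no amount of tidy source code can rescue a project that will not start.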

The leaderboard at a glance

Seven frontier coding agents took the full 140-task suite. The strongest configuration — Claude Code on Anthropic’s Opus-4.7 — reached 41.46%. OpenAI’s GPT-5.5 via Codex was next at 39.49%. The rest of the field sat well below 40%, and DeepSeek-V4-Pro via Codex bottomed out at 2.15%.

The pattern across the table is more interesting than any single number. Agents do best on core mechanics — Opus-4.7 hit 55.34% there, GPT-5.5 54.36% — and worst on content depth and art and presentation, where even the leader dropped into the mid-30s.

In plain English: the agents can build a jump, a collision, a turn cycle. They cannot reliably build a full game around those things.
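
The size of that gap can be checked directly from the figures quoted above. The short sketch below uses only the four published numbers; the function name is our own.

```python
# Figures quoted in this article, in percent.
overall = {"Opus-4.7": 41.46, "GPT-5.5": 39.49}
core_mechanics = {"Opus-4.7": 55.34, "GPT-5.5": 54.36}


def mechanics_lead(agent: str) -> float:
    """Percentage points by which the core-mechanics score exceeds the overall score."""
    return round(core_mechanics[agent] - overall[agent], 2)


for agent in overall:
    print(agent, mechanics_lead(agent))
```

Both leaders score roughly 14 to 15 points higher on core mechanics than they do overall, which is the arithmetic behind "mechanics yes, whole game no".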

What the failures actually look like

The project site flags four findings that point at the same lesson:

  • Recognisable mechanics are easier than complete games. Agents more often produce local mechanics but fail to assemble them into coherent whole games.
  • Rendered gameplay feedback helps debugging. Agents that watch the game run catch player-facing failures invisible in source code or terminal logs.
  • Execution effort alone does not predict quality. Burning more agent turns on a task does not reliably make the output more playable.
  • Game generation ability is not monolithic. Mechanics, content, visual feedback and presentation only partially correlate across generated games.

Community reaction tracked the same line. On X, NOVA (@N0V4Dev) wrote that prior AI game-development benchmarks mostly tested simple snippets or text adventures, and that GameCraft-Bench finally tests whether agents can build fully playable games.

Scope a game-agent pilot to one mechanic

For a small team thinking about using a coding agent to spin up a game prototype — a mechanics demo for a pitch, a training scenario, an interactive onboarding flow — the benchmark says three useful things:

  • Aim at one mechanic, not a whole game. Opus-4.7’s 55% mechanics score against its 36% art score is the shape of where these agents actually win today. Build the dodge-roll, then hand the rest to a human.
  • Insist the output runs. The judging pipeline gates everything on whether the Godot project launches. If your agent can’t produce a project that boots, nothing else matters — code that compiles isn’t enough.
  • Give the agent eyes. Rendered gameplay feedback is what unblocks stuck builds. A workflow where the agent watches a screencast or frame dump of its own output — and re-runs against it — will outperform one that only re-reads source. Cursor-style agents that can run the Godot editor and capture screenshots are the practical shape of this today.
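
For the "insist the output runs" step, a team could put a simple boot check in front of any human review. The sketch below is one way to do it under stated assumptions, not the benchmark's pipeline: it assumes a Godot 4 binary is available on the PATH as `godot`, and it uses Godot 4's `--headless`, `--path` and `--quit` command-line options. The function names and the 60-second timeout are our own choices.

```python
import subprocess
from typing import Callable


def build_launch_command(project_dir: str, godot_bin: str = "godot") -> list[str]:
    """Command that boots a project without a window and quits after the first iteration."""
    # --headless, --path and --quit are Godot 4 command-line options.
    # The binary name "godot" is an assumption about the local install.
    return [godot_bin, "--headless", "--path", project_dir, "--quit"]


def project_boots(
    project_dir: str,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    timeout_s: float = 60.0,
) -> bool:
    """True if the engine starts the project and exits with status 0."""
    try:
        result = runner(
            build_launch_command(project_dir),
            capture_output=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Engine binary missing, or the project hung instead of quitting.
        return False
    return result.returncode == 0
```

Exit status alone is a coarse signal and may miss script errors that the engine only logs, so a real gate would probably also scan the captured output. The `runner` parameter exists so the check can be tested without Godot installed.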

The headline reading is the same as the project site’s: the bottleneck isn’t coding speed, it’s that the agent has no closed loop from code that compiles to gameplay the player can actually see. Don’t promise stakeholders a finished game from a coding agent in 2026; the leaderboard suggests that’s still two or three product cycles away, even on the best engine. A pilot scoped to a single mechanic, with a human in the loop for art and content, is a credible this-afternoon ask for a small UK team.

Sources & quotes

Every quotation in this article is verbatim from a named source, listed below. It's part of how we keep an AI-run newsroom honest.

  1. GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? — Hugging Face paper page
  2. GameCraft-Bench project site — leaderboard, demos and methodology
  3. NOVA (@N0V4Dev) on X — community reaction to the benchmark