Coding agents top out at 41% on games

What GameCraft-Bench tested

A research team from CUHK Shenzhen and Tencent’s Hunyuan released GameCraft-Bench on 16 June. The benchmark runs 140 game-building tasks inside Godot — the free, open-source engine popular with indie developers — across 15 genres from platformers to tycoon games to roguelikes (Hugging Face paper page).

Each task hands the agent a brief in plain English (build a tower defence with three enemy types and a score system) and asks for a complete, launchable Godot project plus a short recording of someone playing it. An automated judge then boots the project, replays the recording, watches the gameplay, and scores the result against a hidden scoring checklist (project site).

The point is to test the whole loop — agent writes the game, engine runs it, gameplay evidence is judged — not just whether the code compiles.

The leaderboard at a glance

Seven frontier coding agents took the full 140-task suite. The strongest configuration — Claude Code on Anthropic’s Opus-4.7 — reached 41.46%. OpenAI’s GPT-5.5 via Codex was next at 39.49%. The rest of the field sat well below 40%, and DeepSeek-V4-Pro via Codex bottomed out at 2.15%.

The pattern across the table is more interesting than any single number. Agents do best on core mechanics — Opus-4.7 hit 55.34% there, GPT-5.5 54.36% — and worst on content depth and art and presentation, where even the leader dropped into the mid-30s.

In plain English: the agents can build a jump, a collision, a turn cycle. They cannot reliably build a full game around those things.

What the failures actually look like

The project site flags four findings that point at the same lesson:

Recognisable mechanics are easier than complete games. Agents more often produce local mechanics but fail to assemble them into coherent whole games.
Rendered gameplay feedback helps debugging. Agents that watch the game run catch player-facing failures invisible in source code or terminal logs.
Execution effort alone does not predict quality. Burning more agent turns on a task does not reliably make the output more playable.
Game generation ability is not monolithic. Mechanics, content, visual feedback and presentation only partially correlate across generated games.

Community reaction tracked the same line. On X, NOVA (@N0V4Dev) contrasted GameCraft-Bench with earlier AI game-development benchmarks:

Most AI game development benchmarks used to focus on simple code snippets or text-based adventures. This approach ignored the complexity of modern game engines and asset management. Now researchers have introduced GameCraft-Bench to test if agents can build fully playable games…
— NOVA (@N0V4Dev) Jun 17, 2026

How the evaluation works under the hood

The judging pipeline is deliberately multimodal. For each task the automated judge first checks whether the Godot project actually launches; a project that fails to launch gets no score. It then replays the agent’s submitted gameplay recording, capturing the run as video and sampled frames. A vision-language model scores that footage against a hidden item list across four categories:

Core mechanics — does the requested gameplay loop work under player input?
Content depth — enough progression, content, state variation to feel like a real game?
Functional visuals — can the player read state, feedback and transitions?
Art and presentation — does it look coherent and appropriately styled?
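The gate-then-score flow above can be sketched in a few lines of Python. This is an illustration only, not the benchmark's code: the `JudgedItem` type, the equal weighting of checklist items and the plain averaging across categories are all assumptions, since the article does not say how the four categories are combined into the headline percentage.

```python
from dataclasses import dataclass

# The four judged categories named in the article.
CATEGORIES = (
    "core_mechanics",
    "content_depth",
    "functional_visuals",
    "art_and_presentation",
)


@dataclass(frozen=True)
class JudgedItem:
    """One hidden checklist item after the judge has reviewed the footage.

    Hypothetical type: the real checklist format is not public in the article.
    """
    category: str
    passed: bool


def score_submission(launched: bool, items: list[JudgedItem]) -> dict[str, float]:
    """Return per-category percentages plus an 'overall' figure.

    Assumption: every checklist item counts equally and 'overall' is the
    unweighted mean of the category scores. The real weighting may differ.
    """
    if not launched:
        # Launch gate: a project that does not boot scores zero everywhere.
        return {name: 0.0 for name in (*CATEGORIES, "overall")}

    scores: dict[str, float] = {}
    for category in CATEGORIES:
        relevant = [item for item in items if item.category == category]
        passed = sum(item.passed for item in relevant)
        scores[category] = 100.0 * passed / len(relevant) if relevant else 0.0
    scores["overall"] = sum(scores[c] for c in CATEGORIES) / len(CATEGORIES)
    return scores
```

The gate comes first by design: under this sketch a submission with strong mechanics but a project that fails to boot still scores zero, which matches the article's description of the pipeline.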

The agent set tested was: Claude Code (Anthropic Opus-4.7 and Xiaomi MiMo-V2.5-Pro); Codex (OpenAI GPT-5.5 and DeepSeek-V4-Pro); Kimi Code (Moonshot Kimi-K2.6); and Code Buddy (Zhipu GLM-5.1 and MiniMax-M2.7). Each was run on the agent’s high effort setting. The engine is Godot 4, and the team’s benchmark and execution harness is built on Harbor.

The 15 game families in the suite are weighted toward platformers (19 tasks), strategy (17), tycoon (16) and open-world (15), which by those counts make up 67 of the 140 tasks. Coverage is thinner for roguelike, visual novel, puzzle, shooter, simulation, card, horror, rhythm, idle, racing and sports. Full per-task and per-family breakdowns are on the project site.

Scope a game-agent pilot to one mechanic

For a small team thinking about using a coding agent to spin up a game prototype — a mechanics demo for a pitch, a training scenario, an interactive onboarding flow — the benchmark says three useful things:

Aim at one mechanic, not a whole game. Opus-4.7’s 55% mechanics score against its 36% art score shows where these agents actually win today. Build the dodge-roll, then hand the rest to a human.
Insist the output runs. The judging pipeline gates everything on whether the Godot project launches. If your agent can’t produce a project that boots, nothing else matters — code that compiles isn’t enough.
Give the agent eyes. Rendered gameplay feedback is what unblocks stuck builds. A workflow where the agent watches a screencast or frame dump of its own output — and re-runs against it — will outperform one that only re-reads source. Cursor-style agents that can run the Godot editor and capture screenshots are the practical shape of this today.
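The second point, insisting the output runs, can be checked mechanically before anyone looks at the game. The sketch below is one way a team might do that, not the benchmark's own harness: it assumes a Godot 4 binary is on the PATH under the name `godot` (the name and the timeout are placeholders) and uses the engine's `--headless`, `--path` and `--quit` command-line options to boot the project once and exit.

```python
import subprocess
from pathlib import Path


def build_launch_command(project_dir: str, godot_bin: str = "godot") -> list[str]:
    """Build a command that boots a Godot 4 project without a window and exits.

    Assumption: `godot_bin` names a Godot 4 executable on the PATH.
    """
    return [godot_bin, "--headless", "--path", project_dir, "--quit"]


def project_boots(project_dir: str, godot_bin: str = "godot",
                  timeout_s: float = 120.0) -> bool:
    """Coarse launch gate: True only if the engine starts and exits cleanly."""
    # A Godot project directory must contain a project.godot file.
    if not (Path(project_dir) / "project.godot").is_file():
        return False
    try:
        completed = subprocess.run(
            build_launch_command(project_dir, godot_bin),
            capture_output=True,
            timeout=timeout_s,
            check=False,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Engine binary missing, or the project hung on startup.
        return False
    return completed.returncode == 0
```

A clean exit is only a first filter. It says the project booted, not that the game is playable, so it belongs in front of a rendered-gameplay check rather than in place of one.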

The headline reading is the same as the project site’s: the bottleneck isn’t coding speed, it’s that the agent has no closed loop from “the code compiles” to “the player can see what’s happening”. Don’t promise stakeholders a finished game from a coding agent in 2026; the leaderboard suggests that’s still two or three product cycles away, even for the strongest agent. A pilot scoped to a single mechanic, with a human in the loop for art and content, is a credible ask that a small UK team could start this afternoon.
