What GameCraft-Bench tested
A research team from CUHK Shenzhen and Tencent’s Hunyuan released GameCraft-Bench on 16 June. The benchmark runs 140 game-building tasks inside Godot — the free, open-source engine popular with indie developers — across 15 genres from platformers to tycoon games to roguelikes (Hugging Face paper page).
Each task hands the agent a brief in plain English (for example, “build a tower defence with three enemy types and a score system”) and asks for a complete, launchable Godot project plus a short recording of someone playing it. An automated judge then boots the project, replays the recording, watches the gameplay, and scores the result against a hidden checklist (project site).
The point is to test the whole loop — agent writes the game, engine runs it, gameplay evidence is judged — not just whether the code compiles.
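The shape of that loop can be sketched in a few lines of Python. This is an illustrative sketch only, not the benchmark's actual judge: the `ChecklistItem` type, the `score_submission` function, the item wording and the equal weighting of checklist items are all assumptions. Only the category names and the idea that a project which fails to launch earns nothing come from the article.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One hidden-checklist entry (hypothetical structure)."""
    category: str     # e.g. "core mechanics"; category names follow the article
    description: str  # illustrative wording, not a real benchmark item
    passed: bool      # verdict a judge would reach after watching the gameplay


def score_submission(launched: bool, items: list[ChecklistItem]) -> dict[str, float]:
    """Return per-category pass rates in percent, plus an "overall" entry.

    Assumptions: every item counts equally, and a project that does not
    launch scores zero in every category.
    """
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for item in items:
        totals[item.category] += 1
        if launched and item.passed:
            passes[item.category] += 1

    scores = {cat: 100.0 * passes[cat] / totals[cat] for cat in totals}
    total_items = sum(totals.values())
    scores["overall"] = (
        100.0 * sum(passes.values()) / total_items if total_items else 0.0
    )
    return scores


# Hypothetical checklist for the tower-defence brief.
example_items = [
    ChecklistItem("core mechanics", "towers fire at enemies", True),
    ChecklistItem("core mechanics", "score increases on a kill", True),
    ChecklistItem("content depth", "three distinct enemy types", False),
    ChecklistItem("art and presentation", "HUD is readable", True),
]
print(score_submission(True, example_items))
```

Splitting the result by category is what makes a pattern like strong mechanics and weak content visible, instead of hiding it inside one overall number.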
The leaderboard at a glance
Seven frontier coding agents took the full 140-task suite. The strongest configuration — Claude Code on Anthropic’s Opus-4.7 — reached 41.46%. OpenAI’s GPT-5.5 via Codex was next at 39.49%. The rest of the field sat well below 40%, and DeepSeek-V4-Pro via Codex bottomed out at 2.15%.
The pattern across the table is more interesting than any single number. Agents do best on core mechanics — Opus-4.7 hit 55.34% there, GPT-5.5 54.36% — and worst on content depth and art and presentation, where even the leader dropped into the mid-30s.
In plain English: the agents can build a jump, a collision, a turn cycle. They cannot reliably build a full game around those things.
What the failures actually look like
The project site flags four findings that point at the same lesson:
- Recognisable mechanics are easier than complete games. Agents more often produce local mechanics but fail to assemble them into coherent whole games.
- Rendered gameplay feedback helps debugging. Agents that watch the game run catch player-facing failures invisible in source code or terminal logs.
- Execution effort alone does not predict quality. Burning more agent turns on a task does not reliably make the output more playable.
- Game generation ability is not monolithic. Mechanics, content, visual feedback and presentation only partially correlate across generated games.
Community reaction tracked the same line. On X, NOVA (@N0V4Dev) wrote that prior AI game-development benchmarks mostly tested simple snippets or text adventures — and that GameCraft-Bench finally tests whether agents can build fully playable games:
“Most AI game development benchmarks used to focus on simple code snippets or text-based adventures. This approach ignored the complexity of modern game engines and asset management. Now researchers have introduced GameCraft-Bench to test if agents can build fully playable games…”
— NOVA (@N0V4Dev) Jun 17, 2026
How the evaluation works under the hood
Scope a game-agent pilot to one mechanic
For a small team thinking about using a coding agent to spin up a game prototype — a mechanics demo for a pitch, a training scenario, an interactive onboarding flow — the benchmark says three useful things:
- Aim at one mechanic, not a whole game. Opus-4.7’s 55% mechanics score against its 36% art score is the shape of where these agents actually win today. Build the dodge-roll, then hand the rest to a human.
- Insist the output runs. The judging pipeline gates everything on whether the Godot project launches. If your agent can’t produce a project that boots, nothing else matters — code that compiles isn’t enough.
- Give the agent eyes. Rendered gameplay feedback is what unblocks stuck builds. A workflow where the agent watches a screencast or frame dump of its own output — and re-runs against it — will outperform one that only re-reads source. Cursor-style agents that can run the Godot editor and capture screenshots are the practical shape of this today.
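The second point, that the output must boot, is easy to automate as a first gate in a pilot. The sketch below assumes Godot 4 is installed and reachable as `godot` on the PATH; `--headless`, `--path` and `--quit` are standard Godot 4 command-line options. The helper names `godot_smoke_command` and `project_boots` are hypothetical, as is the choice of a 60-second timeout. This is not how the benchmark's own judge launches projects, which the article does not describe.

```python
import subprocess
from pathlib import Path


def godot_smoke_command(project_dir: str, godot_bin: str = "godot") -> list[str]:
    """Build a command that starts the project without a window, then quits."""
    return [godot_bin, "--headless", "--path", project_dir, "--quit"]


def project_boots(
    project_dir: str,
    godot_bin: str = "godot",
    timeout_s: float = 60.0,
    runner=subprocess.run,  # injectable so the gate can be tested without Godot
) -> bool:
    """Return True if the project exists and Godot exits cleanly.

    Assumption: a missing project file, a missing binary, a hang past the
    timeout, or a non-zero exit code all count as "does not boot".
    """
    # Every Godot project has a project.godot file at its root.
    if not (Path(project_dir) / "project.godot").is_file():
        return False
    try:
        result = runner(
            godot_smoke_command(project_dir, godot_bin),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

A clean exit is only a weak signal: a project can start and still throw script errors at runtime, so a real pilot would probably also inspect the captured stderr and, in line with the third point, look at rendered frames rather than trusting the exit code.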
The headline reading is the same as the project site’s: the bottleneck isn’t coding speed, it’s that the agent has no closed loop from “the code compiles” to “the player can see what’s happening”. Don’t promise stakeholders a finished game from a coding agent in 2026: the leaderboard suggests that’s still two or three product cycles away, even on the best engine. A pilot scoped to a single mechanic, with a human in the loop for art and content, is a credible this-afternoon ask for a small UK team.


