News · AI Infrastructure

NVIDIA Blackwell tops the first agentic AI benchmark

Artificial Analysis launched AgentPerf to measure agent workloads, not single chats. NVIDIA's latest Blackwell platform leads on agents-per-megawatt — the metric that quietly sets the cost floor for agentic AI services.

RAR Editor
Published June 2026 · 5 min read
The Quick Version
  • Artificial Analysis released AgentPerf, the first benchmark built specifically for multi-step AI work, with results published on 12 June 2026.
  • NVIDIA's latest Blackwell platform ran up to 20× more AI agents per megawatt than the previous generation, leading the first-round leaderboard.
  • Three providers — Baseten, DeepInfra and Together AI — are already running production agent services on Blackwell for customers including Cursor and Pam.ai.
  • The efficiency gain should reach buyers indirectly, as pressure on per-task AI pricing over the next 6–12 months.
  • Because AgentPerf is vendor-neutral, it gives buyers a third-party figure to cite against any '10× faster' infrastructure claim in a procurement conversation.

On 12 June 2026, NVIDIA published the first results from AgentPerf, a new benchmark from independent testing firm Artificial Analysis built for agentic AI: the multi-step work that powers coding assistants, customer-service agents and the inbox-tidying tools many businesses already pay for. The headline: the latest NVIDIA Blackwell platform runs up to 20× more agents per megawatt than the previous generation, leading on a metric that is about to show up in the bills businesses pay for AI services.

20× more agents per megawatt on the new NVIDIA Blackwell platform than on the previous generation, in the first AgentPerf round.

What was released

AgentPerf is the first benchmark built around how agentic systems actually work in production — long chains of AI steps, tool use and growing context — rather than a single prompt-and-response exchange. Artificial Analysis built the workload from real coding-agent traces drawn from public repositories across 12+ programming languages, then measured how many concurrent agentic tasks a platform can sustain while meeting defined response-speed thresholds.

NVIDIA submitted its flagship Blackwell Ultra platform, which topped the leaderboard at both response-speed targets: a relaxed setting suited to background work and a faster, more interactive one.

Why the benchmark is different

Most metrics quoted today — response speed, concurrent request counts — were designed for a single AI call. An agent is not a single call. It is a relay: read a file, run a command, observe, decide, repeat — dozens to hundreds of AI calls in sequence, with context growing at each step. The cost compounds rather than adds.
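To see why the cost compounds, consider a rough sketch of a single agent run. It assumes, purely for illustration, that every call re-reads the whole context so far and that the context grows by a fixed amount per step. The function name and the token figures are hypothetical, and real serving stacks reduce the repeated work with prompt caching.

```python
def total_tokens_processed(steps: int, base_context: int, growth_per_step: int) -> int:
    """Sum the context the model reads across every step of one agent run.

    Hypothetical model: step i reads base_context + i * growth_per_step tokens.
    """
    return sum(base_context + i * growth_per_step for i in range(steps))


# Illustrative figures only, not taken from AgentPerf.
STEPS = 50
BASE_CONTEXT = 2_000      # tokens in the first call
GROWTH_PER_STEP = 1_500   # tokens added by each tool result

# What you would expect if cost simply added up: 50 calls of 2,000 tokens.
additive_estimate = STEPS * BASE_CONTEXT

# What the growing context actually requires under this model.
compounded_total = total_tokens_processed(STEPS, BASE_CONTEXT, GROWTH_PER_STEP)

print(f"additive estimate: {additive_estimate:,} tokens")
print(f"with growing context: {compounded_total:,} tokens")
print(f"ratio: {compounded_total / additive_estimate:.1f}x")
```

Under these assumed numbers the growing context makes the run roughly nineteen times more expensive than the additive estimate, and the gap widens as the number of steps rises.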

Existing benchmarks understate how hard agentic workloads are on infrastructure, and say nothing about how many agents a system can sustain at once for a given power and capital budget. AgentPerf measures both, on traces that look like real coding workflows rather than synthetic prompts.
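The arithmetic behind an agents-per-megawatt figure can be sketched as follows. This is a minimal illustration under stated assumptions, not AgentPerf's published method: it assumes the metric is the number of concurrent agents sustained within the response-speed threshold, divided by the power drawn by the system under test. The `PlatformResult` type and all of the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlatformResult:
    """Hypothetical record of one benchmark submission."""
    name: str
    concurrent_agents: int  # agents sustained within the latency threshold
    power_kw: float         # assumed power draw of the system under test, in kW


def agents_per_megawatt(result: PlatformResult) -> float:
    """Normalise sustained concurrency to one megawatt (1,000 kW) of power."""
    return result.concurrent_agents * 1000.0 / result.power_kw


# Made-up figures chosen only to show how a 20x ratio could arise.
previous_gen = PlatformResult("previous generation", concurrent_agents=120, power_kw=100.0)
new_gen = PlatformResult("new platform", concurrent_agents=2_880, power_kw=120.0)

for result in (previous_gen, new_gen):
    print(f"{result.name}: {agents_per_megawatt(result):,.0f} agents per MW")

ratio = agents_per_megawatt(new_gen) / agents_per_megawatt(previous_gen)
print(f"efficiency ratio: {ratio:.0f}x")
```

Normalising by power rather than by chip count is what lets a buyer compare platforms with very different rack sizes on a single number.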

Who’s already running on it

The benchmark is not theoretical. NVIDIA named three inference providers (companies that host AI models on their own hardware and rent access to them) — Baseten, DeepInfra and Together AI — already serving frontier open-weight models on Blackwell for production customers:

  • Together AI powers real-time inference for Cursor, the AI coding assistant whose agents debug, refactor and ship features alongside developers.
  • DeepInfra runs Pam.ai, which deploys agents to book service appointments, handle calls and run outbound campaigns for car dealerships.

These are the multi-step, tool-using workloads AgentPerf is designed to measure. Cursor and Pam.ai are production counterparts to the trace-derived tasks the benchmark runs.

What it signals

A 20× efficiency gain on the provider side does not land directly on any one buyer’s invoice — almost no business will run a Blackwell rack itself. It lands upstream, as pricing pressure on the agent-style products that bill per task rather than per seat — the market we covered in our guide to budgeting for agentic AI costs. Providers competing on Blackwell-class hardware can offer keener per-task rates than rivals on older kit, and that competition is where agents-per-megawatt actually shows up.

The more durable signal is the benchmark itself. Until now, every "we are 10× faster" claim about agentic infrastructure was a black-box number. AgentPerf gives the market a vendor-neutral yardstick, the kind of figure that ends up cited in procurement and printed on pricing pages. Expect both over the next year.

What to watch over the next six to twelve months: whether per-task prices on the agentic tools you evaluate start to soften as Blackwell-class capacity reaches the providers serving UK customers, and whether vendors begin quoting an AgentPerf number when they pitch. Once one does, the sharper question stops being "how fast?" and becomes "how many agent-runs per watt, on the workload I actually have?"

Sources & quotes

  1. NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark — NVIDIA Blog