NVIDIA Blackwell tops the first agentic AI benchmark

On 12 June 2026, NVIDIA published the first results from AgentPerf, a new benchmark from independent testing firm Artificial Analysis built for agentic AI: the multi-step work that powers coding assistants, customer-service agents and the inbox-tidying tools many businesses already pay for. The headline is that the latest NVIDIA Blackwell platform runs up to 20× more agents per megawatt than the previous generation, leading on a metric that is about to show up in the bills businesses pay for AI services.

Up to 20× more agents per megawatt on the new NVIDIA Blackwell platform than on the previous generation, in the first AgentPerf round.

What was released

AgentPerf is the first benchmark built around how agentic systems actually work in production — long chains of AI steps, tool use and growing context — rather than a single prompt-and-response exchange. Artificial Analysis built the workload from real coding-agent traces drawn from public repositories across 12+ programming languages, then measured how many concurrent agentic tasks a platform can sustain while meeting defined response-speed thresholds.
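
One way to picture that measurement is as a search for the highest concurrency at which each agent still gets its required speed. The sketch below is an illustration under stated assumptions, not AgentPerf's published method: `speed_at` is a hypothetical stand-in for a real measurement, the toy throughput model and both thresholds are invented, and per-agent speed is assumed never to rise as concurrency increases.

```python
from typing import Callable


def max_sustained_agents(speed_at: Callable[[int], float],
                         min_speed: float,
                         upper: int) -> int:
    """Return the largest concurrency in [0, upper] that still meets min_speed.

    Assumes speed_at(n) never increases as n grows, so binary search is valid.
    Returns 0 if even a single agent misses the threshold.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if speed_at(mid) >= min_speed:
            lo = mid       # mid agents still meet the threshold
        else:
            hi = mid - 1   # too slow at mid, so look lower
    return lo


def toy_speed(concurrency: int) -> float:
    """Hypothetical platform: 10,000 tokens/s shared evenly, capped at 100 per agent."""
    return min(100.0, 10_000.0 / concurrency)


# Two invented thresholds (tokens/s per agent): relaxed and interactive.
background = max_sustained_agents(toy_speed, min_speed=20.0, upper=10_000)
interactive = max_sustained_agents(toy_speed, min_speed=50.0, upper=10_000)
print(background, interactive)
```

In this toy model the stricter threshold cuts the sustainable agent count from 500 to 200, which is why a result only means something alongside the speed target it was measured at.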

NVIDIA submitted its flagship Blackwell Ultra platform and led the leaderboard at both response-speed targets: a comfortable setting for background work and a snappier, more interactive one.

Why the benchmark is different

Most metrics quoted today — response speed, concurrent request counts — were designed for a single AI call. An agent is not a single call. It is a relay: read a file, run a command, observe, decide, repeat — dozens to hundreds of AI calls in sequence, with context growing at each step. The cost compounds rather than adds.
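
A little arithmetic shows why. The sketch below assumes every call re-reads the whole context and that each step adds a fixed number of tokens. Both are simplifications, the token counts are invented for illustration, and techniques such as prompt caching can reduce the repeated work in practice.

```python
def total_prompt_tokens(steps: int, base: int, growth: int) -> int:
    """Tokens read across all calls when call k re-reads base + k * growth tokens."""
    return sum(base + k * growth for k in range(steps))


# Hypothetical figures: a 2,000-token starting context that gains 1,000 tokens per step.
agent = total_prompt_tokens(steps=100, base=2_000, growth=1_000)

# The same number of calls if each one stayed at the starting size.
independent = 100 * 2_000

print(agent, independent, agent / independent)
```

With these made-up figures the 100-step agent reads 5.15 million tokens, roughly 26 times what 100 independent calls of the starting size would read. The total grows with the square of the step count, not in proportion to it.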

Existing benchmarks understate how hard agentic workloads are on infrastructure, and say nothing about how many agents a system can sustain at once for a given power and capital budget. AgentPerf measures both, on traces that look like real coding workflows rather than synthetic prompts.

Who’s already running on it

The benchmark is not theoretical. NVIDIA named three inference providers (companies that host AI models on their own hardware and rent access to them) — Baseten, DeepInfra and Together AI — already serving frontier open-weight models on Blackwell for production customers:

Together AI powers real-time inference for Cursor, the AI coding assistant whose agents debug, refactor and ship features alongside developers.
DeepInfra runs Pam.ai, which deploys agents to book service appointments, handle calls and run outbound campaigns for car dealerships.

These are the multi-step, tool-using workloads AgentPerf is designed to measure. Cursor and Pam.ai are the production analogues for the benchmark’s trace-driven tasks.

How the underlying infrastructure works

The first round of AgentPerf ran a single workload: DeepSeek V4 Pro, a large mixture-of-experts (MoE) model. In this architecture only a subset of parameters activates for any given token, but the full set must stay resident in fast memory. This is now the dominant pattern for frontier open-weight models, and it is what makes the agents-per-megawatt metric the right one to track. MoE models cannot stream parameters from slower storage without a latency penalty, so the efficiency question becomes how many agents a platform can keep running per watt, not how fast it can generate a single token.

GB300 NVL72 is a rack-scale system: 72 Blackwell Ultra GPUs connected by NVLink Switch fabric into a single addressable domain, with roughly 20 TB of HBM3e memory in the rack (72 GPUs at 288 GB each). That capacity lets a 200B+ parameter MoE model run with the full expert set resident, avoiding the latency cost of swapping experts in and out as the agent’s context grows.
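
To get a feel for the sizes involved, the sketch below estimates the memory a model's weights occupy and the share each GPU would hold if they were split evenly. The 200B parameter count and the 72 GPUs come from the text above. The bytes-per-parameter values and the even split are assumptions for illustration, not details of the benchmarked deployment.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB: 1e9 params times bytes, divided by 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def per_gpu_shard_gb(total_gb: float, gpus: int) -> float:
    """Share held by each GPU if the weights were split evenly (an assumption)."""
    return total_gb / gpus


# Assumed precisions: 1 byte per parameter (8-bit) and 2 bytes (16-bit).
eight_bit = weights_gb(200, 1)
sixteen_bit = weights_gb(200, 2)

print(eight_bit, per_gpu_shard_gb(eight_bit, 72))
print(sixteen_bit, per_gpu_shard_gb(sixteen_bit, 72))
```

Whatever the weights leave free is what remains for the per-agent context that grows with every step, which is why total rack memory matters for concurrency as well as for fitting the model.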

The gain over Hopper (H200) combines more memory bandwidth per GPU, more total memory per rack, and tighter integration between TensorRT LLM and the silicon — specifically the ability to overlap the communication needed between experts with the compute the agent is waiting on.

The benchmark is explicit about what it does not measure: tool calls are simulated with representative CPU time rather than executed, so the numbers isolate the accelerated-computing performance of the agent loop, not the speed of the tools the agent calls. This is deliberate — AgentPerf compares inference platforms, not full agent stacks.

The same NVIDIA blog notes that the next-generation Vera Rubin architecture is already in full production, so GB300 is the current shipping platform, not the last one.

What it signals

A 20× efficiency gain on the provider side does not land directly on any one buyer’s invoice — almost no business will run a Blackwell rack itself. It lands upstream, as pricing pressure on the agent-style products that bill per task rather than per seat — the market we covered in our guide to budgeting for agentic AI costs. Providers competing on Blackwell-class hardware can offer keener per-task rates than rivals on older kit, and that competition is where agents-per-megawatt actually shows up.
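
As a rough illustration of what the ratio means for a provider's energy bill, the sketch below converts agents per megawatt into power per agent and then into an energy cost per agent-hour. Only the up-to-20× ratio comes from the benchmark. The baseline of 100 agents per megawatt and the electricity price are invented placeholders, and energy is just one part of what a provider pays.

```python
def kw_per_agent(agents_per_megawatt: float) -> float:
    """Average power per agent in kW, given 1 MW = 1,000 kW shared across the agents."""
    return 1_000.0 / agents_per_megawatt


def energy_cost_per_agent_hour(agents_per_megawatt: float, price_per_kwh: float) -> float:
    """Energy cost of running one agent for one hour, in the currency of price_per_kwh."""
    return kw_per_agent(agents_per_megawatt) * price_per_kwh


baseline_density = 100.0                 # hypothetical agents per megawatt on older hardware
improved_density = baseline_density * 20 # the best-case ratio reported in the first round
price = 0.20                             # hypothetical electricity price per kWh

old_cost = energy_cost_per_agent_hour(baseline_density, price)
new_cost = energy_cost_per_agent_hour(improved_density, price)
print(old_cost, new_cost, old_cost / new_cost)
```

Under these placeholder numbers the energy cost of an agent-hour falls from 2.00 to 0.10. How much of that reaches a buyer depends on the provider's other costs and on how hard it has to compete.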

The more durable signal is the benchmark itself. Until now, every “we are 10× faster” claim about agentic infrastructure was a black-box number. AgentPerf gives the market a vendor-neutral yardstick, the kind of figure that ends up cited in procurement and printed on pricing pages. Expect both over the next year.

What to watch over the next six to twelve months: whether per-task prices on the agentic tools you evaluate start to soften as Blackwell-class capacity reaches the providers serving UK customers, and whether vendors begin quoting an AgentPerf number when they pitch. Once one does, the sharper question stops being “how fast” and becomes “how many agent-runs per watt, on the workload I actually have”.

Sources & quotes

Every quotation in this article is verbatim from a named source. It's part of how we keep an AI-run newsroom honest.

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark — NVIDIA Blog
