NVIDIA’s biggest open-weight model hit Ollama’s cloud this week
NVIDIA released Nemotron 3 Ultra at Jensen Huang’s Computex keynote on 1 June 2026. Three days later, on 4 June, the model was live on Ollama’s cloud — Ollama being a free runtime that lets you pull and run open-weight models either locally or via its hosted infrastructure. Confirmed on NVIDIA’s own model page, the release is the practical entry point for a UK small team: a 550-billion-parameter model is not something you run on a workstation, but it is something you can run from one.
NVIDIA calls Nemotron 3 Ultra “the final and best model of the Nemotron 3 family.” It is the largest open-weight release from a US lab to date, according to independent evaluator Artificial Analysis.
What the model actually is
Nemotron 3 Ultra uses a sparse Mixture-of-Experts design — most of the parameters sit idle on any given request, and a small fraction activates. The split is 550 billion total parameters, 55 billion active per token. The pitch is speed without giving up capability: the model can be large, but only a fraction of it is doing work at any one time.
It supports a 1 million token context window, which NVIDIA says is long enough to keep an entire codebase, a long tool history, or a research trail in working memory across hundreds of agent steps. The model is positioned for “agent orchestration, coding agents, deep research, and complex enterprise workflows that run across hundreds of steps.”
300+tokens per second on a pre-release endpoint at DeepInfra (a hosted inference provider), against 50-100 for peer open models in its size class, per Artificial Analysis.
How it stacks up
Artificial Analysis scored a pre-release version of Nemotron 3 Ultra at 48 on its Intelligence Index — the top score among US open-weight models, ahead of Gemma 4 31B (39), Nemotron 3 Super (36) and gpt-oss-120b (33). Chinese-led open-weight models still lead the index — Kimi K2.6 scored 54 — but the gap has narrowed.
On speed, the same evaluation found the model served “over 300 tokens per second” on a pre-release DeepInfra endpoint. Peer models in its size class from Chinese labs are generally served at 50-100 tokens per second, the report said. NVIDIA’s own measurements, on an 8k input / 64k output setting, claim 5.9x higher throughput than GLM-5.1-754B-A40B, 4.8x higher than Kimi-K2.6-1T-A32B, and 1.6x higher than Qwen-3.5-397B-17B. NVIDIA also claims the model saves up to 30% on cost versus other leading open models — a vendor benchmark rather than an independent measurement, so treat it as a vendor claim.
Ollama’s v0.24 release earlier this year added a Codex-style desktop app and Apple Silicon speed-ups, and the company is positioning itself as an easy way to reach both local and hosted open-weight models. Nemotron 3 Ultra, hosted on Ollama’s cloud, is the largest model on the platform by parameter count.
What to try this afternoon
You can run Nemotron 3 Ultra on Ollama’s cloud without a local GPU. The model is too large to host on a workstation card; Ollama’s cloud is how most small teams will reach it. A practical set of steps for this afternoon:
- Install the Ollama desktop app or CLI. Free; macOS, Linux and Windows. A quick
curlof the install script gets you the CLI in a minute. - Pull and run the model from the cloud. The
ollama runcommand, pointed at the cloud-hosted variant, gets you a chat session against Nemotron 3 Ultra. Billing is per token through Ollama. - Point an agent harness at it. Ollama lists integrations with three coding and agent tools — Claude Code, Hermes Agent and OpenClaw — that route tool calls through whichever model you name. The exact launch strings live on the Ollama Nemotron 3 Ultra page.
- Test it on a real long-running task. A 1M context window is the headline feature; the most useful one-afternoon test is a multi-step agentic workflow — research across several documents, a refactor across a small repo, or a vendor-mapping job — and see whether it loses the thread less than your current model.
For a small UK team, the question is not whether Nemotron 3 Ultra is the smartest model available — Claude Fable 5 still leads the Artificial Analysis index on agentic work, and Kimi K2.6 still leads the open-weight field. The question is whether the largest US open-weight release, hosted cheaply and with a 1M context, gives you a credible second option for the tasks where you want more cost control or a different vendor. The afternoon test will tell you more than any benchmark.
Sources & quotes
Every quotation in this article is verbatim from a named source — click any 1 to see where it came from. It's part of how we keep an AI-run newsroom honest. How we verify →


