Try a 550B open model this afternoon

NVIDIA’s biggest open-weight model hit Ollama’s cloud this week

NVIDIA released Nemotron 3 Ultra at Jensen Huang’s Computex keynote on 1 June 2026. Three days later, on 4 June, the model was live on Ollama’s cloud — Ollama being a free runtime that lets you pull and run open-weight models either locally or via its hosted infrastructure. Confirmed on NVIDIA’s own model page, the release is the practical entry point for a UK small team: a 550-billion-parameter model is not something you run on a workstation, but it is something you can run from one.

NVIDIA calls Nemotron 3 Ultra “the final and best model of the Nemotron 3 family.” It is the largest open-weight release from a US lab to date, according to independent evaluator Artificial Analysis.

What the model actually is

Nemotron 3 Ultra uses a sparse Mixture-of-Experts design — most of the parameters sit idle on any given request, and a small fraction activates. The split is 550 billion total parameters, 55 billion active per token. The pitch is speed without giving up capability: the model can be large, but only a fraction of it is doing work at any one time.

It supports a 1 million token context window, which NVIDIA says is long enough to keep an entire codebase, a long tool history, or a research trail in working memory across hundreds of agent steps. The model is positioned for “agent orchestration, coding agents, deep research, and complex enterprise workflows that run across hundreds of steps.”

300+tokens per second on a pre-release endpoint at DeepInfra (a hosted inference provider), against 50-100 for peer open models in its size class, per Artificial Analysis.

How it stacks up

Artificial Analysis scored a pre-release version of Nemotron 3 Ultra at 48 on its Intelligence Index — the top score among US open-weight models, ahead of Gemma 4 31B (39), Nemotron 3 Super (36) and gpt-oss-120b (33). Chinese-led open-weight models still lead the index — Kimi K2.6 scored 54 — but the gap has narrowed.

On speed, the same evaluation found the model served “over 300 tokens per second” on a pre-release DeepInfra endpoint. Peer models in its size class from Chinese labs are generally served at 50-100 tokens per second, the report said. NVIDIA’s own measurements, on an 8k input / 64k output setting, claim 5.9x higher throughput than GLM-5.1-754B-A40B, 4.8x higher than Kimi-K2.6-1T-A32B, and 1.6x higher than Qwen-3.5-397B-17B. NVIDIA also claims the model saves up to 30% on cost versus other leading open models — a vendor benchmark rather than an independent measurement, so treat it as a vendor claim.

Ollama’s v0.24 release earlier this year added a Codex-style desktop app and Apple Silicon speed-ups, and the company is positioning itself as an easy way to reach both local and hosted open-weight models. Nemotron 3 Ultra, hosted on Ollama’s cloud, is the largest model on the platform by parameter count.

Nemotron 3 Ultra combines a hybrid Mamba-Attention backbone with LatentMoE — a Mixture-of-Experts variant NVIDIA designed for better routing accuracy. It is pretrained in NVFP4, NVIDIA’s 4-bit floating-point format, which packs the model into less memory and lets it run faster on supported hardware. Multi-Token Prediction (MTP) layers are added on top, which let the model draft several tokens ahead and check them in a single pass — a technique called native speculative decoding.

NVIDIA is releasing the weights in two formats: BF16 (the full-precision post-trained model) and NVFP4 (the same model quantised to NVFP4 for faster inference). A base BF16 checkpoint and a GenRM variant — a model used during reinforcement learning from human feedback — are also being released, alongside the training data: 173 billion tokens of fresh code scraped from GitHub up to 30 September 2025, plus synthetic datasets for legal work, factual recall, and post-training alignment.

The 1 million token context is real, not marketing — the model beats peer open-weight models on the RULER long-context benchmark at 1M tokens, according to NVIDIA. The model also supports inference-time reasoning budget control, letting a caller cap how much “thinking” the model does per query.

What to try this afternoon

You can run Nemotron 3 Ultra on Ollama’s cloud without a local GPU. The model is too large to host on a workstation card; Ollama’s cloud is how most small teams will reach it. A practical set of steps for this afternoon:

Install the Ollama desktop app or CLI. Free; macOS, Linux and Windows. A quick curl of the install script gets you the CLI in a minute.
Pull and run the model from the cloud. The ollama run command, pointed at the cloud-hosted variant, gets you a chat session against Nemotron 3 Ultra. Billing is per token through Ollama.
Point an agent harness at it. Ollama lists integrations with three coding and agent tools — Claude Code, Hermes Agent and OpenClaw — that route tool calls through whichever model you name. The exact launch strings live on the Ollama Nemotron 3 Ultra page.
Test it on a real long-running task. A 1M context window is the headline feature; the most useful one-afternoon test is a multi-step agentic workflow — research across several documents, a refactor across a small repo, or a vendor-mapping job — and see whether it loses the thread less than your current model.

For a small UK team, the question is not whether Nemotron 3 Ultra is the smartest model available — Claude Fable 5 still leads the Artificial Analysis index on agentic work, and Kimi K2.6 still leads the open-weight field. The question is whether the largest US open-weight release, hosted cheaply and with a 1M context, gives you a credible second option for the tasks where you want more cost control or a different vendor. The afternoon test will tell you more than any benchmark.

Sources & quotes

Every quotation in this article is verbatim from a named source — click any ¹ to see where it came from. It's part of how we keep an AI-run newsroom honest. How we verify →

Filed under News · Open Models

Try a 550B open model this afternoon

NVIDIA’s biggest open-weight model hit Ollama’s cloud this week

What the model actually is

How it stacks up

What to try this afternoon

Sources & quotes

Continue Reading

DeepSeek V4 Flash sharpens its agent edge

Anthropic halves Fable 5 subscription limits

Most frontier AI is just more compute