CoreWeave Becomes One of the First Cloud Providers to Achieve NVIDIA Exemplar Cloud Validation for Inference on NVIDIA GB200 NVL72

CoreWeave achieves NVIDIA Exemplar Cloud validation for both training and inference on the NVIDIA Grace Blackwell Platform

Running production-scale inference workloads is a data-center-scale challenge that requires optimization across the entire AI infrastructure stack. When that optimization breaks down, performance suffers: user experiences slow, compute costs rise, and reliability becomes unpredictable, slowing AI innovation and driving up TCO. By establishing the Exemplar Cloud program in 2025, NVIDIA gave cloud providers a standard benchmark for validating their infrastructure performance.

Today, CoreWeave has become one of the first cloud providers to achieve NVIDIA Exemplar Cloud validation for Inference on NVIDIA GB200 NVL72. CoreWeave demonstrated extraordinary inference throughput and latency results, meeting NVIDIA’s high performance standards based on its reference architecture.

This follows our recent milestone as one of the first cloud providers to achieve NVIDIA Exemplar Cloud validation for Training on NVIDIA GB200 NVL72, and it is further proof that CoreWeave Cloud delivers a highly performant platform not only for training AI models, but also for serving them efficiently and reliably in production.

Together, these validations for both training and inference showcase CoreWeave’s vertically integrated stack, with Mission Control providing the operating standard for AI cloud and the most performant environment for the entire AI lifecycle. CoreWeave meticulously engineers every layer of the stack, from bare-metal infrastructure to inference serving, to bring out the optimal combined performance of hardware and software.

NVIDIA Exemplar Cloud represents a consistent benchmarking framework   

NVIDIA Exemplar Cloud provides a standard benchmark for cloud providers to validate workload performance in the cloud. Every participating provider undergoes a comprehensive evaluation process designed to reflect real-world customer needs for highly complex and demanding AI workloads. Becoming an Exemplar Cloud requires demonstrating high performance and resiliency across a suite of open, workload-specific benchmarking recipes covering inference, fine-tuning, and scaled pretraining. The result: a transparent comparison of performance that is validated using the same criteria. With this consistent benchmark data, AI pioneers gain the following benefits:

  • Predictable, consistent AI workload performance on NVIDIA‑accelerated cloud infrastructure, validated through joint testing and benchmarks
  • Confidence in a tuned, optimized infrastructure stack through co‑engineering and ongoing performance validation with NVIDIA
  • Objective benchmark data to guide which cloud environments to choose, grounded in real application performance measurements, not vendor claims

The results demonstrate how CoreWeave’s approach to GPU performance, combining full-stack observability via Mission Control with automated performance optimizations, consistently yields peak performance and reliability. AI pioneers can deploy large-scale training, disaggregated multi-node inference, or anything in between, confident that their jobs will run effectively and efficiently. This minimizes guesswork and gives them dependable access to new GPUs, providing the predictability, reproducibility, and performance they need as they evolve models, scale training, and run inference in production.

CoreWeave achieves NVIDIA’s inference benchmark targets

NVIDIA’s inference benchmarks test DeepSeek-R1, Llama 3.3, and GPT-OSS models in single- and multi-node configurations, measuring inference throughput and latency for common agentic use cases. NVIDIA specified the number of NVIDIA GB200 NVL72 GPUs along with TRT-LLM or SGLang as the serving backend. The multi-node throughput tests also used NVIDIA Dynamo, a high-throughput, low-latency distributed inference framework.

For each model and configuration, the benchmark evaluated five distinct test scenarios: Reasoning, Chat, Summarization, Generation, and Disaggregation, each with defined input and output context lengths. Each scenario is designed to stress-test a specific architectural area to ensure comprehensive coverage of the stack. Throughput was measured in Tokens-Per-Second per GPU (TPS/GPU), and Time-to-First-Token (TTFT) latency in milliseconds. Each test name below is followed by its (input context length/output context length); a brief code sketch of these scenario definitions follows the list.

  • Reasoning (1k/1k): Uses long prompts and completions reflecting Chain-of-Thought processing.
  • Chat (128/128): Evaluates responsiveness of interactive applications such as chat, prioritizing ultra-low latency and high user concurrency.
  • Summarization (8k/512): Tests the I/O and memory bandwidth required to ingest massive prompts before generating a concise output.
  • Generation (512/8k): Measures the raw throughput and efficiency of the generation phase, where the model must maintain high speed over a high volume of continuous token production.
  • Disaggregation (8k/1k across nodes): Evaluates the efficiency of disaggregated inference, where the prompt processing and token generation phases are split across different GPU nodes. 
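
As a concrete illustration, here is a minimal Python sketch of the five scenario definitions and the TPS/GPU throughput metric described above. The scenario names and context lengths come from the list (assuming 1k means 1,024 tokens); the helper function and its signature are illustrative assumptions, not NVIDIA’s actual benchmark harness.

```python
# Sketch of the five test scenarios and the TPS/GPU throughput metric.
SCENARIOS = {
    # name: (input_context_tokens, output_context_tokens)
    "reasoning":      (1024, 1024),
    "chat":           (128, 128),
    "summarization":  (8192, 512),
    "generation":     (512, 8192),
    "disaggregation": (8192, 1024),  # prefill and decode on separate nodes
}

def tps_per_gpu(total_output_tokens: int,
                wall_clock_seconds: float,
                num_gpus: int) -> float:
    """Throughput: tokens generated per second, normalized per GPU."""
    return total_output_tokens / wall_clock_seconds / num_gpus

# Example: 32 GPUs producing 1.2M output tokens in 60 s -> 625.0 TPS/GPU.
print(tps_per_gpu(1_200_000, 60.0, 32))
```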

Throughput tests were conducted using DeepSeek-R1, Llama 3.3, and GPT-OSS in single-node configurations with one to four NVIDIA Blackwell GPUs, and in a multi-node configuration with NVIDIA Dynamo using 32 NVIDIA Blackwell GPUs. CoreWeave met or exceeded the targets across all five scenarios described above.

While throughput measures the cluster’s ability to process and complete inference work, TTFT latency measures the speed of an individual request. In the era of agentic AI, where a single user request might trigger ten sequential model calls, latency becomes the primary constraint on responsiveness. If a model takes too long to produce its first token, the user experience suffers, and autonomous agents become too slow to act in dynamic environments.
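
To make that compounding effect concrete, here is a toy calculation, using hypothetical numbers rather than benchmark results, showing how per-call latency multiplies across an agent’s sequential model calls:

```python
# Toy arithmetic: an agentic request that chains ten sequential model
# calls pays the full per-call latency ten times over. Numbers are
# hypothetical, not benchmark results.
ttft_ms = 200        # hypothetical time to first token per call
decode_ms = 800      # hypothetical time to stream the remaining tokens
calls = 10           # sequential model calls in one agentic request

end_to_end_s = calls * (ttft_ms + decode_ms) / 1000
print(f"{end_to_end_s:.1f} s for one user request")  # 10.0 s
```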

The latency tests are designed to measure the system’s responsiveness. Rather than loading the GPUs to see how much they can handle, these tests measure how fast a single request can move through the stack under optimal conditions. They were conducted with DeepSeek-R1 and Llama 3.3 in single-node configurations with four GPUs. CoreWeave again met or exceeded the targets across all five scenarios.
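
For intuition, here is a hedged sketch of one way to measure TTFT: time the first streamed chunk from an OpenAI-compatible chat endpoint, the API style that serving stacks such as SGLang expose. The URL, model name, and payload are illustrative assumptions, and this is not NVIDIA’s benchmark harness.

```python
# Time the first streamed chunk from an OpenAI-compatible endpoint.
import time
import requests  # pip install requests

def measure_ttft_ms(url: str, model: str, prompt: str) -> float:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,       # stream tokens so the first chunk can be timed
        "max_tokens": 128,
    }
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # The first server-sent-event data chunk carries the first token.
            if line.startswith(b"data: ") and line != b"data: [DONE]":
                return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("no tokens were streamed")

# Hypothetical local endpoint:
# print(measure_ttft_ms("http://localhost:8000/v1/chat/completions",
#                       "deepseek-r1", "Summarize NVLink in one sentence."))
```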

Deep dive into CoreWeave’s inference results

CoreWeave’s inference performance advantages are driven by our vertically integrated stack, designed specifically for AI workloads. From compute and storage to networking and orchestration, CoreWeave’s architecture is purpose-built to maximize performance, resiliency, and efficiency. The runtime environment leverages a unified stack with performance optimizations across every layer, from metal to model.

CoreWeave Mission Control

Throughout our inference testing, CoreWeave Mission Control™ served as the central dashboard for managing the test environment. We utilized Mission Control’s deep observability to monitor every layer of the stack, from hardware-level telemetry such as GPU temperature and NVLink bandwidth to application-level metrics such as time-to-first-token.
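
As a hypothetical illustration of what that correlation looks like in practice, the sketch below polls a Prometheus-style HTTP API for hardware-level and application-level series side by side. The endpoint and metric names are placeholders, not Mission Control’s actual schema.

```python
# Correlate hardware- and application-level telemetry via a
# Prometheus-style HTTP API. Endpoint and metric names are hypothetical.
import requests  # pip install requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical

def query(promql: str) -> list:
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Pull a hardware signal and an application signal over the same window
# so regressions in one can be lined up against the other.
for metric in ("gpu_temperature_celsius",   # hardware-level (hypothetical)
               "nvlink_tx_bytes_per_sec",   # hardware-level (hypothetical)
               "inference_ttft_ms"):        # application-level (hypothetical)
    for series in query(f"avg({metric}) by (node)"):
        print(metric, series["metric"].get("node"), series["value"][1])
```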

To ensure NCCL performance was on target, we utilized a recently launched Mission Control feature, CoreWeave GPU Straggler Detection. This tool provides real-time collective metrics, including bus bandwidth, to verify that all GPUs and communication paths perform equally. GPU Straggler Detection also automatically detects hardware lockups and pinpoints their root cause.
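
Here is a minimal sketch of the idea behind straggler detection: flag any GPU whose measured NCCL bus bandwidth falls well below the fleet median. The data layout and threshold are illustrative assumptions; CoreWeave GPU Straggler Detection itself is a Mission Control feature, not this script.

```python
# Flag GPUs whose bus bandwidth lags the rest of the collective.
from statistics import median

def find_stragglers(busbw_gbps: dict, tolerance: float = 0.9) -> list:
    """Return GPU ids whose bus bandwidth is below tolerance * fleet median."""
    floor = tolerance * median(busbw_gbps.values())
    return [gpu for gpu, bw in busbw_gbps.items() if bw < floor]

measured = {"gpu0": 742.1, "gpu1": 739.8, "gpu2": 512.3, "gpu3": 741.0}
print(find_stragglers(measured))  # ['gpu2'] lags the rest of the collective
```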

Figure 1: CoreWeave GPU Straggler Detection surfacing collective performance metrics on inference_deepseek-r1-dynamo_671b_nvfp4

Figure 2: CoreWeave Mission Control dashboard correlating job-level metrics with infrastructure signals for inference_deepseek-r1-dynamo_671b_nvfp4

AI-optimized infrastructure

CoreWeave pairs NVIDIA GB200 NVL72 with high-bandwidth networking and optimized memory architectures to eliminate bottlenecks common in generalized cloud environments. This enables sustained high utilization and predictable inference performance at scale.

Optimized model serving for single and multi-node inference 

Inference workloads on CoreWeave leverage optimized serving runtimes and scheduling strategies designed to maximize throughput while minimizing latency. Efficient batching, request routing, and GPU utilization are built into the serving stack.
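
As a deliberately simplified sketch of the batching idea, the loop below queues incoming requests and dispatches a batch once it is full or once the oldest request has waited long enough. Production runtimes (e.g., continuous batching in TRT-LLM or SGLang) are far more sophisticated; the parameters here are illustrative assumptions.

```python
# Simplified dynamic batching: dispatch when the batch fills or times out.
import queue
import time

def run_inference(batch: list) -> None:
    print(f"dispatching one fused GPU call over {len(batch)} requests")

def batch_loop(requests_q: "queue.Queue[str]",
               max_batch: int = 8,
               max_wait_s: float = 0.01) -> None:
    while True:
        batch = [requests_q.get()]               # block for the first request
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                            # oldest request waited enough
            try:
                batch.append(requests_q.get(timeout=remaining))
            except queue.Empty:
                break
        run_inference(batch)
```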

Fast model startup and loading

Production inference depends largely on how quickly models can be brought online. CoreWeave’s high-performance object storage and fast provisioning pipelines significantly reduce model load times, improving time-to-first-token and scaling responsiveness.
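
One common way to cut model load time, sketched below under stated assumptions, is to fetch weight shards from object storage in parallel rather than sequentially. The endpoint, bucket, and shard names are hypothetical, and this illustrates the general technique rather than CoreWeave’s actual loading pipeline.

```python
# Parallel shard downloads keep the network pipe full during model startup.
from concurrent.futures import ThreadPoolExecutor
import boto3  # pip install boto3

s3 = boto3.client("s3", endpoint_url="https://object.example.com")  # hypothetical

def fetch_shard(key: str) -> str:
    local_path = "/models/" + key.rsplit("/", 1)[-1]
    s3.download_file("model-weights", key, local_path)  # hypothetical bucket
    return local_path

shards = [f"llama-3.3-70b/model-{i:05d}-of-00030.safetensors"
          for i in range(1, 31)]
with ThreadPoolExecutor(max_workers=16) as pool:
    paths = list(pool.map(fetch_shard, shards))  # concurrent GETs
```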

Impact of inference performance for production AI systems

CoreWeave’s inference performance builds on the same architectural principles that power its industry-leading training results: purpose-built infrastructure, deep integration across the stack, and a relentless focus on performance under realistic conditions. Becoming an NVIDIA Exemplar Cloud is more than a technical milestone; it is a validation of trusted performance, transparency, and reliability, aligned with NVIDIA’s high performance standards, that allows organizations to scale AI with confidence.

For a global retail leader, this might mean deploying agentic customer service bots that can reason through complex returns and supply chain redirects in real time with predictable latency. In financial services, it provides the deterministic performance required for rapid fraud detection and high-frequency risk simulations, where every millisecond of time-to-first-token directly impacts the bottom line. By meeting or exceeding the NVIDIA Exemplar Cloud benchmarks for models like DeepSeek-R1 and Llama 3.3, CoreWeave provides the validated performance that leading organizations need for everything from autonomous logistics agents for global shipping to interactive digital twins for industrial manufacturing.

For teams deploying AI applications in production, these performance gains translate directly into operational advantages:

  • Faster and more responsive end-user experiences
  • Lower cost per inference through higher efficiency
  • Improved reliability during traffic spikes
  • Predictable performance as workloads scale

CoreWeave delivers the Essential Cloud for AI™

CoreWeave’s commitment to designing a purpose-built AI cloud enables pioneers to uniquely capitalize on the latest GPU architectures. By delivering best-in-class results across both training and inference, CoreWeave provides a unified AI cloud platform capable of supporting the entire AI lifecycle, from model development to global production deployment. From large language models to multimodal and real-time inference pipelines, CoreWeave enables production AI systems to operate with confidence at scale.

With CoreWeave Mission Control, the industry’s first operating standard for running AI at production scale, CoreWeave enables efficient fleet operations with full transparency, proven reliability, and deep insights. CoreWeave Cloud enables AI pioneers to deploy their most critical training and inference workloads on a platform verified by NVIDIA to deliver:

  • Reduced training and inference time: Groundbreaking performance results that lower the overall time and cost required for training and inference
  • Unrivaled reliability: Confidence that long-running jobs will complete without interruption, backed by a rigorously tested architecture
  • Performance transparency: The highest level of performance transparency and reproducibility validated by NVIDIA Exemplar Cloud 

The most recent Exemplar Cloud results underscore our continued collaboration with NVIDIA and CoreWeave’s ability to maximize efficiency for the most demanding training and inference workloads, reinforcing our position as the #1 AI cloud.

Please see the additional resources and join us for a webinar to learn more.
