Performance Benchmarks Report

Unlock greater AI performance at scale for your training runs

Explore how our large-scale training benchmarks delivered 20% higher MFU and 10x longer uptime on 1,024 NVIDIA H100 GPUs, sharply cutting both runtime and risk.

What does it mean to be purpose-built for AI?

In this 30-page report, Purpose-Built Cloud for AI at Scale: Achieving 20% Higher MFU and 10x Reliability on Thousand-GPU Clusters, the engineers who built the CoreWeave AI Cloud share how they trained a 30-billion-parameter large language model on NVIDIA H100 GPUs.


Head-to-head benchmarks

See how CoreWeave compares to industry leaders on performance, speed, and reliability of NVIDIA H100 GPU clusters.

Key performance and reliability metrics

Get a full breakdown of the metrics that determine real-world speed and reliability, including MFU, ETTR, and MTTF.
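
For orientation, the standard formulations of these three metrics are below. This is the common textbook framing; the report's exact definitions may differ in detail.

```latex
% Standard formulations (the report's precise definitions may differ in detail):
\[
\text{MFU}  = \frac{\text{achieved model FLOPs/s}}{\text{theoretical peak FLOPs/s}},\qquad
\text{ETTR} = \frac{\text{productive training time}}{\text{total wall-clock time}},\qquad
\text{MTTF} = \frac{\text{total run time}}{\text{number of failures}}
\]
```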

Architecture deep dive

Explore the impact of bare-metal GPU clusters, dual fabrics (NVIDIA Quantum InfiniBand networking and NVIDIA BlueField DPUs), SUNK (Slurm on Kubernetes) orchestration, and Tensorizer.

Best-practice playbook for scale

Learn practical workflow optimizations you can replicate, like health-check-driven node eviction, automated job re-queue, and tokenization.
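
To make the eviction-and-re-queue pattern concrete, here is a minimal Python sketch. It is illustrative only, not CoreWeave's implementation; the `Node` and `Scheduler` classes and everything in them are hypothetical stand-ins for the pattern the report describes.

```python
"""Illustrative sketch of health-check-driven node eviction with automatic
job re-queue. NOT CoreWeave's implementation; all names are hypothetical."""
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True  # real checks probe GPU ECC, fabric links, thermals


@dataclass
class Scheduler:
    nodes: list[Node]
    queue: list[str] = field(default_factory=list)
    running: dict[str, str] = field(default_factory=dict)  # job id -> node name

    def sweep(self) -> None:
        """Evict unhealthy nodes and automatically re-queue their jobs."""
        for node in list(self.nodes):  # copy: we mutate the list while sweeping
            if node.healthy:
                continue
            for job, assigned in list(self.running.items()):
                if assigned == node.name:
                    del self.running[job]
                    self.queue.append(job)  # job resumes from its last checkpoint
            self.nodes.remove(node)  # evicted: never schedule onto it again

    def schedule(self) -> None:
        """Drain the queue onto remaining healthy, idle nodes."""
        for node in self.nodes:
            if not self.queue:
                break
            if node.name not in self.running.values():
                self.running[self.queue.pop(0)] = node.name


# Demo: a flagged node is evicted and its job restarts elsewhere.
sched = Scheduler(nodes=[Node("gpu-0"), Node("gpu-1")])
sched.queue = ["llm-pretrain"]
sched.schedule()                # job lands on gpu-0
sched.nodes[0].healthy = False  # health check flags gpu-0
sched.sweep()                   # gpu-0 evicted, job re-queued
sched.schedule()                # job restarts on gpu-1
```

The point of the pattern is that a flagged node costs seconds of re-queue time rather than hours of manual triage.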

Fewer crashes. Better throughput. Minimal wasted time.

When it comes to AI innovation, infrastructure matters. The CoreWeave AI Cloud recently set a new bar for performance, redefining what’s possible when training on NVIDIA H100 GPUs. Imagine what we can do together.

Model FLOPs utilization (MFU): 51–52%

Effective training time ratio (ETTR): 97–98%

Mean time to failure (MTTF): 3.66 days


Explore more on this topic

Scale AI Training Without Slowdowns

Need the TL;DR? This solution brief distills the whitepaper’s metrics into clear business outcomes for your AI workloads.
September 4, 2025 | 5 min read

NVIDIA H100 GPU benchmark results: What we learned from large-scale GPU testing

Hear the lead author of the whitepaper discuss what these benchmarks mean for large-scale training.
August 26, 2025 | 5 min read

Why general-purpose cloud platforms throttle AI innovation

General-purpose cloud platforms weren’t built for AI. Discover why AI teams need purpose-built infrastructure to move faster and innovate without compromise.
August 22, 2025 | 3 min read

Performance that pays dividends

>50% MFU | 8x faster async checkpoint saves

CoreWeave’s cluster sustained >50% average MFU, delivering up to 20% more useful compute per dollar than the industry benchmark.

Async checkpoints powered by Tensorizer cut save time from 129 seconds to 17 seconds, so you can checkpoint more often without slowing training.
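
For a flavor of how an asynchronous save can overlap with training, here is a minimal sketch built on CoreWeave's open-source Tensorizer library. The snapshot-then-background-write structure is our illustration, not the benchmarked pipeline, and it assumes `tensorizer` exposes `TensorSerializer` with a `write_state_dict` method (`write_module` is the other documented entry point).

```python
"""Illustrative async-checkpoint sketch using Tensorizer. The threading
pattern is an assumption about how overlap can be structured, not the
exact pipeline benchmarked in the report."""
import threading

import torch
from tensorizer import TensorSerializer


def save_checkpoint_async(model: torch.nn.Module, path: str) -> threading.Thread:
    # Snapshot weights to CPU first so training can keep updating the live
    # GPU copy while serialization runs in the background.
    snapshot = {k: v.detach().to("cpu", copy=True)
                for k, v in model.state_dict().items()}

    def _write() -> None:
        serializer = TensorSerializer(path)  # accepts a path or file object
        serializer.write_state_dict(snapshot)
        serializer.close()

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # join() before the next save or at shutdown


# Usage:
#   handle = save_checkpoint_async(model, "step_1000.tensors")
#   ...continue training...
#   handle.join()
```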

Reliability that fuels speed

10x greater MTTF | >97% ETTR

The same training jobs that crash every eight hours on public baselines run 3.66 days on average on CoreWeave before any interruption—a 10x improvement in reliability.

With 97–98% ETTR, you get maximum productive training time with minimal waste from failures or interruptions.
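
For readers who want to connect the two reliability numbers, a rough back-of-the-envelope (ours, not the report's) relates MTTF and ETTR under a simple renewal model:

```latex
% Back-of-the-envelope (ours, not the report's): with MTTF of 3.66 days (~88 h),
% an ETTR of 97-98% implies roughly 1.8-2.7 h of lost time per failure.
\[
\text{ETTR} \approx \frac{\text{MTTF}}{\text{MTTF} + T_{\text{lost}}}
\;\Rightarrow\;
T_{\text{lost}} \approx \text{MTTF}\cdot\frac{1-\text{ETTR}}{\text{ETTR}}
\approx 88\,\text{h}\times\frac{0.02\text{–}0.03}{0.97\text{–}0.98}
\approx 1.8\text{–}2.7\,\text{h}
\]
```

In other words, at an 88-hour MTTF, keeping total lost time per failure under about three hours is enough to sustain 97–98% ETTR.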

Webinar

Hear the full story from the engineers who ran the benchmarks.

Join Distinguished Engineer Wes Brown and Product Manager Deok Filho as they pull back the curtain on the methodology, the surprises, and the hard-won optimizations that delivered up to 20% more throughput, 10x longer uptime, and 97–98% utilization.

Take the next step toward more efficient AI training.

Spin up your own NVIDIA H100 cluster on CoreWeave and fast-track your AI projects. Let’s start the conversation.
