SUNK (Slurm on Kubernetes) redefines the modern AI research cluster by unifying scheduling, reliability, and observability into a single production-grade training system.
In this solution brief, you’ll learn how SUNK enables:
- Up to 96% training goodput to maximize productive GPU time
- 97–98% effective training time ratio (ETTR) across multi-day runs
- 10× longer mean time to failure (MTTF) for thousand-GPU clusters
- Unified Slurm and Kubernetes workflows on the same underlying cluster
- Built-in observability and automated recovery through CoreWeave Mission Control
Free researchers to focus on model progress, not infrastructure coordination. See how SUNK delivers predictable performance, deep operational visibility, and simplified lifecycle management.


