SUNK Self-Service Demo

SUNK is designed for AI research teams running the most demanding training workloads, where job duration, scale, and failure tolerance make reliability and predictability as critical as raw performance. SUNK delivers a production-ready, researcher-first training system that abstracts infrastructure complexity while preserving the Slurm workflows researchers rely on.

And now you can spin up a SUNK cluster using SUNK Self-Service. In just one click, researchers and platform teams can get a unified training system able to handle the most critical workloads, without the operational burden.

1

00:00:03,520 --> 00:00:06,360

Hi, I'm Deok, a PM here at CoreWeave.

2

00:00:06,520 --> 00:00:07,920

Let me tell you about some exciting

3

00:00:07,920 --> 00:00:10,080

things we're doing with SUNK.

4

00:00:10,080 --> 00:00:11,880

SUNK Self-Service

5

00:00:11,880 --> 00:00:12,920

turns spinning up

6

00:00:12,920 --> 00:00:14,040

a Slurm-on-Kubernetes

7

00:00:14,040 --> 00:00:14,640

cluster

8

00:00:14,640 --> 00:00:16,320

from a week of infrastructure

9

00:00:16,320 --> 00:00:18,200

work into a few clicks.

10

00:00:18,200 --> 00:00:19,440

With only one click,

11

00:00:19,440 --> 00:00:20,160

all your nodes

12

00:00:20,160 --> 00:00:23,240

magically flow into Slurm, IAM users get

13

00:00:23,360 --> 00:00:24,240

SSH access,

14

00:00:24,240 --> 00:00:25,000

the control plane

15

00:00:25,000 --> 00:00:26,320

is automatically right-sized

16

00:00:26,320 --> 00:00:27,240

to your needs,

17

00:00:27,240 --> 00:00:28,560

and a shared file system spans

18

00:00:28,560 --> 00:00:30,120

your whole cluster.

19

00:00:30,120 --> 00:00:31,720

One click to get a production

20

00:00:31,720 --> 00:00:33,120

research cluster.

21

00:00:33,120 --> 00:00:34,880

Users can change the default

22

00:00:34,880 --> 00:00:36,440

to fit their needs:

23

00:00:36,440 --> 00:00:37,800

You can safely manage access

24

00:00:37,800 --> 00:00:38,400

through IAM

25

00:00:38,400 --> 00:00:39,360

groups, reduce

26

00:00:39,360 --> 00:00:42,360

CPU costs by turning off login pods,

27

00:00:42,360 --> 00:00:43,520

or drop into the YAML

28

00:00:43,520 --> 00:00:45,040

for advanced configs.

29

00:00:45,040 --> 00:00:46,160

There's no Helm charts,

30

00:00:46,160 --> 00:00:47,840

and there's no waiting.

31

00:00:47,840 --> 00:00:48,880

CoreWeave manages

32

00:00:48,880 --> 00:00:49,680

the end-to-end

33

00:00:49,680 --> 00:00:51,720

SUNK cluster lifecycle

34

00:00:51,720 --> 00:00:54,000

with automated upgrades and patches,

35

00:00:54,000 --> 00:00:55,520

so you can get the Slurm experience

36

00:00:55,520 --> 00:00:56,680

your researchers expect

37

00:00:56,680 --> 00:00:59,360

without owning the operational burden.

38

00:00:59,360 --> 00:01:01,040

Customers can also deploy SUNK

39

00:01:01,040 --> 00:01:03,600

through a Kubernetes custom resource.

40

00:01:03,600 --> 00:01:05,600

You edit the CR directly

41

00:01:05,600 --> 00:01:07,600

for advanced workflows

42

00:01:07,600 --> 00:01:09,120

so you can change things

43

00:01:09,120 --> 00:01:10,720

like your Slurm configurations,

44

00:01:10,720 --> 00:01:12,800

QOS settings, partitions.

45

00:01:12,800 --> 00:01:14,680

And because it's just a CR,

46

00:01:14,680 --> 00:01:16,360

it drops right into your existing

47

00:01:16,360 --> 00:01:17,640

GitOps workflow,

48

00:01:17,640 --> 00:01:18,760

so you can use a tool

49

00:01:18,760 --> 00:01:19,360

like Argo

50

00:01:19,360 --> 00:01:20,760

CD or whatever continuous

51

00:01:20,760 --> 00:01:22,720

delivery thing you use.

52

00:01:22,720 --> 00:01:23,720

You keep everything

53

00:01:23,720 --> 00:01:25,880

customers love about SUNK:

54

00:01:25,880 --> 00:01:27,000

the SUNK pod scheduler

55

00:01:27,000 --> 00:01:28,480

for unifying workloads

56

00:01:28,480 --> 00:01:30,600

like inference, sandboxes and training,

57

00:01:30,600 --> 00:01:32,400

in the same cluster

58

00:01:32,400 --> 00:01:34,520

driving up the utilization.

59

00:01:34,520 --> 00:01:35,520

You can also do

60

00:01:35,520 --> 00:01:36,600

the same prebuilt

61

00:01:36,600 --> 00:01:38,120

dashboards and custom metrics

62

00:01:38,120 --> 00:01:40,440

with our deep observability capabilities,

63

00:01:40,440 --> 00:01:41,280

and you get the benefit

64

00:01:41,280 --> 00:01:42,560

of the CoreWeave integration

65

00:01:42,560 --> 00:01:44,240

with health checks and burn-in tests.

66

00:01:45,240 --> 00:01:45,960

Finally, you get

67

00:01:45,960 --> 00:01:47,920

optimized job performance

68

00:01:47,920 --> 00:01:50,160

with topology-aware scheduling.

69

00:01:50,160 --> 00:01:52,840

So it's the same SUNK, same scheduler

70

00:01:52,840 --> 00:01:54,360

your researchers trust,

71

00:01:54,360 --> 00:01:56,400

but now self-service

72

00:01:56,400 --> 00:01:57,960

and managed on day one.