CoreWeave: Products & Services

AI storage and LLMs: 4 critical needs to look out for


AI storage systems are in high demand. That’s because AI innovators need highly performant storage that can quickly and efficiently read and write data at large scale for training and inference workloads.

Large language models take massive amounts of data to build, train, and deploy with reliability and accuracy. Billion- and trillion-parameter LLMs are now the norm, with Meta’s Llama 2 measuring 70 billion parameters and OpenAI’s GPT-4 measuring a whopping ~1.8 trillion parameters.

As LLMs become more complex, AI enterprises will need even more data to train the next leading model in the market. That means storage systems must be able to deliver, or risk bogging down training times, delaying deployments, and ultimately costing companies money and slowing iteration cycles.

That’s why we built CoreWeave AI Object Storage to fulfill four critical needs of GenAI workloads: fast data access, quick recovery and resiliency, scalability, and airtight security.

1. Fast data access

GenAI models require vast amounts of data to train, run inference, and continuously improve until they’re ready to deploy. The largest, most advanced models run billions or trillions of parameters across diverse data sets. Moving all that data at once can bog down load times, compromising performance and, ultimately, time to market. 

As a result, AI storage solutions must enable fast access to extremely large volumes of data across large numbers of GPUs.

When storage for AI enables high-speed data transfers, training applications that build LLMs can access and load the data sets they need to train faster and run workflows more efficiently.

Storage systems for AI need a direct path

At CoreWeave, we understand how fundamentally important fast data transfer is to LLM training and production. CoreWeave AI Object Storage includes the Local Object Transport Accelerator (LOTA), which helps enable high-speed connections between GPUs and the storage volumes where critical data lives.

With LOTA, AI teams get a more direct path between GPUs and their data. Our simple and secure proxy lives on GPU nodes and listens and responds to Object Storage data requests. LOTA accelerates responses by directly accessing data repositories, bypassing Object Storage gateways and indexes. LOTA also transparently caches data on local compute node storage, providing faster access for cached data and allowing for pre-staging.
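The caching behavior can be sketched in a few lines. This is an illustrative model of a read-through cache with pre-staging, not LOTA’s actual implementation or interface; all names here are hypothetical:

```python
# Minimal sketch of a transparent read-through cache: serve repeated reads
# from node-local storage and fall back to the object store only on a miss.
# Names are illustrative; this is not LOTA's actual interface.

class ReadThroughCache:
    def __init__(self, remote_store):
        self.remote = remote_store   # stands in for an object-storage client
        self.local = {}              # stands in for node-local NVMe cache
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.local:        # already cached on the compute node
            self.hits += 1
            return self.local[key]
        self.misses += 1
        data = self.remote[key]      # fetch from the object store
        self.local[key] = data       # cache transparently for next time
        return data

    def prestage(self, keys):
        """Warm the cache before a job starts (pre-staging)."""
        for key in keys:
            self.get(key)
```

Pre-staged objects are then served from local storage on every subsequent read, which is the effect that matters at training time.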

2. Quick recovery and resiliency

Job interruptions happen. When they do, it can be difficult for AI teams to get back on track due to how demanding checkpoint reading, writing, and recovery can be on storage. All actions must happen as quickly as possible to reduce costs incurred by GPU idle time.

Consider the I/O patterns of training a multi-billion-parameter model on 4,096 NVIDIA H100 GPUs, and how much work gets done in just two hours of training.

The I/O profile shows two clear patterns: an intense burst of read operations when loading data, and periodic spikes in write traffic corresponding to checkpointing operations. AI storage systems need to ensure resiliency and reliability during these phases specifically, to keep AI teams and their models on track.
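The checkpoint-and-restore cycle behind those write spikes can be sketched as follows. A plain dict stands in for the object store, the per-step update is a stand-in for real training work, and all names are illustrative:

```python
# Sketch of periodic checkpointing and resume-after-failure, the pattern
# behind the periodic write spikes during training. A dict stands in for
# the object store; names and logic are illustrative only.

def train(store, total_steps, checkpoint_every):
    """Run (or resume) a toy training loop, checkpointing periodically."""
    # Resume from the last checkpoint if one exists, else start fresh.
    state = store.get("latest", {"step": 0, "loss": None})
    for step in range(state["step"] + 1, total_steps + 1):
        state = {"step": step, "loss": 1.0 / step}  # stand-in for a real update
        if step % checkpoint_every == 0:
            store["latest"] = dict(state)           # burst of write traffic
    return state
```

After an interruption, calling `train` again picks up from the last checkpoint rather than step zero; the faster those checkpoint writes and the restore read complete, the less GPU time is wasted.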

Get better performance and observability

Better reading/writing performance helps users bounce back after job failures. With CoreWeave AI Object Storage, you’ll get:

  • Up to 2 gigabytes per second of throughput per GPU (GB/s/GPU)
  • 25 GB/s of throughput and 5,000 requests per second (RPS) per customer account for each 1 PB of reserved storage
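Those two figures imply a quick back-of-the-envelope sizing exercise, assuming throughput scales linearly with reserved capacity (real capacity planning should follow the product documentation):

```python
# Back-of-the-envelope sizing from the figures above: 2 GB/s per GPU, and
# 25 GB/s of throughput per 1 PB of reserved storage. Assumes linear
# scaling; illustrative only.
import math

GBPS_PER_GPU = 2           # GB/s of throughput per GPU
GBPS_PER_PB_RESERVED = 25  # GB/s unlocked per PB of reserved storage

def aggregate_throughput_gbps(num_gpus):
    """Peak aggregate throughput a cluster of this size could drive."""
    return num_gpus * GBPS_PER_GPU

def reserved_pb_needed(num_gpus):
    """PB of reserved storage to sustain the full per-GPU rate."""
    return math.ceil(aggregate_throughput_gbps(num_gpus) / GBPS_PER_PB_RESERVED)
```

For example, under these assumptions a 512-GPU job could drive 1,024 GB/s in aggregate, which would take roughly 41 PB of reserved capacity to sustain at the full per-GPU rate.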

CoreWeave AI Object Storage also includes observability and auditing practices that help your teams keep tabs on storage performance—nipping interruptions and issues in the bud. Plus, we deliver 99.9% uptime and eleven nines (99.999999999%) of durability, so your teams can count on top-tier reliability and get models to market ultra-fast.

3. Scalability

When working to build and train AI models, enterprises and labs alike can end up building out a significant GPU compute footprint. As models grow in complexity and parameter count, and datasets balloon in size, that expansion places immense pressure on storage infrastructure to keep pace.

AI workloads demand high-performance storage and the ability to handle massive volumes of data efficiently. Without an equally scalable and high-speed storage solution, even the most powerful GPUs can become bottlenecked by slow data access, reducing throughput and increasing costs.

Scalability in storage helps ensure that workloads and workflows remain fluid, which can mean lower latency and higher GPU utilization. 

CoreWeave AI Object Storage is ultra-scalable

At CoreWeave, we know that AI and ML workloads push the limits of current storage solutions. We built CoreWeave AI Object Storage to be scalable to hundreds of thousands of GPUs at a time, allowing AI teams to experience accelerated performance across a vast amount of compute.

Even at that massive scale, our horizontally scalable solution still allows users to experience performant object storage at 2 GB/s per GPU. That means faster, more consistent performance even with very compute-heavy jobs.
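One common way clients take advantage of a horizontally scalable store is to partition object keys deterministically across workers, so every node streams a disjoint shard of the dataset in parallel. A minimal sketch (illustrative, not CoreWeave-specific):

```python
# Sketch: deterministically partition object keys across N workers so each
# node reads a disjoint shard of the dataset in parallel. Illustrative only.
import hashlib

def shard_for(key, num_shards):
    """Stable assignment of an object key to a shard/worker."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def keys_for_worker(keys, worker_id, num_workers):
    """The subset of keys this worker is responsible for reading."""
    return [k for k in keys if shard_for(k, num_workers) == worker_id]
```

Because the assignment is a pure function of the key, every worker can compute its own shard independently, with no coordination and no overlap.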

4. Airtight security

Generative AI models train on a vast array of data. Much of that data is likely to be extremely sensitive and can include proprietary models, intellectual property, and even personal information. AI storage solutions are responsible for protecting that data from leaks and safeguarding it against malicious breaches or attacks.

Security processes and protocols that AI Object Storage solutions should implement include:

Encryption (at rest and in transit). Data encryption keeps sensitive information unreadable to anyone without the keys.

  • Encryption at rest protects stored data on servers. Even if physical disks are compromised and breached, data stays unreadable without the proper encryption keys.
  • Encryption in transit keeps data secure while it moves between nodes, servers, and cloud environments—preventing man-in-the-middle attacks.

Identity access management. Access to sensitive data should be monitored and safeguarded with intensive identity authentication.

  • Multi-factor authentication. Require multiple forms of identity authentication to ensure access requests are legitimate.
  • Role-based access controls. Give access to only those who need it, when they need it—limiting unnecessary data exposure.
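As an illustration of role-based access, an S3-style bucket policy can grant a training role read-only access to a single bucket. The bucket name and role ARN below are placeholders, and CoreWeave’s exact policy mechanics may differ; consult the product docs for the supported syntax:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TrainingReadOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/training-job" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-training-data",
        "arn:aws:s3:::example-training-data/*"
      ]
    }
  ]
}
```

The training role can read the dataset but cannot write, delete, or touch any other bucket—access limited to exactly who needs it, for exactly what they need.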

Keep best practices at the forefront

At CoreWeave, we build every solution with security front of mind. That’s why our Storage solutions follow industry best practices for security. Encryption at rest and in transit, identity and access management, authentication, and role-based access policies help ensure client data is protected and secured.

AI storage: It’s not an afterthought

Storage might seem like it doesn’t need to be AI-specialized to work successfully or support GenAI models sufficiently. In reality, AI storage solutions are an essential part of training models with better, faster, and smarter strategies.

LLMs need a storage solution that enables fast access from storage to GPUs, quick recovery after interruptions, and vetted security. Essentially, they need a storage solution that combines the benefits of object storage’s governance, management, and scalability with the speed and direct access of parallel filesystems.

We built CoreWeave AI Object Storage with that mindset from the start. Unlike legacy hyperscaler object storage, our solution is designed to keep pace with the demands of modern AI workloads, helping to ensure efficiency, speed, and scalability at every stage of the AI lifecycle.

At CoreWeave, storage isn’t just an afterthought. It’s a major key to getting cutting-edge models quickly to market at better price-to-performance ratios.

Learn more about CoreWeave storage benefits in our docs.
