TLDR: The Decart and Cerebrium partnership lets you process 1 million tokens of Llama 2 70B for just $0.50 – try out the API here! Decart built a proprietary LLM inference engine from scratch in C++ and NVIDIA CUDA to outperform existing engines, and Cerebrium built a cutting-edge serverless compute platform. Key to this achievement was leveraging NVIDIA H100 Tensor Core GPUs and the CUTLASS library, along with kernels written specifically for the H100 to keep latency low even at the unprecedented $0.50 price point. The partnership was supported by the strong developer ecosystem NVIDIA has created, which allows multiple solutions to be fused into NVIDIA-accelerated computing applications. A proof point has already been delivered on CoreWeave GPU infrastructure.
Large language models (LLMs) are becoming increasingly valuable in crafting solutions for both business and consumer use cases. However, factors such as cost and latency have made wide-scale deployment challenging.
OpenAI’s GPT-4 Turbo spans a price range from $10 to $30 per 1 million tokens, while GPT-3.5 Turbo falls between $1 and $2 per 1 million tokens. Anthropic offers Claude 2.1 at a cost between $11 and $32.70 and Claude Instant between $1.60 and $5.50. Others provide Llama 2 70B at $1-$2 per million tokens. How can we get Llama 2 70B for $0.50 per 1 million tokens (API) and still keep the latency incredibly low? The answer requires a combination of Decart’s proprietary LLM inference engine running on NVIDIA H100 Tensor Core GPUs and Cerebrium’s serverless infrastructure.
Cerebrium’s serverless GPU infrastructure platform lets companies scale from 0 to 10K requests in just a few seconds. It does this by optimizing cold-start times and loading models from disk to GPU 30% faster than Hugging Face, leading to large cost savings. Decart has developed an LLM inference engine from scratch in CUDA and C++ that outpaces existing engines at running open-source LLMs by a wide margin.
“Leveraging hardware capabilities introduced in the NVIDIA H100 GPU, such as thread block clusters, as well as new versions of CUTLASS, allows us to maintain the same Llama 2 70B cost per token generation that we achieve on NVIDIA A100 GPUs, while showing an almost 3x improvement in latency.”
Dean Leitersdorf, Decart CEO
The Decart inference engine uses unique paging techniques to maintain high efficiency even with hundreds or thousands of concurrent requests on a single GPU node. For instance, the typical paged attention mechanism used throughout open-source libraries is replaced by a system that uses the hardware's own internal memory-paging abilities. This optimization provides significant throughput speedups, even for very large models. To demonstrate this, we tested our tools on CoreWeave infrastructure running the largest current-gen LLM, Falcon 180B by TII, and achieved the following results compared to vLLM and TGI.
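For readers unfamiliar with the software baseline being replaced: paged attention, as used in open-source engines like vLLM, maintains a software page table that maps each request's logical KV-cache blocks to physical pages in a shared GPU memory pool. The sketch below illustrates that baseline data structure only; the names, page granularity, and structure are illustrative and are not Decart's implementation, which instead delegates this translation to the hardware's memory paging.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative software page table for a KV cache: each request's logical
// token blocks map to physical pages drawn from a shared pool. Every
// attention lookup pays for this software translation, which is the
// overhead hardware-level paging avoids.
constexpr int kTokensPerPage = 16;  // assumed page granularity

class KvPageTable {
public:
    explicit KvPageTable(uint32_t num_physical_pages) {
        for (uint32_t p = 0; p < num_physical_pages; ++p)
            free_pages_.push_back(p);
    }

    // Ensure the request owns enough pages to hold num_tokens tokens,
    // allocating lazily from the shared pool.
    bool Reserve(uint64_t request_id, int num_tokens) {
        auto& pages = tables_[request_id];
        const int needed = (num_tokens + kTokensPerPage - 1) / kTokensPerPage;
        while (static_cast<int>(pages.size()) < needed) {
            if (free_pages_.empty()) return false;  // pool exhausted
            pages.push_back(free_pages_.back());
            free_pages_.pop_back();
        }
        return true;
    }

    // Translate (request, token index) -> physical page for the kernel.
    uint32_t Lookup(uint64_t request_id, int token_idx) const {
        return tables_.at(request_id).at(token_idx / kTokensPerPage);
    }

    // Return a finished request's pages to the pool.
    void Release(uint64_t request_id) {
        for (uint32_t p : tables_[request_id]) free_pages_.push_back(p);
        tables_.erase(request_id);
    }

private:
    std::unordered_map<uint64_t, std::vector<uint32_t>> tables_;
    std::vector<uint32_t> free_pages_;
};
```

With hundreds or thousands of concurrent requests, every decode step performs many such lookups, which is why moving the translation into the hardware's paging machinery pays off at scale.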
When working with Llama 2 70B, we first optimized for throughput to ensure we could host the API at $0.50 per million tokens. Then, it was paramount to ensure low per-token latency. To do so, we ported our backend for Llama 2 70B from NVIDIA A100 GPUs to NVIDIA H100 GPUs, writing custom kernels that use the new hardware capabilities of the H100 to drastically reduce per-token latency. Ultimately, the following per-token latency guarantees were achieved for the Decart engine while keeping costs down enough to provide the API at $0.50 per million tokens.
To access the API and test the performance, please visit Decart here. If you would like to get in touch for model requests or private deployments, please reach out to [email protected] or [email protected].
DECART: Decart is designing the new AI stack as a platform to offer superior Gen AI performance across multiple dimensions, including quality, speed, hardware utilization, cost, reliability, and flexibility. We are already earning the trust of customers by enhancing the execution of their LLM inference workloads and delivering multipliers on inference speed. On top of the performance improvements we deliver now in production, we have a near-term roadmap to keep improving LLM performance consistently. In addition, Decart’s flow is compatible with multimodal LLMs, and we expect to be at the forefront of multimodal LLM performance at scale. To stay tuned for more releases currently in the pipeline, follow us on X.
CEREBRIUM: Cerebrium is a platform for deploying machine learning models to serverless GPUs with sub-5-second cold-start times. Customers typically see 40% cost savings compared to traditional cloud providers and can scale models to more than 10K requests per minute with minimal engineering overhead. Simply write your code in Python and Cerebrium takes care of all infrastructure and scaling. Cerebrium is used by companies and engineers from Twilio, Rudderstack, Matterport, and many more.
PARTNERSHIP: The collaboration between Decart and Cerebrium shows how combined expertise in LLM optimization and scalable, low-latency serverless GPU provisioning can lead to efficient, high-performing LLM applications. In line with the results above, this partnership will unlock better user experiences and unit economics for businesses, alleviating some of the pain of deploying LLMs at scale. Both Decart and Cerebrium are driving this partnership to enable the next million-user LLM-based applications.