The AI computing market is highly competitive, with Nvidia currently dominating due to its comprehensive plug-and-play ecosystem. However, as the industry matures, cost-efficiency is becoming increasingly important.
This investigation aims to identify the most price-efficient AI inference accelerators using vLLM and Llama-3.1 models. To ensure results reflect typical user experiences, no performance optimizations or tuning will be applied. The evaluation will use out-of-the-box configurations with common best practices and readily available on-demand pricing from cloud providers.
The goal is to provide a straightforward comparison of performance and cost-efficiency under normal conditions, helping users make informed decisions based on their needs and budget constraints.
We run the three flavors of Meta Llama 3.1 in their native 16-bit floating-point data type. We load the models onto each GPU, spin up a server using vLLM (v0.5.3), and send 100 random prompts with an average input length of 250 tokens. After an initial “warmup prompt”, the GPUs chug through all the prompts as fast as possible, and the average output token generation rate is measured.
We put seven popular GPUs through this benchmark: four from NVIDIA and three from AMD. We access the NVIDIA GPUs through popular on-demand cloud instances. The AMD GPUs are made available to MAKO through our cloud partners, Tensorwave and Cirrascale.
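For reference, here is a minimal sketch of the measurement loop described above, assuming a vLLM OpenAI-compatible server is already running locally; the endpoint, prompt list, token limits, and concurrency are illustrative rather than the exact harness we used.

```python
# Sketch only: assumes a vLLM (v0.5.3) OpenAI-compatible server was started with
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"
MODEL = "meta-llama/Meta-Llama-3.1-8B"
# Stand-in for the 100 random ~250-token prompts used in the benchmark.
PROMPTS = ["Explain the trade-offs of KV caching in LLM inference."] * 100


def complete(prompt: str) -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = requests.post(
        URL,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 256},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]


complete(PROMPTS[0])  # warmup prompt so one-time startup costs don't skew timing

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:  # submit all prompts concurrently
    output_tokens = sum(pool.map(complete, PROMPTS))
elapsed = time.perf_counter() - start

print(f"{output_tokens} output tokens in {elapsed:.1f} s -> {output_tokens / elapsed:.0f} tok/s")
```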
| | NVIDIA A10G | NVIDIA L40S | NVIDIA A100 SXM | NVIDIA H100 SXM | AMD MI210 | AMD MI250 | AMD MI300X |
|---|---|---|---|---|---|---|---|
| VRAM | 24 GB | 48 GB | 80 GB | 80 GB | 64 GB | 128 GB | 192 GB |
| Cost | $1.21/hr | $1.03/hr | $1.94/hr | $3.99/hr | $0.50/hr | $1.00/hr | $3.99/hr |
| Provider | AWS | RunPod | RunPod | RunPod | MAKO* | MAKO** | MAKO** |
*GPUs made available through Tensorwave
**GPUs made available through Cirrascale
Meta-Llama-3.1-8B model weights take about 14 GB of GPU memory and fit on every GPU tested. Results are shown for a single instance on one GPU.
Meta-Llama-3.1-70B model weights take closer to 140 GB of GPU memory. This requires that the model be split across two GPUs due to its size, except on the AMD MI300X. In some cases, memory constraints limited the full 128k context length. For this short-input benchmark, we trimmed the max context length to fit on two GPUs where applicable. A future benchmark will explore long context inference using more GPUs.
Meta-Llama-3.1-405B could only fit on a single node of 8x AMD MI300X GPUs; no other GPU type offered enough memory to hold the model within a single node.
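As a rough sanity check on which single-node configurations can even hold the 16-bit weights, here is a weights-only sketch; the node shapes and nominal parameter counts are assumptions, and KV cache plus runtime overhead are ignored, so it is optimistic.

```python
# Weights-only fit check: bytes_per_param * params vs. total node VRAM.
MODELS = {"Llama-3.1-70B": 70e9, "Llama-3.1-405B": 405e9}  # nominal parameter counts
NODES_GB = {"8x A100/H100 (80 GB)": 8 * 80, "8x MI300X (192 GB)": 8 * 192}
BYTES_PER_PARAM = 2  # native 16-bit weights

for model, params in MODELS.items():
    weights_gb = params * BYTES_PER_PARAM / 1e9
    for node, vram_gb in NODES_GB.items():
        verdict = "fits" if weights_gb <= vram_gb else "does not fit"
        print(f"{model} ({weights_gb:.0f} GB of weights) on {node}: {verdict}")
```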
| | | NVIDIA A10G | NVIDIA L40S | NVIDIA A100 SXM | NVIDIA H100 SXM | AMD MI210 | AMD MI250 | AMD MI300X |
|---|---|---|---|---|---|---|---|---|
| Tokens per second | Llama-8B | 509 | 783 | 1019 | 1296 | 499 | 598 | 913 |
| | Llama-70B | - | - | 295* (2x GPUs) | 476* (2x GPUs) | - | 181 (2x GPUs) | 253** (1x GPU) |
| | Llama-405B | - | - | - | - | - | - | 208 (8x GPUs) |
| Tokens per Dollar | Llama-8B | 1.51e06 | 2.74e06 | 1.89e06 | 1.17e06 | 3.59e06 | 2.15e06 | 8.24e05 |
| | Llama-70B | - | - | 2.74e05* (2x GPUs) | 2.15e05* (2x GPUs) | - | 3.26e05 (2x GPUs) | 2.28e05** (1x GPU) |
| | Llama-405B | - | - | - | - | - | - | 2.35e04 (8x GPUs) |
*The NVIDIA H100 and NVIDIA A100 implementations restricted context length to 10k, down from 128k, in order to fit the model onto two GPUs.
**The MI300X implementation restricted the context length to 125k, down from 128k, in order to fit the model onto a single GPU.
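Tokens per dollar follows directly from the two tables above: tokens per second × 3,600 seconds per hour ÷ (number of GPUs × hourly price per GPU). For example, the H100 SXM on Llama-8B works out to 1296 × 3600 / 3.99 ≈ 1.17e06 tokens per dollar, while the MI210 works out to 499 × 3600 / 0.50 ≈ 3.59e06.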
Speed Winner: NVIDIA H100 SXM
Cost Winner: AMD MI210
In terms of raw tokens per second, NVIDIA GPUs regularly outshine their AMD counterparts. In particular, the two fastest GPUs are the NVIDIA H100 and the NVIDIA A100, respectively.
Regarding price efficiency, the AMD MI210 reigns supreme as the most cost-effective accelerator for the small 8B-parameter model. Its low cost, coupled with high memory bandwidth, puts it in the top spot, followed by the NVIDIA L40S and the AMD MI250, respectively.
Speed Winner: NVIDIA H100
Cost Winner: AMD MI250
Running Llama-70B on two NVIDIA H100s produced the fastest results, although with an asterisk. The Llama-70B model weights alone take about 140 GB of the 160 GB of memory available across the two NVIDIA H100s (the same holds for the two NVIDIA A100s). To run the model, we had to restrict the context length to 10,000 tokens, down from a maximum of 128,000. We deemed this acceptable, as we’re only testing short input lengths in this benchmark.
The most cost-efficient GPU was the AMD MI250. With 128 GB of memory in each AMD MI250, the pair had 256 GB between them, enough to support the full 128k context length. When we do get to long-context benchmarking, we expect the cost advantage of the AMD MI250 to shine even more.
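To put the memory pressure in perspective: taking the 70B model's publicly documented architecture as given (80 transformer layers, 8 KV heads, 128-dimensional heads, values we treat as assumptions here), the FP16 KV cache costs about 2 × 2 × 128 × 8 × 80 ≈ 0.33 MB per token, so a single full 128k-token sequence would need roughly 43 GB on top of the ~140 GB of weights. That fits comfortably within the MI250 pair's 256 GB, but not within the H100 pair's 160 GB, which is why the context cap was necessary there.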
No Contest Winner: AMD MI300X
The MI300X was the only GPU that could support running Llama-405B in a single node, so it is our no-contest winner in both speed and cost. The 192 GB of HBM in each MI300X enables this, with a node totaling more than 1.5 TB of GPU memory.
Llama-405B needs about 800 GB of memory just for the model weights. Even without considering KV cache, this excludes all other GPU types in single node configurations. Having 8x NVIDIA H100s only gives us 640 GB of total GPU memory. This would work (with limited context length) if we were using 8-bit data types, but does not work for the native 16-bit data type of the model. You would need two nodes and a total of 16x NVIDIA H100s to run Llama-405B in this evaluation.
Unsurprisingly, NVIDIA GPUs regularly held the top spots in terms of raw tokens per second. Their mature software stack combines with a broad library of highly optimized and tuned GPU kernels to enable industry-leading off-the-shelf performance.
Interestingly, AMD GPUs swept the field when it comes to the most price-efficient accelerator. All of their datacenter-class GPUs have more memory capacity and bandwidth than their NVIDIA counterparts. This matters in memory-bandwidth-bound regimes, which is typically (but not always) the case for LLM inference.
To run Llama-70B at its full 128k context would have required 4x NVIDIA A100s, 4x NVIDIA H100s, 2x AMD MI300Xs, or 2x AMD MI250s. By maximizing the HBM capacity on each GPU, AMD is able to demonstrate a cost advantage over NVIDIA: it simply needs fewer GPUs to run the same models. NVIDIA makes up ground by outperforming the AMD GPUs in performance per GPU.
We load two components into GPU memory during LLM inference: model weights and KV cache. Model weights take up b*P bytes, where b is the number of bytes per parameter and P is the number of model parameters. For example, LLaMA 3.1 405B in FP8 precision takes up about b*P = 1 * 405B = 405 GB of memory.
The KV cache holds the key and value embeddings of past tokens from the attention layers. Since LLM inference is autoregressive, we avoid recomputation by storing and loading these embedding matrices. For every token in the KV cache, we store a key and a value embedding of size E. Assuming every attention layer has H KV heads and the model has L transformer layers, the KV cache holds 2*E*H*L parameters per token. If we are serving batch size B with an average of T tokens per request, the KV cache takes up 2*b*B*T*E*H*L bytes in total. For example, LLaMA 3.1 405B has E=128, H=8, and L=126. At batch size 128 with an average of 256 tokens per request, the KV cache takes up 2*2*128*256*128*8*126 bytes, which is roughly 17 GB in FP16 precision. Overall, LLM inference requires b*(P + 2*B*T*E*H*L) bytes.
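A minimal sketch of this bookkeeping in Python, with variable names matching the formulas above; the batch and sequence values are simply the 405B example from the text.

```python
def inference_memory_bytes(P, b, B, T, E, H, L):
    """Total inference memory per the model above: b * (P + 2*B*T*E*H*L)."""
    weights = b * P
    kv_cache = 2 * b * B * T * E * H * L
    return weights + kv_cache


# LLaMA 3.1 405B example from the text: E=128, H=8, L=126,
# batch size 128, average 256 tokens per request, 16-bit precision.
kv_gb = 2 * 2 * 128 * 256 * 128 * 8 * 126 / 1e9
total_gb = inference_memory_bytes(P=405e9, b=2, B=128, T=256, E=128, H=8, L=126) / 1e9
print(f"KV cache ~{kv_gb:.1f} GB, weights + KV cache ~{total_gb:.0f} GB in FP16")
```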
An LLM inference workload involves both memory loading and computation. Assuming the two operations are sufficiently overlapped, we model performance as the maximum of memory-loading time and computation time.
In the pre-filling phase, we load the model weights and perform approximately 2*P FLOPs per token. Thus, the latency is max(2*P*B*T / C, b*P / M), where C is the GPU compute throughput and M is the memory bandwidth. The pre-filling phase is memory-bound at low batch sizes and compute-bound at high batch sizes.
For every token in the decoding phase, we perform 2*P FLOPs and load the model weights and the KV cache. Thus, every token takes max(2*P / C, b*(P + 2*B*T*E*H*L) / M) seconds. The decoding phase is always memory-bound, due to the high compute-to-memory-bandwidth ratio of GPUs and the fact that the KV cache grows while per-token computation does not.
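Here is a sketch of this roofline model in Python; the hardware and model numbers in the example are illustrative assumptions, not specs of any GPU in the tables above.

```python
def prefill_latency_s(P, B, T, b, C, M):
    """Pre-fill: max of compute time for 2*P FLOPs over B*T tokens and weight-loading time."""
    return max(2 * P * B * T / C, b * P / M)


def decode_latency_s(P, B, T, E, H, L, b, C, M):
    """Decode: per-token max of compute time and time to load weights plus KV cache."""
    return max(2 * P / C, b * (P + 2 * B * T * E * H * L) / M)


# Illustrative assumptions: an 8B-parameter FP16 model (E=128, H=8, L=32) served at
# batch 32 with 1024 cached tokens per request, on a hypothetical GPU with
# C = 1e15 FLOP/s of 16-bit compute and M = 3e12 B/s of memory bandwidth.
P, b, B, T, E, H, L = 8e9, 2, 32, 1024, 128, 8, 32
C, M = 1e15, 3e12

print(f"prefill: {prefill_latency_s(P, B, T, b, C, M):.2f} s for the whole batch")
print(f"decode:  {decode_latency_s(P, B, T, E, H, L, b, C, M) * 1e3:.2f} ms per token")
```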
There are several out-of-scope tests that can and will be explored in future benchmarks.
Benchmarking using FP8 data types should be included in future testing. We expect this to give NVIDIA an advantage, as FP8 Llama-3.1 variants were available on NVIDIA hardware from day one; this was not the case for AMD. With FP8, a limited-context version of Llama-405B could be deployed onto 8x H100s or 8x A100s. Similarly, FP8 would let a user deploy Llama-405B on 3x AMD MI300X.
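The arithmetic behind those configurations: in FP8 the 405B weights shrink to roughly 405 GB, which fits under 8 × 80 GB = 640 GB of H100 or A100 memory and under 3 × 192 GB = 576 GB of MI300X memory, leaving the remaining capacity for the KV cache.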
Using an average input length of 250 tokens is convenient for benchmarking, but it is not representative of real-world workloads. We regularly see inputs in the range of 1,000 to 10,000 tokens, with outputs closer to 100 to 1,000 tokens; input-to-output token ratios of 100:1 and even 1,000:1 have been observed. A more interesting benchmark would use long-context inputs. The GPU memory advantage that AMD enjoys could make that kind of benchmark disproportionately favor its hardware in terms of cost efficiency.
There are many optimizations that could be applied, such as speculative decoding, chunked prefill, and GEMM tuning. We chose to ignore all of these for the sake of this benchmark, but it would still be interesting to see how fully-tuned, performance-maxxed versions of these tests would play out. We leave that to MLPerf.