Scaling the Memory Wall
This article was written by Lux Capital summer associate Dario Soatto.
We’ve witnessed a revolution in artificial intelligence over the past few years. In 2018, the state-of-the-art large language model (OpenAI’s GPT-1) could churn out words that only occasionally resembled sensible English sentences. Just six years later, GPT-4 is writing production-quality code and acing college exams. It’s difficult to overstate the speed and magnitude of AI development: a decade ago, cutting-edge models could barely tell cats from dogs; now, LLMs seem to steamroll every new benchmark we throw at them. Models have gone from roughly a toddler’s abilities to a college student’s faster than an actual human grows up.
How did this progress occur? In one word: scale. GPT-1 consisted of 117 million parameters trained on 4.5GB of text; GPT-4 was trained on terabytes of data and contains an estimated ~1.8 trillion parameters. The size of flagship LLMs has been increasing by over an order of magnitude each year, and although algorithmic improvements have played a role, we owe much of the credit to having more data and using it to train larger models. Despite some evidence that smaller models can perform well by spending more computation at inference time, scaling up models reliably leads to better performance in real-world use cases. As a result, model sizes will continue to grow apace.
But there’s a problem: we need new infrastructure to efficiently support the ever-growing demands of bigger models.
The Memory Wall
LLMs are computationally expensive: both training and inference require processors to perform thousands of trillions of operations as fast as possible. Completing these countless vector and matrix multiplications is a challenge, but there’s an even more daunting, oft-overlooked problem: to use a model with tens to thousands of billions of parameters, we need to store hundreds of gigabytes of data close to the compute resources and constantly shuttle it back and forth to feed the ALUs. Unfortunately, this task isn’t trivial.
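To make "hundreds of gigabytes" concrete, here is a minimal back-of-envelope sketch in Python; the parameter counts and precisions are illustrative assumptions, and real deployments also need memory for activations and KV caches on top of the weights.

```python
# Rough back-of-envelope: how much memory do a model's weights alone occupy?
# (Activations and KV caches add more on top in practice.)

def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

for name, n_params in [("70B model", 70e9), ("~1.8T model (GPT-4 estimate)", 1.8e12)]:
    for precision, nbytes in [("BF16", 2), ("INT8", 1)]:
        print(f"{name} @ {precision}: {weight_footprint_gb(n_params, nbytes):,.0f} GB")

# 70B model @ BF16: 140 GB,  @ INT8: 70 GB
# ~1.8T model @ BF16: 3,600 GB,  @ INT8: 1,800 GB
```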
The shared memory pool closest to the compute cores sits on the same chip and is typically composed of SRAM. SRAM, however, occupies a prohibitive amount of physical space: just one GB of SRAM and its associated control logic would take up roughly 400mm² of silicon, or about half the logic area of an Nvidia GPU, the most common type of chip for AI workloads. So SRAM would cost hundreds of dollars per GB, an uneconomical price.
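The "hundreds of dollars per GB" figure follows from simple die-cost arithmetic. The sketch below uses round, assumed inputs (a ~$17,000 leading-edge 300mm wafer, ~70,000mm² of usable area, an assumed yield) purely for illustration:

```python
# Illustrative die-cost arithmetic for on-chip SRAM (all inputs are rough assumptions).
WAFER_COST_USD = 17_000         # assumed cost of a leading-edge 300mm wafer
USABLE_WAFER_AREA_MM2 = 70_000  # ~usable silicon area on a 300mm wafer
SRAM_AREA_PER_GB_MM2 = 400      # figure cited above: ~400mm^2 per GB of SRAM + control logic
YIELD = 0.7                     # assumed fraction of good silicon

cost_per_mm2 = WAFER_COST_USD / (USABLE_WAFER_AREA_MM2 * YIELD)
cost_per_gb = cost_per_mm2 * SRAM_AREA_PER_GB_MM2
print(f"~${cost_per_gb:,.0f} per GB of on-chip SRAM (silicon cost alone)")
# ~$139 per GB before packaging, test, and margin, i.e. hundreds of dollars per GB at the system level
```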
The next layer out is DRAM: off-chip memory coupled to the processor. Although it’s two orders of magnitude cheaper than SRAM (single-digit dollars per GB instead of hundreds), GPUs are limited in the amount of DRAM they can connect to, and moving hundreds of gigabytes between DRAM and the chip is also slow.
In some hardware, including Nvidia GPUs, DRAM is built up as a 3D stack known as HBM (high-bandwidth memory). HBM3e, for example, provides over 1TB/s of bandwidth per stack. But HBM is limited as well: it has to be co-packaged along the processor’s shoreline, so memory capacity is constrained by the perimeter of the processor die. HBM is also about 10x more expensive than standard DRAM.
As a result, we don’t have a scalable way of increasing the amount of data we can bring to bear on compute — that’s the memory wall, and it’s bottlenecking AI.
The Bottleneck
Each type of computation has a well-defined arithmetic intensity: the ratio of the amount of computation performed to the quantity of data that computation requires. For example, if you multiply an N-dimensional vector by an N-by-N matrix, your memory footprint is roughly 2N² bytes (two bytes per brain floating-point number), and you need to perform 2N² operations (each of the N² matrix elements needs a multiply and an add). Therefore, the arithmetic intensity of vector-matrix multiplication is about 1, meaning your system needs to deliver one byte to its compute cores for every operation it performs. A matrix-matrix multiplication, on the other hand, has a footprint of roughly 4N² bytes but performs 2N³ operations, so its arithmetic intensity is on the order of N.
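A minimal sketch of that arithmetic, counting floating-point operations and bytes moved for the two cases (assuming 2-byte BF16 values and ignoring the comparatively small vector operands):

```python
# Arithmetic intensity = floating-point operations per byte of data moved.
# Counts assume 2-byte (BF16) values and ignore the small vector operands.

BYTES_PER_VALUE = 2

def vector_matrix_intensity(n: int) -> float:
    flops = 2 * n * n                      # one multiply + one add per matrix element
    bytes_moved = BYTES_PER_VALUE * n * n  # the N x N matrix dominates data traffic
    return flops / bytes_moved

def matrix_matrix_intensity(n: int) -> float:
    flops = 2 * n ** 3                         # 2N^3 multiply-adds
    bytes_moved = BYTES_PER_VALUE * 2 * n * n  # two N x N input matrices
    return flops / bytes_moved

for n in (1024, 8192):
    print(f"N={n}: vector-matrix ~{vector_matrix_intensity(n):.0f} FLOP/byte, "
          f"matrix-matrix ~{matrix_matrix_intensity(n):.0f} FLOP/byte")
# N=1024: vector-matrix ~1 FLOP/byte, matrix-matrix ~512 FLOP/byte
# N=8192: vector-matrix ~1 FLOP/byte, matrix-matrix ~4096 FLOP/byte
```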
The arithmetic intensity of a workload determines whether a hardware system is compute-limited or bandwidth-limited. If a system can’t deliver enough data to its processing cores to operate at maximum throughput, then the machine is bandwidth-limited; otherwise, it’s compute-limited.
Nvidia’s systems are severely bandwidth-limited. Although AI workloads are currently driving Nvidia’s growth, its GPUs were originally optimized for graphics processing (hence the name GPU). As the AI market picked up steam, Nvidia added AI-centered optimizations to its hardware, such as Tensor Cores. Between 2016 (P100) and 2024/5 (B200), Nvidia scaled half-precision ALU performance by roughly 100x, from 21 TeraFLOPS to 2.25 PetaFLOPS for dense tensors. Because of this iterative approach, coupled with Nvidia’s graphics legacy, memory capacity and memory bandwidth scaled by only 12x and 11x, respectively, over the same period. As a result, even though Nvidia GPUs’ rapid matrix multiplication puts them ahead of general-purpose hardware, they remain highly inefficient for AI.
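To see why, compare the hardware’s compute-to-bandwidth ratio (its "machine balance") with a workload’s arithmetic intensity. The roofline-style sketch below uses the 2.25 PFLOPS dense figure above and an assumed ~8 TB/s of HBM bandwidth for a B200-class part; the bandwidth number is an illustrative assumption, not a quoted spec.

```python
# Roofline-style check: a workload is bandwidth-limited whenever its arithmetic
# intensity falls below the hardware's machine balance (peak FLOPs / peak bytes per second).

PEAK_FLOPS = 2.25e15   # dense half-precision throughput cited above for a B200-class GPU
PEAK_BANDWIDTH = 8e12  # assumed ~8 TB/s of HBM bandwidth (illustrative figure)

machine_balance = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs the chip can do per byte delivered
print(f"machine balance ~{machine_balance:.0f} FLOP/byte")  # ~281 FLOP/byte

def limiting_factor(arithmetic_intensity: float) -> str:
    return "bandwidth-limited" if arithmetic_intensity < machine_balance else "compute-limited"

# Batch-1 LLM decoding is dominated by vector-matrix products (intensity ~1):
print("vector-matrix (intensity ~1):", limiting_factor(1))              # bandwidth-limited
print("large matrix-matrix (intensity ~4096):", limiting_factor(4096))  # compute-limited
```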
AI algorithms, including transformers, have an arithmetic intensity significantly lower (fewer computations per byte of data) than what Nvidia GPUs need to stay busy. With only 40MB and 50MB of on-chip memory, respectively, Nvidia’s A100 and H100 GPUs can’t store large models in on-chip SRAM. As a result, they rely on DRAM packaged as high-bandwidth memory (HBM) stacks co-packaged around their periphery, but each GPU has a limited “shoreline,” which constrains the amount of HBM it can access.
These limitations make Nvidia GPUs severely memory constrained, so using Nvidia’s architecture for AI inference requires dozens of GPUs working in concert, sharding the model and distributing the workload across GPUs that must constantly exchange data. This all-to-all communication is slow, power-inefficient, and requires expensive networking gear. Despite these complex techniques to overcome GPU limitations, and despite the best workload optimizations, the actual compute cores of Nvidia GPUs operate at around 60% utilization at best; the rest of the time they sit idle, waiting for data to move to and from compute resources.
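As a rough illustration of why multi-GPU sharding is unavoidable, here is a sketch with assumed per-GPU HBM capacities and a simple weights-plus-overhead model; real deployments also budget separately for KV caches, activations, and parallelism overheads, so actual counts run higher.

```python
import math

# Rough estimate of how many GPUs are needed just to hold a model's weights in HBM.
# Capacities and the overhead factor are assumptions for illustration.

HBM_PER_GPU_GB = {"A100": 80, "H100": 80}  # common 80GB SKUs
OVERHEAD = 1.2  # assumed headroom for KV cache, activations, and framework buffers

def gpus_needed(n_params: float, bytes_per_param: int, gpu: str) -> int:
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb * OVERHEAD / HBM_PER_GPU_GB[gpu])

print(gpus_needed(70e9, 2, "H100"))    # 70B-parameter model at BF16 -> ~3 GPUs
print(gpus_needed(1.8e12, 2, "H100"))  # ~1.8T-parameter model at BF16 -> ~54 GPUs
```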
The upshot is that our computing infrastructure can’t run at capacity, forcing us to use an inordinate amount of computational resources and drastically increasing the cost of AI.
The Cost of the Memory Wall
With memory bottlenecks limiting our supply of computational power, corporations and research outfits are constantly competing over the scarce supply of Nvidia’s GPUs. Moreover, AI is driving unprecedented electricity demand that threatens to overwhelm the grid: in the U.S. alone, demand from data centers is projected to increase from roughly 200 terawatt-hours in 2022 to 260 TWh in 2026, which would account for about 6% of all U.S. electricity use. Since U.S. power generation hasn’t budged in the past decade, we might not be ready for such a dramatic ramp-up.
In the past few years, the staggering cost of training AI models has dominated headlines: OpenAI spent over $100M to train GPT-4, and the Silicon Valley rumor mill is abuzz with talk of a $100B compute cluster for training.
While training draws the headline numbers, it’s actually the cost of inference that matters more. Although training costs are certainly hefty, software and hardware optimizations have decreased them by a factor of three over the past five years, and they’re one-time investments that get amortized over time. Inference costs, on the other hand, are recurring, and they’re expensive: OpenAI runs the equivalent of 350,000 Nvidia A100 servers exclusively dedicated to inference, and the company is projected to spend nearly $4 billion on inference in 2024, despite the heavily discounted compute prices it receives from Microsoft.
Much of this inference cost stems from the memory wall, and the problem of expensive inference is here to stay.
A growing body of research has demonstrated the potential of improving AI model performance by increasing compute use during inference with techniques such as pruning, resampling, Monte Carlo Tree Search, and chain-of-thought reasoning. OpenAI’s o1 model, which uses significantly more compute time at inference, has served as a powerful illustration of this new direction. As a result, we can expect that the ratio of compute needed for inference, as opposed to training, will only increase going forward. We’ll need to scale up our AI inference infrastructure.
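As one concrete illustration of that trade-off, here is a minimal best-of-N resampling sketch; the `generate` and `score` callables are hypothetical placeholders for a model call and a verifier, not any particular library’s API. Answer quality tends to improve with N, but inference compute grows linearly with it.

```python
# Best-of-N resampling: spend N times the inference compute, keep the best-scoring answer.
# `generate` and `score` are hypothetical placeholders for a model call and a reward/verifier model.
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the one the scorer likes best.

    Inference cost scales roughly linearly with n, which is why inference-time
    scaling shifts even more of the total compute load onto serving hardware.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```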
The Future
The shortcomings of Nvidia’s hardware aren’t lost on the Silicon Valley startup ecosystem: Nvidia, with its 75% gross margins, has a massive target on its back, and a number of startups are attempting to develop hardware that accelerates AI inference unconstrained by a legacy in graphics processing.
Unfortunately, the memory wall remains unsurmounted.
Cerebras, which is seeking to raise $1B in an IPO, has developed massive wafer-scale chips with 900,000 cores. Unfortunately, the company failed to anticipate the rapid growth of AI models, so its 44GB of on-chip memory, though an improvement on Nvidia, lacks the capacity for models with 100B+ parameters.
Similarly, Groq provides AI hardware with a variety of AI-specific optimizations that enable faster inference relative to Nvidia. Groq’s hardware, however, has only 230MB of on-chip memory and no separate high-bandwidth memory, so even running a medium-sized model, such as Llama-3 70B, requires hundreds of chips.
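The chip counts follow directly from the capacity math; a quick sketch below assumes 2-byte weights and ignores activation and KV-cache memory, so real deployments need even more chips.

```python
import math

# How many SRAM-only chips does it take just to hold a model's weights?
# Assumes 2-byte (BF16/FP16) weights; activations and KV caches push the count higher.
SRAM_PER_CHIP_GB = 0.23  # ~230MB of on-chip memory per chip, as cited above

def chips_to_hold(n_params: float, bytes_per_param: int = 2) -> int:
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / SRAM_PER_CHIP_GB)

print(chips_to_hold(70e9))  # 70B-parameter model at BF16 -> ~609 chips
```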
The Endgame
As models and context windows continue to grow and inference takes up a larger share of worldwide compute cycles, deploying AI at its full potential will necessitate the ability to perform inference with hardware that is fast, economical, and energy efficient. Hardware is hard, and taking on Nvidia only raises the bar. But with great problems come even greater opportunities — and the memory wall may be the greatest problem of the next few years.