In the HPC community the term
*Memory Wall*
was used in the late nineties to
describe the situation where flops are cheap while main memory bandwidth is
precious and scarce. The narrative was that it is much easier to increase the
instruction throughput of a processor architecture than its main memory
bandwidth. As we all know, things went differently: while there is a gap
between instruction throughput and memory bandwidth capabilities, it is
certainly not opening dramatically.
The view on processor architectures at that time was influenced by the fact
that most application codes were classical numerical codes, which are in most
cases limited by main memory bandwidth. The impression was that engineers
built processors no one (well, at least no one in the HPC community) could
fully exploit. Today the HPC application landscape is more diverse, and
multiple applications make full use of the instruction throughput capabilities
of processors, e.g., in the molecular dynamics domain.
Let's look at real data to investigate how the memory wall developed over
time. The following graph shows peak memory bandwidth (black line) and peak
double precision flop rates (red line) for one socket of Intel server chips
over the last 35 years. The y-axis (MB/s for the bandwidth and MFlops/s for
peak performance) is on a logarithmic scale! The dashed vertical line
separates processors of the single-core and multi-core eras. The number behind
the processor micro-architecture is the frequency. We chose top-bin variants
for every generation. Because modern processors run at various frequencies, we
use the applicable scalar or SIMD Turbo frequency.
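
As a back-of-the-envelope illustration of how such peak numbers are derived,
here is a minimal Python sketch. The parameter values are illustrative
(a hypothetical Skylake-SP-like top-bin socket), not the exact data set behind
the plot:

```python
# Illustrative peak-number arithmetic (example values, not the plot's data).

def peak_flops(cores, freq_ghz, simd_lanes, fma, fp_ports):
    """Peak double precision rate in GFlop/s.

    simd_lanes: DP elements per SIMD register (1 = scalar)
    fma:        2 with fused multiply-add, 1 without
    fp_ports:   superscalar FP execution ports per core
    """
    return cores * freq_ghz * simd_lanes * fma * fp_ports

def peak_bandwidth(channels, mtransfers_per_s):
    """Peak main memory bandwidth in GB/s (DDR: 8 bytes per transfer)."""
    return channels * mtransfers_per_s * 8 / 1000

print(peak_flops(28, 2.0, simd_lanes=8, fma=2, fp_ports=2))  # 1792.0 GFlop/s
print(peak_bandwidth(6, 2666))                               # ~128 GB/s
```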

Starting with the multi-core era, a gap opens between peak performance and
main memory bandwidth. Bumps in performance occur for Pentium 4, Sandy Bridge,
Haswell, and Skylake. Those processors introduced new features: data-parallel
SIMD processing (128-bit wide SSE, 256-bit wide AVX, and 512-bit wide AVX-512)
and fused multiply-add (FMA) instructions (a one-time stunt, added with the
Haswell processors).
Let's look at this plot again for scalar code not using these new features.
Now the performance is solely driven by the product of the number of cores,
frequency, and superscalar execution.
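
The same arithmetic with the SIMD and FMA factors stripped out makes the
difference concrete (again with the hypothetical Skylake-SP-like values from
the sketch above):

```python
# Scalar peak vs. SIMD+FMA peak for the same illustrative socket.
cores, freq_ghz, fp_ports = 28, 2.0, 2   # hypothetical top-bin values

simd_peak   = cores * freq_ghz * 8 * 2 * fp_ports  # 8 DP lanes (AVX-512), FMA
scalar_peak = cores * freq_ghz * 1 * 1 * fp_ports  # no SIMD, no FMA

print(simd_peak, scalar_peak, simd_peak / scalar_peak)  # 1792.0 112.0 16.0
```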

No gap opens anymore! Peak performance and memory
bandwidth develop in parallel over the years. What about all the cores that
were added? And there we arrive at the new wall processor architectures are
facing: the power wall. Processors are sophisticated heating plates; the
performance limit is governed by the total heat you can dissipate across the
tiny die surface. Due to this limitation, hardware architects trade frequency
for cores, from 3.6 GHz in the Pentium D down to 2 GHz nowadays. To keep up
the impression of a performance increase, the engineers silently raised the
TDP (thermal design power) from 96 W in Nehalem to 205 W in Cascade Lake. With
standard DRAM technology you can achieve more than 150 GB/s on one socket, and
with High Bandwidth Memory (HBM), used in some HPC chips (e.g., NEC Aurora and
Fujitsu A64FX), 1 TB/s and more is possible.
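
Relating the two limits gives the machine balance, i.e., memory bytes
available per peak flop. A minimal sketch, using the illustrative 1792 GFlop/s
peak from above (not a quoted figure) against the two bandwidth numbers:

```python
# Machine balance: memory bytes available per peak flop (illustrative values).
peak_gflops = 1792  # hypothetical SIMD+FMA peak from the sketches above

for name, bw_gbs in [("DDR4 socket (~150 GB/s)", 150),
                     ("HBM socket (~1 TB/s)", 1000)]:
    print(f"{name}: {bw_gbs / peak_gflops:.3f} B/F")
# DDR4 socket (~150 GB/s): 0.084 B/F
# HBM socket (~1 TB/s): 0.558 B/F
```

For comparison, a streaming kernel such as the STREAM triad requires on the
order of 10 B/F, so classical bandwidth-limited codes remain memory bound even
with HBM.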