Understanding High Performance Computing
“My hypothesis is that we can solve the software crisis in parallel computing, but only if we work from the algorithm down to the hardware — not the traditional hardware first mentality.” Tim Mattson, principal engineer at Intel.
In the past, two metrics have been the meaningful ones: transistor size (smaller is better) and CPU clock speed (higher is faster).
Moore's Law is a term many people have heard, often from people declaring that "Moore's Law is dead" (such as Jensen Huang). But what really is Moore's Law?
A CPU is a silicon die made of transistors, which are combined into logic gates. Theoretically, increasing the number of transistors should increase the performance of CPUs. Moore's Law states that the number of transistors on a single CPU die doubles roughly every 18-24 months.
To understand what this means, consider a silicon die that currently holds 15 million transistors. According to Moore's Law, in about two years the same die will be able to hold roughly 30 million. Since the die area stays fixed, doubling the count means each transistor's footprint must roughly halve, which is why process feature sizes keep shrinking (14nm, 10nm, 7nm, and so on).
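As a back-of-the-envelope sketch (the starting count and the 2-year doubling period are just the illustrative numbers from above), the compounding looks like this:

    # Illustrative Moore's Law projection: transistor count doubles every ~2 years.
    def project_transistors(initial_count, years, doubling_period=2.0):
        """Projected transistor count after `years`, doubling every `doubling_period` years."""
        return initial_count * 2 ** (years / doubling_period)

    print(project_transistors(15_000_000, 2))   # ~30 million after 2 years
    print(project_transistors(15_000_000, 10))  # ~480 million after a decade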
However, transistor scaling has been slowing down. Intel's 14nm process arrived in 2014. AMD's first 7nm chips did not arrive until 2019, five years later. 3nm parts came even more slowly, in 2023-2024, another four to five years on. Transistor shrinking is struggling to keep up the pace.
But what about CPU frequency?
In the past, CPU frequency was a strong measure of performance. CPUs clocked in the kHz range improved to 1 GHz, delivering massive performance uplifts. But today, frequency alone doesn't hold much weight except for overclockers.
It turns out that increasing frequency is also falling off. A CPU is powered by voltage, and the voltage generally has to rise along with the clock, so increasing frequency by 25% roughly doubles the CPU's power draw. That power turns into heat, making it difficult for a system to keep the CPU's temperature down. One traditional countermeasure is to run the CPU at a higher clock and then lower the voltage so it produces less heat; in modern parts, however, this lowering-voltage strategy provides minimal results. Just look at how undervolting a modern CPU such as the 13700K yields only minimal power savings before performance starts to suffer.
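To see where that rough doubling comes from: dynamic power scales roughly as P ≈ C × V² × f, and voltage generally has to rise along with frequency. A minimal sketch of the scaling, assuming voltage rises in proportion to frequency:

    # Dynamic power scales roughly as P ~ C * V^2 * f (C = switched capacitance).
    # Sketch assumption: voltage has to rise in proportion to frequency, so power
    # grows roughly with the cube of the clock speed.
    def relative_power(freq_scale, voltage_scale):
        """Relative dynamic power for given frequency and voltage scaling factors."""
        return voltage_scale ** 2 * freq_scale

    print(relative_power(1.25, 1.25))  # ~1.95: a 25% clock bump nearly doubles power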
This is known as the power wall: the maximum wattage a given area of silicon can dissipate in a practical way. Frequency increases have hit this wall, producing systems that struggle to stay within safe temperatures.
As a result, pushing frequency further is not the best path forward, which is why modern systems have topped out at peak core clocks around 5 GHz for the past several years.
The truth is that raw single-core processor improvement largely ended around 2003. So what is the future of high performance computing? How are systems still getting faster?
[Figure: microprocessor trend data. Note the plateau in sequential (single-thread) performance, clock frequency, and power; power flattens because dissipating more watts becomes impractical to cool.]
In the modern day, HPC relies heavily on parallelism. Some examples are:
- Data-level parallelism (DLP)
- Thread-level parallelism (TLP)
Data-level Parallelism. The idea is to split data into partitions, operate on each partition independently, and then concatenate the results. For example, adding two arrays can be sped up by splitting each array into 4 smaller chunks, adding the corresponding chunks all at once, and then merging the partial results back into a single array.
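A minimal Python sketch of this idea, assuming a 4-way split and a process pool (the function names and chunk count are only illustrative):

    from multiprocessing import Pool

    def add_chunks(chunk_pair):
        """Element-wise addition of one pair of chunks."""
        a_chunk, b_chunk = chunk_pair
        return [x + y for x, y in zip(a_chunk, b_chunk)]

    def parallel_add(a, b, parts=4):
        """Split both arrays into `parts` pieces, add each pair in parallel, then merge."""
        size = len(a)
        bounds = [(i * size // parts, (i + 1) * size // parts) for i in range(parts)]
        pairs = [(a[lo:hi], b[lo:hi]) for lo, hi in bounds]
        with Pool(parts) as pool:
            partials = pool.map(add_chunks, pairs)
        return [x for chunk in partials for x in chunk]  # concatenate partial results

    if __name__ == "__main__":
        a = list(range(1000))
        b = list(range(1000))
        print(parallel_add(a, b)[:5])  # [0, 2, 4, 6, 8]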
Thread-level Parallelism. The idea is to divide a single job across multiple threads. For example, suppose a web server receives 3 different requests; using 3 threads, it can service them all at once.
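A minimal sketch of the web-server scenario, assuming each request just waits on I/O for a second (handle_request and the request IDs are made up for illustration):

    from concurrent.futures import ThreadPoolExecutor
    import time

    def handle_request(request_id):
        """Pretend to service one request (e.g., block on I/O for one second)."""
        time.sleep(1)
        return f"request {request_id} done"

    # Three requests serviced by three threads: total wall time is ~1s instead of ~3s.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for result in pool.map(handle_request, [1, 2, 3]):
            print(result)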
Current issues with modern computing:
- Diminishing Returns on Attempts to Exploit Instruction-level Parallelism
- Power and heat issues
- Memory latency
Well, we have all this parallelism, but how do we know it actually improves performance, given that a parallel system is not directly comparable to our previous processors?
Welcome to benchmarks! Benchmarks provide us a way to compare CPU A to CPU B for a given task, no matter their architecture.
We have two common measurements (a small worked example follows below):
- MFLOPS (Millions of Floating Point Operations Per Second)
- MIPS (Millions of Instructions Per Second)
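As a quick worked example of these units (the instruction count, floating-point count, and runtime below are invented numbers): MIPS = instructions executed / (runtime × 10^6), and MFLOPS is the same idea counting only floating-point operations.

    # Hypothetical measurements from one benchmark run (numbers are assumptions).
    instructions_executed = 4_800_000_000   # total dynamic instructions
    float_ops_executed    = 1_200_000_000   # floating-point operations
    runtime_seconds       = 2.0

    mips   = instructions_executed / (runtime_seconds * 1e6)   # 2400 MIPS
    mflops = float_ops_executed    / (runtime_seconds * 1e6)   # 600 MFLOPS
    print(mips, mflops)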
Processor Performance Equation (PPE) for a given program:
CPU Time = Instructions/Program × Cycles/Instruction × Time/Cycle.
- Instructions/Program: Instructions actually executed (the dynamic count), not the static code size; determined by the algorithm, compiler, and ISA.
- Cycles/Instruction (CPI): The average number of CPU cycles it takes to execute an instruction; determined by the ISA and CPU organization.
- Time/Cycle: The duration of one clock cycle (the inverse of clock frequency); determined by process technology, hardware, circuit design, etc.
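A quick worked example of the equation above (the instruction count, CPI, and clock speed are invented numbers):

    # CPU time = Instructions/Program * Cycles/Instruction * Time/Cycle
    instructions = 2_000_000_000    # dynamic instructions executed (assumed)
    cpi          = 1.3              # average cycles per instruction (assumed)
    frequency_hz = 3.0e9            # 3 GHz clock, so Time/Cycle = 1 / frequency

    cpu_time = instructions * cpi * (1.0 / frequency_hz)
    print(f"{cpu_time:.3f} seconds")   # ~0.867 seconds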
Goal of HPC Benchmarking:
Minimize the product of all three factors (the total execution time), not just a single one.
Example:
Consider the following instruction mix: 43% of instructions take 1 cycle, 21% take 1 cycle, Stores (12%) take 2 cycles, and the remaining 24% take 2 cycles. Will decreasing the Store instruction cycle count to 1, at the sacrifice of a 15% slower clock, speed up or slow down the processor?
Old CPI: 0.43×1 + 0.21×1 + 0.12×2 + 0.24×2 = 1.36
New CPI: 0.43×1 + 0.21×1 + 0.12×1 + 0.24×2 = 1.24
We notice CPI decreases. What about actual speed?
Old CPI × old cycle time vs. new CPI × new cycle time (a 15% slower clock means the cycle time is roughly 15% longer):
1.36 × 1.00 vs. 1.24 × 1.15
1.36 vs. 1.43
Our change, while decreasing CPI, actually slows down the processor, so we do not make it. This highlights why we do not aim for a single metric in isolation.
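The same comparison as a small sketch (the non-Store class labels are not spelled out here; only their fractions and cycle counts from the mix above matter):

    # Instruction mix as (fraction of instructions, cycles each), taken from the example.
    # Only the Store entry (12%, 2 cycles) changes in the proposed design.
    old_mix = [(0.43, 1), (0.21, 1), (0.12, 2), (0.24, 2)]   # third entry = Stores
    new_mix = [(0.43, 1), (0.21, 1), (0.12, 1), (0.24, 2)]   # Stores now take 1 cycle

    def cpi(mix):
        """Average cycles per instruction for a weighted instruction mix."""
        return sum(frac * cycles for frac, cycles in mix)

    old_cpi, new_cpi = cpi(old_mix), cpi(new_mix)        # 1.36 vs 1.24
    old_time, new_time = old_cpi * 1.00, new_cpi * 1.15  # new cycle time is 15% longer
    print(old_cpi, new_cpi)      # CPI improves...
    print(old_time, new_time)    # ...but time per instruction gets worse (1.36 vs ~1.43)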
So, what else limits how fast hardware can compute?
Let's talk about Amdahl's law. It is a formula that gives an upper bound on the speedup obtainable by adding more resources (cores, memory, etc.).
- f : The fraction of the program that can run in parallel (e.g., the vectorizable parts).
- 1 - f : The fraction that is serial and must run sequentially.
- N : The speedup applied to the f portion (for example, perhaps 3 cores allow a 2.25x uplift compared to 1).
The formula is as follows:
Speedup(N) = 1 / ((1 - f) + f / N)
Note how even as N approaches infinity, the speedup approaches a ceiling of 1 / (1 - f) rather than growing without bound, as visualized by the graph below.
[Picture from Wikipedia: https://en.wikipedia.org/wiki/Amdahl%27s_law]
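A minimal sketch that evaluates the formula (the f = 0.9 value is just an illustrative assumption):

    def amdahl_speedup(f, n):
        """Overall speedup when a fraction f of the work gets an n-fold speedup."""
        return 1.0 / ((1.0 - f) + f / n)

    # Even with unlimited resources, a 90%-parallel program tops out at 1 / (1 - 0.9) = 10x.
    for n in (2, 4, 16, 1024, float("inf")):
        print(n, round(amdahl_speedup(0.9, n), 2))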
This concludes the concept of Amdahl's law. Are there any other limitations on computing? Yes, and they are covered further on.
Conclusion. We studied the limitations of computer hardware. We also learned about previous CPU technology and how it scaled, and studied various ways of comparing CPUs with benchmarks. We also learned about the ceilings of modern CPU methods.
The main takeaway is that modern HPC relies heavily on parallel computing systems and architecture.