The Next HPC Systems
The evolution of computing technology is nothing short of remarkable. Consider the Apollo Guidance Computer (AGC), first flown in 1966, which guided astronauts to the Moon while running at a mere 2 MHz. Fast forward to today, and even the smartphones in our pockets boast clock speeds of 2-4 GHz. This staggering leap illustrates not only advancements in clock speed but also a monumental increase in complexity and efficiency.
Yet, despite these advancements, the landscape of high-performance computing (HPC) is encountering significant challenges - in efficiency, adaptability, and real-world performance versus theoretical peaks - which force the industry to rethink how science can move forward.
The Not-So-Fast And Furious
Since the early days of high-performance computing, computer systems - starting with the ENIAC in 1945 - have used processors built around the von Neumann architecture: a central processing unit (CPU) that executes instructions from memory, maintains its own read/write memory state, and outputs results. The earliest processors used vacuum tubes, which were bulky, costly, and prone to failure. The subsequent revolution, sparked by the invention of the transistor, propelled the industry forward significantly. The Intel 4004, the first commercial microprocessor, housed 2,250 transistors and operated at 750 kHz. From that point, the race was on. The rapid growth in processing power continued throughout the late 20th century, with Moore's Law predicting - and generally holding - that transistor density would roughly double every two years.
Companies competed to shrink their transistors - from the micrometer scale down to today's 2 nm-class processes - packing more transistors onto each chip and making them more energy-efficient. As we entered the new millennium, transistor counts grew to the tens of millions, and today's HPC-class accelerators carry billions per chip - or over 100 billion, as is the case with NextSilicon.
Over time, it became clear that this pace of improvement could not be sustained. As the limits of frequency scaling were reached, the focus shifted towards innovations like hyper-threading and multi-core processing. This evolution paved the way for new advancements and a focus on parallelism, bringing us to a new era - that of Graphics Processing Units (GPUs).
The Great Shift
Today’s GPUs excel at parallel processing, offering many thousands of smaller cores designed to handle many tasks simultaneously. This architecture - originally designed for graphics rendering - has been adopted for machine learning, HPC, and other parallelizable workloads. As a result, raw clock speed has become a less critical metric than it was in the past.
This shift can be seen in CPUs as well. Modern processors have made significant architectural improvements - such as multi-threading, multi-core designs, vector instructions, and out-of-order execution - which allow them to complete more instructions per clock cycle (higher IPC). This is why a 4 GHz processor today vastly outperforms a 4 GHz processor from 2013, despite the identical clock speed. Benchmark scores are not just about frequency; they reflect how these enhancements enable more computational work to be done within each clock cycle.
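To make the contrast concrete, here is a minimal, illustrative sketch - not drawn from any particular product, and the file and function names are assumptions for illustration only. The same vector update is written twice: as a plain CPU loop, which a modern core accelerates through vectorization and out-of-order execution, and as a CUDA kernel, which spreads the iterations across thousands of GPU threads.

    // saxpy.cu - illustrative sketch only: computes y[i] = a * x[i] + y[i]
    #include <vector>
    #include <cuda_runtime.h>

    // CPU version: a single loop. A modern core auto-vectorizes this and
    // retires several of these operations per clock cycle (higher IPC).
    void saxpy_cpu(int n, float a, const float* x, float* y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // GPU version: the loop body becomes a kernel; each of thousands of
    // threads handles one element in parallel.
    __global__ void saxpy_gpu(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y_cpu(n, 2.0f), y_gpu(n, 2.0f);
        saxpy_cpu(n, 3.0f, x.data(), y_cpu.data());

        float *dx, *dy;                                        // explicit device copies
        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y_gpu.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);  // 4,096 blocks of 256 threads
        cudaMemcpy(y_gpu.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }

Compiled with nvcc, both versions leave every element equal to 5.0; the difference lies purely in how the work is scheduled onto the hardware.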
While CPU performance has continued to improve, it is clear that the shift to parallelism and the use of GPUs benefits many computationally intensive applications.
But while GPUs and parallelism may look promising compared to the traditional CPU, the question remains: can GPUs hold up against the growing demand for compute and data-intensive HPC and AI/ML workloads, or do they have limitations that are starting to hold back progress?
Speed Limits
The landscape of HPC is encountering significant challenges, particularly with the continued reliance on GPUs. Despite their parallel processing strengths for graphics rendering and machine learning tasks, they have several notable drawbacks.
GPUs are NOT a Solution for Every Workload: CPUs, optimized for serial processing and single-thread performance, still outperform GPUs on tasks that demand strong single-thread execution. Many applications - particularly those that rely on sequential processing, frequent branching (such as real-time decision-making), or irregular global memory access (like graph algorithms or particle simulations) - continue to rely heavily on CPUs for work that GPUs struggle to handle. This split forces developers to refactor and re-optimize applications so they can run across both CPUs and GPUs for different tasks.
Complex Code Refactoring and Optimization: Porting applications from CPUs to GPUs involves significant rewriting of code. Developers must redesign algorithms and data structures to fully exploit GPU parallelism - a process that is often slow and can take weeks or months of effort to reach acceptable performance. This complexity can delay projects and requires specialized skills in domain-specific languages (DSLs) and GPU programming models such as CUDA or OpenCL; the sketch following these points illustrates what even a trivial port involves.
Vendor Lock-In: Beyond the complexity they introduce, the DSLs and proprietary programming models most GPUs rely on are not portable across different architectures. This creates a risk of vendor lock-in, where organizations become dependent on a single vendor’s technology roadmap, limiting their ability to adapt to evolving needs or switch to alternative platforms. Adding to this stickiness, GPUs use their own memory management model, with layers such as shared memory, caches, and global memory. Developers must meticulously manage data movement and memory access patterns to avoid bottlenecks, increasing development complexity even further compared to the simpler memory model of CPUs.
Infrastructure Costs and Power Consumption: Integrating GPUs into existing data centers often requires upgrades to power, cooling, and networking infrastructure. Today's accelerated compute GPUs can already draw around 1,000 W, and it's not unheard of for future hardware to consume up to 2,000 W. In addition to the high acquisition costs of GPUs, organizations must contend with the ongoing expense of increased power consumption, which over time can even exceed the cost of the hardware itself.
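To make the refactoring and memory-management burden described above concrete, here is a hedged, illustrative sketch (the file, function, and size choices are assumptions, not taken from any real application). Summing an array is three lines on a CPU, where the cache hierarchy is managed automatically; the CUDA port of the same operation must stage data in per-block __shared__ memory, synchronize threads by hand, and add explicit device allocation and copies on the host - all expressed in a vendor-specific programming model.

    // reduce.cu - illustrative sketch of what a "simple" CPU-to-GPU port involves
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // CPU version: three lines; caches and prefetchers do the memory work.
    float sum_cpu(const std::vector<float>& v) {
        float s = 0.0f;
        for (float x : v) s += x;
        return s;
    }

    // GPU version: a tree reduction that explicitly manages the on-chip
    // __shared__ memory layer and synchronizes the threads of each block.
    __global__ void sum_gpu(const float* in, float* out, int n) {
        __shared__ float tile[256];                      // manually managed on-chip memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(out, tile[0]);                     // combine per-block partial sums
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> v(n, 1.0f);
        printf("CPU sum: %.1f\n", sum_cpu(v));

        float *d_in, *d_out, result = 0.0f;              // explicit allocation and data movement
        cudaMalloc((void**)&d_in, n * sizeof(float));
        cudaMalloc((void**)&d_out, sizeof(float));
        cudaMemcpy(d_in, v.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, &result, sizeof(float), cudaMemcpyHostToDevice);
        sum_gpu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("GPU sum: %.1f\n", result);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

Even this toy example roughly triples the amount of code, and every CUDA-specific construct in it (__shared__, __syncthreads, the launch syntax) would have to be rewritten to move to a different vendor's platform.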
Given these challenges, it’s clear that relying solely on GPUs is no longer a sustainable strategy for HPC companies. Instead, the industry requires a paradigm shift towards a new solution that embraces flexibility and efficiency - one that combines the best aspects of CPUs and GPUs without the constraints that have held back progress.
A Flexible Future
As we move forward, the goal is to foster a new era of intelligent acceleration - one that prioritizes adaptability, ease of developer adoption, and real-world results over adherence to outdated performance metrics. By innovating beyond the traditional constraints of Moore’s Law, we can redefine what’s possible in high-performance computing. I believe that we need to create a better system that leverages a flexible architecture to combine the strengths of CPUs and GPUs, turning what was once theory into reality.
Stay tuned for more insights as we explore this exciting journey into the future of computing!
About the Author:
Elad Raz is the founder and CEO of NextSilicon, a company pioneering a radically new approach to HPC architecture that drives the industry forward by solving its biggest, most fundamental problems.