Maverick-2: A Deeper Dive
Today, I’m excited to share for the first time the comprehensive details behind NextSilicon's Maverick-2, revealing our dataflow architecture innovations, our initial performance results, and our vision for the future we've been building toward.
This moment represents the culmination of eight years of relentless innovation and dedication. From our earliest days challenging the fundamental assumptions of computing architecture, to seeing Maverick-2 power production systems at world-class institutions like the Spectra supercomputer at Sandia National Laboratories, this journey has been extraordinary. I'm incredibly proud of the NextSilicon team whose brilliant engineering, unwavering commitment, and bold thinking made this revolutionary architecture a reality. Today, we're sharing proof that the future of computing is already here.
The computing industry has backed itself into a corner. Organizations building the next generation of AI and HPC applications face an impossible choice between three fundamentally limited options, and the limitations of each are becoming more apparent every day. I’m pleased to share a more in-depth look at how we've solved this fundamental challenge, along with our initial benchmark data that proves there's a better path forward.
The Three Imperfect Paths
Current computing architectures have served the industry remarkably well for decades, powering everything from personal computing to cloud infrastructure. But the explosive growth of AI and HPC workloads has exposed fundamental misalignments in what they can offer. Whether it's architectural inefficiency, programming complexity, or prohibitive costs and inflexibility—CPUs, GPUs, and ASICs each force organizations to compromise.
CPUs remain the flexible, universally programmable foundation of computing, but they're shackled by an 80-year-old Von Neumann architecture where roughly 98% of silicon is dedicated to control overhead—branch prediction, out-of-order logic, instruction handling—with only 2% performing actual computation. You're essentially paying for chips where most of the hardware doesn't solve your actual problems.
GPU-based accelerated infrastructure offers better parallel performance, but introduces significant complexity. To use GPUs effectively, you need specialized programming languages like CUDA and must manage intricate memory hierarchies and cache coherency. The result? Lengthy adoption cycles and applications that are locked into an ecosystem and difficult to port between acceleration platforms.
Fully workload-optimized ASICs, like the custom silicon hyperscalers build for specific AI tasks, deliver exceptional performance and efficiency for their target application. But they come with a brutal price tag: $150+ million investments, 3+ year development cycles, and hardware that becomes rigid and brittle the moment your workload evolves. This path is only viable for companies with massive internal deployment fleets and the deepest pockets. For everyone else, it's simply not an option.
For enterprises and organizations navigating this landscape, the question has been: which compromise can we live with?
Introducing the Alternative Path
Today, we're disclosing detailed architecture and performance results that prove there's a fourth option, one that doesn't require compromise.
Maverick-2, built on NextSilicon's Intelligent Compute Architecture (ICA™), leverages a novel dataflow hardware design that addresses the fundamental shortcomings of all three traditional approaches. And the results speak for themselves: up to 10x the performance of leading GPUs while consuming as much as 60% less power—achieved with unmodified, out-of-the-box code.
Unlike CPUs and GPUs, our non-Von Neumann dataflow architecture eliminates instruction handling overhead and memory bottlenecks entirely. Data availability drives computation, not the other way around. We've also shifted the silicon allocation ratio, devoting the majority of hardware real estate to actual computation rather than control overhead.
Unlike ASICs, our software-defined hardware adapts in real-time to different workloads. Whether you're running today's transformer models or tomorrow's breakthrough algorithms, Maverick-2 automatically reconfigures to optimize performance. You get the majority of workload-optimization benefits without sacrificing flexibility. As your applications and algorithms evolve, so does Maverick.
Our software-defined dataflow design means organizations can achieve near ‘ASIC-class’ performance and efficiency on their specific applications while maintaining the versatility to adapt as algorithms evolve. All without the prohibitive costs, multi-year timelines, or vendor lock-in that come with custom silicon development.
Why Dataflow? And Why Now?
To understand why Maverick-2 represents a fundamental breakthrough, we need to explain what dataflow computing actually is, and why it's been the ‘holy grail’ of computer architecture for decades.
What is Dataflow?
In a dataflow architecture, computation is driven by data availability rather than instruction sequences. Instead of a program counter stepping through instructions one by one (the Von Neumann model), a dataflow processor consists of a grid of computational units (ALUs) interconnected in a graph structure. Each unit is configured to perform a specific operation—addition, multiplication, logical operations, etc. When input data arrives at a unit, computation triggers automatically, and the result flows to the next unit in the graph.
Think of it this way: In a traditional processor, you have a cookbook (program) that you follow step-by-step, regardless of whether the ingredients (data) are ready. In a dataflow processor, each cooking station activates the moment its ingredients arrive, working in parallel with other stations. The recipe isn't executed sequentially, it flows naturally based on what's ready to be processed. Crucially, once a cooking station completes its task, it's immediately available to start the next order. This continuous flow eliminates idle cycles and maximizes throughput.
This fundamental difference has profound implications. Traditional processors spend most of their silicon managing this sequential execution—tracking instruction order, predicting branches, scheduling operations out-of-order to hide latency. Dataflow eliminates all of that overhead because there are no instructions to fetch, decode, or schedule. Data simply flows through compute.
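The firing rule described above can be sketched in a few lines of code. This is a toy illustration of data-driven execution in general, assuming nothing about NextSilicon's actual hardware: each node fires the moment all of its operands have arrived, with no program counter or instruction stream involved.

```python
# Toy sketch of dataflow execution: a node computes as soon as all of its
# inputs are present. The Node class and example graph are illustrative only.
results = []

class Node:
    def __init__(self, op, n_inputs, downstream=None):
        self.op = op                  # the operation this unit is configured for
        self.inputs = []              # operands that have arrived so far
        self.n_inputs = n_inputs      # how many operands trigger firing
        self.downstream = downstream  # where the result flows next

    def receive(self, value):
        self.inputs.append(value)
        if len(self.inputs) == self.n_inputs:   # data availability triggers compute
            out = self.op(*self.inputs)
            self.inputs = []                    # the unit is immediately free again
            if self.downstream:
                self.downstream.receive(out)
            else:
                results.append(out)

# Build the graph for (a + b) * (c + d): two adders feeding one multiplier.
mul = Node(lambda x, y: x * y, 2)
add1 = Node(lambda x, y: x + y, 2, mul)
add2 = Node(lambda x, y: x + y, 2, mul)

# Operands can arrive in any order; each node fires when its inputs are ready.
for node, value in [(add2, 4), (add1, 1), (add1, 2), (add2, 3)]:
    node.receive(value)

print(results)  # [21]  ->  (1 + 2) * (3 + 4)
```

Notice there is no step-by-step recipe anywhere: the order in which operands show up doesn't matter, and the two additions are free to happen independently, which is exactly the parallelism a dataflow grid exploits in hardware.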
Why is dataflow superior?
The advantages are compelling. By eliminating instruction handling overhead, dataflow architectures can dedicate the vast majority of silicon to actual computation. There's no branch prediction that can mispredict. No out-of-order execution logic consuming power and die area. No instruction caches competing with data for memory bandwidth. Just compute units processing data as it arrives.
For parallel workloads, which describes virtually all modern AI and HPC applications, dataflow's ability to exploit natural parallelism in the computational graph means hundreds of operations can execute simultaneously, limited only by data dependencies rather than artificial instruction serialization.
So why hasn't dataflow dominated computing?
The answer lies in a challenge that has plagued every dataflow attempt: programmability. Previous dataflow architectures required developers to completely rewrite applications using specialized spatial programming languages. You couldn't just run existing C++ or Python code; you had to fundamentally rethink how you expressed computation using dataflow graphs. This created an insurmountable adoption barrier. Even when dataflow hardware showed impressive performance in research labs, it remained trapped there because real-world enterprises couldn't justify rewriting millions of lines of production code.
NextSilicon solved the programmability problem.
Our Intelligent Compute Architecture combines the raw efficiency of dataflow hardware with a sophisticated software layer that makes it accessible. Our system identifies the most computationally intensive parts of your code and dynamically optimizes the hardware to accelerate them in real-time—no special programming languages, no manual optimization required.
Your existing applications just work.
The breakthrough isn't just the dataflow hardware, it's the intelligent software that makes dataflow practical for the first time. We've taken decades of theoretical advantages and made them accessible to real-world applications running real-world code.
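To make the general idea concrete, here is a deliberately simplified sketch of hot-path detection: profile where an application actually spends its activity, then flag the hottest region as the acceleration candidate. The counter-based profiler, function names, and threshold here are our own illustration, not NextSilicon's actual telemetry or optimization pipeline.

```python
# Toy illustration of hot-path detection: count invocations per function,
# then identify the hottest one as an offload candidate. Purely hypothetical.
from collections import Counter

call_counts = Counter()

def profiled(fn):
    """Wrap a function so every invocation is counted."""
    def wrapper(*args, **kwargs):
        call_counts[fn.__name__] += 1
        return fn(*args, **kwargs)
    return wrapper

@profiled
def dense_kernel(n):        # hypothetically hot: called in the inner loop
    return sum(i * i for i in range(n))

@profiled
def setup_logging(msg):     # cold path: called once at startup
    return f"[log] {msg}"

setup_logging("start")
for _ in range(1000):
    dense_kernel(64)

hottest, count = call_counts.most_common(1)[0]
print(hottest, count)  # dense_kernel 1000 -> candidate for acceleration
```

A real system would of course measure cycles rather than call counts and react continuously at runtime, but the principle is the same: optimization effort follows observed behavior, not programmer annotations.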
Proven Performance, Real-World Validation
Maverick-2 isn't vaporware or a roadmap promise. It's already running at dozens of customer sites worldwide, including Sandia National Laboratories in its Vanguard-II supercomputer. Significant updates on system-level performance testing are coming from our customers. In the meantime, we are pleased to share benchmark results from our internal testing, which demonstrate Maverick-2's competitive advantages across critical workloads. These are starting points for proof of performance, with improvements expected as scale testing continues:
GUPS (Giga-Updates Per Second): 32.6 GUPS at 460 watts—22x faster than CPUs and nearly 6x faster than GPUs for applications like high-throughput databases, agentic AI real-time decision making, and scattered-data AI inference.
HPCG (High-Performance Conjugate Gradients): 600 GFLOPS at 750 watts, matching leading GPU performance while consuming half the power in production. Notably, these results were achieved without the months of firmware tuning and BIOS optimization that competitors require.
PageRank: 10x higher graph analytics performance than leading GPUs. But here's what's even more remarkable: at large graph sizes (25GB+), leading GPUs failed to complete the benchmark entirely, while Maverick-2 processed them effortlessly, demonstrating the critical need for adaptive architectures capable of handling the complex workloads driving modern AI, social analytics, and network intelligence.
Every benchmark was achieved using unmodified, out-of-the-box application code: no specialized programming, no vendor-specific optimizations, no lengthy porting cycles. These represent our initial performance baselines. Expect improvements to these numbers along with additional benchmark results as we expand testing across these workloads and scale with customers.
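For readers unfamiliar with GUPS, the kernel it measures is tiny: random read-modify-write updates scattered across a table too large for any cache, so memory latency rather than arithmetic dominates. This is a minimal Python sketch of that access pattern, with toy sizes chosen for illustration rather than the full HPC Challenge configuration.

```python
# Minimal sketch of the GUPS access pattern: XOR updates at unpredictable
# addresses in a large table. Table size and update count are toy values.
import random
import time

TABLE_SIZE = 1 << 16          # real runs use tables far larger than any cache
N_UPDATES = 100_000

table = [0] * TABLE_SIZE
rng = random.Random(42)

start = time.perf_counter()
for _ in range(N_UPDATES):
    idx = rng.randrange(TABLE_SIZE)   # unpredictable address: caches can't help
    table[idx] ^= idx                 # read-modify-write update
elapsed = time.perf_counter() - start

gups = N_UPDATES / elapsed / 1e9      # giga-updates per second
print(f"{gups:.6f} GUPS in pure Python")
```

Because every update lands at an effectively random address, hardware that hides or tolerates memory latency well, rather than hardware with the most FLOPS, wins this benchmark.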
Four Keys to True Flexibility
What makes Maverick-2's approach fundamentally different? It's not just about raw performance—it's about delivering that performance in a way that actually works for real-world enterprises. The benchmark results we've shared today demonstrate that we're delivering on the promise of our Intelligent Compute Architecture, which combines four critical capabilities necessary for next-generation AI and HPC:
1. Drop-in Replacement: Run unmodified C++, Python, Fortran, CUDA, and AI framework code out of the box, eliminating costly porting cycles and vendor lock-in.
2. Adaptive, Real-Time Optimization: The architecture continuously monitors applications and dynamically reconfigures hardware in nanoseconds to accelerate the most critical code paths, getting smarter and faster over time.
3. Efficient at Scale: By dedicating the majority of silicon to computation rather than overhead, Maverick-2 delivers up to 10x performance improvements at half the power consumption.
4. Future-Proof Design: The software-defined dataflow approach adapts automatically to evolving AI models and HPC algorithms without requiring new specialized hardware units or chip redesigns.
One More Thing
You know the drill. There's always "one more thing."
When we started NextSilicon, we didn't just want to build another accelerator. We wanted to prove what true innovation looks like in this industry, creating architecture that evolves with algorithms, systems flexible enough to handle whatever comes next.
But here's what we learned building Maverick-2: Every breakthrough application has two fundamental requirements. There's massive parallel work that our dataflow grid handles seamlessly. And there's focused, serial control logic that needs to execute with lightning speed.
So we built our own RISC-V cores right into Maverick-2 to handle those serial code paths that can't be parallelized but still need to run fast. And you know what? Those cores performed so well that we started thinking: What would be possible if we built a dedicated test chip optimized purely for lightning-fast serial processing?
Today, I'm excited to unveil Arbel: an enterprise-grade RISC-V performance core built entirely from the ground up.
Arbel's Technical Foundation
Arbel delivers breakthrough performance through four key architectural innovations:
Massive instruction pipeline with 10-wide issue width and a 480-entry reorder buffer, allowing Arbel to see more of the problem at once and maximize core utilization.
Core frequency of 2.5 GHz delivers high single-thread performance while maintaining power efficiency.
Wide execution unit supporting 16 scalar instructions in parallel, plus four integrated 128-bit vector units for exceptional performance on data-parallel workloads.
Sophisticated memory subsystem with 64KB L1 cache and large shared L3, keeping data close and cores continuously fed, addressing the memory bandwidth and latency bottlenecks that constrain modern applications.
Elite TAGE branch predictor ensures faster, more accurate decision-making with fewer mispredictions and less wasted work.
This is real silicon built on TSMC's 5nm process—our own patented IP, not licensed or borrowed. Built by NextSilicon engineers for NextSilicon's vision of the future.
Strategic Exploration, Not Pivot
Now, you might be thinking: Is NextSilicon becoming a CPU company?
Not exactly. But we are exploring something much more interesting.
We're seeing tremendous customer interest in Arbel, and it's opened our eyes to the same opportunity that AMD and NVIDIA have recognized: the power of vertical integration between CPU and accelerator technologies. When you control both general-purpose computing and specialized acceleration, you can optimize the entire stack in ways that simply aren't possible when you're dependent on someone else's CPU architecture.
Arbel demonstrates NextSilicon's fundamental capability to innovate at the deepest levels of computing architecture. When customers partner with us, they're not just getting an accelerator, they're partnering with a company capable of building world-class processors that can solve problems others can't.
The Path Forward
For enterprises navigating the complexity of modern AI and HPC infrastructure, the computing trilemma is no longer inevitable. Maverick-2 represents the optimal balance: workload-optimized performance with universal programmability, ASIC-class efficiency without multi-year development cycles, and immediate acceleration without the vendor lock-in that has plagued the industry for decades.
With Maverick-2's dataflow architecture already transforming computing, and Arbel showcasing our ability to engineer world-class silicon from the ground up, we're proving that the future of computing doesn't require choosing which compromise you can live with. It requires rethinking the architecture from the ground up.
Want to dive deeper? Watch the full technical disclosure webinar featuring NextSilicon Founder and CEO Elad Raz, VP of Architecture Ilan Tayari, and VP of R&D Eyal Nagar as they walk through the architecture innovations, benchmark results, and roadmap that are redefining what's possible in AI and HPC computing.
Stay Connected
Stay tuned as we share more details in future blogs and announcements via our Nextletter.
About the Author:
Elad Raz is the founder and CEO of NextSilicon, a company pioneering a radically new approach to HPC architecture that drives the industry forward by solving its biggest, most fundamental problems.