Dataflow Architecture and AI - (2)
Rethinking Execution: From Instruction Streams to Dataflow Graphs
In a recent YouTube interview titled “Ilan Tayari, VP of Architecture, NextSilicon | Ian Interviews #48”, a fundamental architectural question was revisited: why do we compile programs into graphs, only to execute them as serialized instruction streams?
The discussion centers on NextSilicon’s non–Von Neumann, dataflow-based architecture. What makes the conversation particularly compelling is not just the claim of higher performance, but the structural critique of how modern CPUs and GPUs execute code.
As described in the interview:
“You take the intermediate representation, which is a graph, serialize it into an instruction stream, and then the processor reconstructs the graph.”
This statement captures a deep inefficiency in conventional processor design. Modern compilers lower source code into an intermediate representation (IR) that is fundamentally a dependency graph. However, that graph is then linearized into instructions. At runtime, hardware must rediscover dependency relationships dynamically using out-of-order execution, speculation, reorder buffers, and complex scheduling logic.
In effect, we:
- Construct a graph during compilation,
- Destroy it by serializing it into instructions,
- And then rebuild it in silicon at runtime.
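The round trip above can be made concrete with a toy sketch. This is an illustration invented for this article, not any real compiler or NextSilicon's toolchain: an expression starts as a dependency graph, gets flattened into a linear instruction stream, and the consumer of that stream must then rediscover the very edges the graph already encoded.

```python
# Toy illustration (hypothetical, not a real toolchain): the compiler's IR
# for d = (a + b) * (a - b) is a dependency graph.
graph = {
    "t1": ("add", ["a", "b"]),
    "t2": ("sub", ["a", "b"]),
    "d":  ("mul", ["t1", "t2"]),
}

# Lowering: serialize the graph into an instruction stream
# (any topological order of the graph will do).
stream = [("t1", "add", "a", "b"),
          ("t2", "sub", "a", "b"),
          ("d",  "mul", "t1", "t2")]

# What an out-of-order core effectively does at runtime: scan the stream
# and rebuild producer -> consumer edges so independent ops can issue together.
def rediscover_dependencies(stream):
    producer, edges = {}, []
    for dst, _op, *srcs in stream:
        for s in srcs:
            if s in producer:          # this source was produced earlier in the stream
                edges.append((producer[s], dst))
        producer[dst] = dst
    return edges

# The recovered edges match the original graph: t1 and t2 both feed d,
# and t1/t2 are mutually independent (they could issue in the same cycle).
print(rediscover_dependencies(stream))  # [('t1', 'd'), ('t2', 'd')]
```

The point of the sketch is that `rediscover_dependencies` recomputes information the compiler discarded at serialization time; in silicon, that recomputation is renaming, reorder buffers, and scheduling logic.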
The interview raises a direct and provocative question:
“Why do the lowering and then the lifting if you can just pass through that and bypass this transformation?”
From Instruction-Level to Graph-Level Execution
Traditional processors optimize instruction-level parallelism (ILP). Even wide superscalar cores are constrained by issue width and the dynamic scheduling window. When we discuss IPC values of 2, 4, or perhaps 6 and higher, we are still fundamentally consuming a serial instruction stream, overlapping only a handful of instructions at a time.
The dataflow model reframes the unit of execution. Instead of operating at the instruction level, it operates at the graph level. Dependencies are not inferred dynamically; they are embedded in the execution structure itself. The interview emphasizes this distinction clearly:
“It’s not two IPCs, 10 IPC. It’s the whole loop every cycle.”
The claim is not simply about wider issue. It is about sustaining throughput across the entire loop body by pipelining through the graph. If the graph exposes sufficient parallelism, functional units can remain active every cycle without repeatedly reconstructing dependency information.
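The "whole loop every cycle" claim can be sketched as a minimal dataflow-firing simulation. This is a deliberately simplified model, not a description of real hardware: the loop body is a short chain of nodes, each node fires as soon as its operand arrives, and once the pipeline fills, every node is busy each cycle and one full iteration completes per cycle.

```python
# Minimal dataflow-firing sketch (an illustration under simplifying
# assumptions: single-cycle nodes, a 3-node loop body, no stalls).

def simulate(n_iters):
    # Loop body as a 3-node chain: load -> add -> store.
    # stage[i] holds the iteration index currently occupying node i.
    stage = [None, None, None]
    completed, cycle, next_iter = 0, 0, 0
    while completed < n_iters:
        cycle += 1
        # All three nodes fire in the same cycle (graph-level parallelism):
        if stage[2] is not None:
            completed += 1            # store retires an iteration
        stage[2] = stage[1]           # add's result flows to store
        stage[1] = stage[0]           # load's result flows to add
        stage[0] = next_iter if next_iter < n_iters else None
        next_iter += 1
    return cycle

# 100 iterations take 103 cycles: 3 cycles to fill the pipeline,
# then one whole loop body completing every cycle thereafter.
print(simulate(100))  # 103
```

The steady state, not the fill latency, is what matters: throughput approaches one iteration per cycle regardless of how many operations the loop body contains, provided the graph exposes that much parallelism.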
Conceptually:
- CPUs execute instructions and infer dependencies.
- Dataflow processors execute dependencies directly.

This shift in abstraction is substantial.
Efficiency Through Structural Simplification
A significant portion of CPU and GPU microarchitecture exists to support instruction-stream execution: instruction fetch units, instruction caches, decode logic, register renaming, reorder buffers, branch prediction, and speculative control.
In the interview, this overhead is explicitly contrasted with the dataflow approach:
“We don’t have an instruction fetch unit. We don’t have an instruction cache. We don’t need all of that.”
Eliminating these components does more than simplify the pipeline. It reallocates silicon area and power budget toward arithmetic units, memory bandwidth, and concurrency mechanisms. The potential result is improved energy efficiency and higher sustained utilization—particularly for workloads with rich dependency structures, such as HPC kernels and certain AI computations.
The Hardware–Software Boundary
It is important to clarify that dataflow execution does not eliminate the need for structured parallel programming. Parallelism must still be expressed in the source code. The architecture does not invent concurrency; it scales what is present.
However, once parallelism is represented in the IR graph, a dataflow processor can exploit it without collapsing it into a serialized abstraction. This creates a more natural alignment between compiler representation and hardware execution model.
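The boundary can be illustrated with a small dependence-analysis sketch. This is a hypothetical helper written for this article, not a real compiler pass: an elementwise loop has independent iterations (a wide graph the hardware can scale), while a prefix-sum loop carries a dependency from each iteration to the next (a serial chain that no execution model can widen without rewriting the algorithm).

```python
# Hedged sketch: the hardware scales only the parallelism the source exposes.

def iteration_dependencies(body, n):
    """Return loop-carried edges (i -> j) where iteration j reads what i wrote."""
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if body["writes"](i) & body["reads"](j):
                edges.append((i, j))
    return edges

n = 4
# y[i] = a[i] * x[i]  -- each iteration touches only its own elements.
elementwise = {"writes": lambda i: {("y", i)},
               "reads":  lambda i: {("a", i), ("x", i)}}
# s[i] = s[i-1] + x[i] -- iteration i reads what iteration i-1 wrote.
prefix_sum  = {"writes": lambda i: {("s", i)},
               "reads":  lambda i: {("s", i - 1), ("x", i)}}

print(iteration_dependencies(elementwise, n))  # [] -> fully parallel graph
print(iteration_dependencies(prefix_sum, n))   # [(0, 1), (1, 2), (2, 3)] -> a chain
```

The elementwise loop's empty edge list is what a dataflow processor can exploit directly from the IR graph; the prefix-sum chain remains serial under any architecture.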
Specialization vs. General-Purpose Dataflow
The interview also acknowledges an important nuance: a highly specialized accelerator—for example, one designed exclusively for LLM inference using reduced precision and domain-specific datapaths—may outperform a general-purpose dataflow processor for that narrow workload.
Specialization always yields peak efficiency within a defined scope.
Yet the broader argument remains compelling. A general-purpose dataflow processor can still deliver significant efficiency improvements over CPUs and GPUs while maintaining flexibility across heterogeneous HPC and AI workloads. If the same architectural principles were applied to domain-specific designs, the efficiency gains could be even more pronounced.
Concluding Perspective
For decades, performance improvements have largely been incremental refinements within the Von Neumann abstraction: deeper pipelines, wider SIMD, larger reorder buffers, and more aggressive speculation. These techniques optimize around the instruction-stream model.
Dataflow architectures challenge that model itself.
Since compilers already operate on dependency graphs, and since that structure is known at compile time, executing the graph directly in hardware removes a layer of structural inefficiency. The significance of this shift lies not merely in higher IPC, but in redefining the execution abstraction to better match the nature of computation.
If realized effectively, dataflow execution represents not just an optimization, but a structural evolution in processor architecture.