The Limits of a Decoupled Out-of-Order Superscalar Architecture
Graham Jones
Graduated Ph.D. July 1999
Abstract
This thesis presents a study into a technique for improving performance
in out-of-order superscalar architectures. It identifies three
technological trends limiting superscalar performance: the
increasing cost of a main memory access, control dependencies, and the
greater hardware complexity of out-of-order execution.
Decoupling is a technique that can provide higher performance by means of
asynchronous instruction streams that reorder dynamically. It offers the
capability to improve ILP through effective latency hiding and dynamic
scheduling, and to reduce hardware complexity through decentralised
logic. This thesis evaluates this capability by investigating
the effectiveness of decoupled out-of-order superscalar architectures.
This thesis identifies the degree to which operations can reorder (the
degree of reordering) as the critical dimension of an out-of-order
superscalar architecture. It investigates the effectiveness of
decoupling by focusing on those design issues that determine the degree
of reordering, and relaxes all other architectural constraints. This
approach allows us to establish the limitations of decoupled out-of-order
superscalar architectures.
This thesis shows that a decoupled architecture, through its dynamically
reordering instruction windows, provides a possible solution to the
problems of latency hiding and issue-logic complexity. This thesis
demonstrates that for large memory latencies, a decoupled architecture
with 2 instruction streams is less sensitive to increases in memory
latency than a conventional single-stream superscalar architecture.
The results also show that for memory latencies greater than 20 cycles,
a decoupled architecture can achieve a higher speedup than a conventional
superscalar architecture with twice the window size of an individual
decoupled unit. An explanation for this effect is provided through the
concept of the Effective Window Size.
The thesis also investigates a 3-stream decoupled superscalar architecture
that provides dedicated hardware support for resolving control dependencies.
The results show that for the partitioning algorithm used in this thesis,
the load balancing is poor and the extra hardware resources are
underutilised. For this reason, the majority of the thesis focuses on a
2-stream decoupled architecture.