The Limits of a Decoupled Out-of-Order Superscalar Architecture

Graham Jones

Graduated Ph.D. July 1999

Abstract

This thesis presents a study of a technique for improving performance in out-of-order superscalar architectures. It identifies three technological trends that limit superscalar performance: the increasing cost of a main memory access, control dependencies, and the greater hardware complexity of out-of-order execution.

Decoupling is a technique that can provide higher performance through dynamically reordering, asynchronous instruction streams. It offers the capability to improve ILP, through effective latency hiding and dynamic scheduling, and to reduce hardware complexity, through decentralised logic. This thesis evaluates this capability by investigating the effectiveness of decoupled out-of-order superscalar architectures.

This thesis identifies the degree to which operations can reorder (the degree of reordering) as the critical dimension of an out-of-order superscalar architecture. It investigates the effectiveness of decoupling by focusing on the design issues that determine the degree of reordering, and relaxes all other architectural constraints. This approach allows us to establish the limitations of decoupled out-of-order superscalar architectures.

This thesis shows that a decoupled architecture, through its dynamically reordering instruction windows, provides a possible solution to the problems of latency hiding and issue logic complexity. It demonstrates that, for large memory latencies, a decoupled architecture with 2 instruction streams is less sensitive to increases in memory latency than a conventional single-stream superscalar architecture. The results also show that for memory latencies greater than 20 cycles, a decoupled architecture can achieve a higher speedup than a conventional superscalar architecture with twice the individual window sizes of a decoupled unit. An explanation for this effect is provided through the concept of the Effective Window Size.

The thesis also investigates a 3-stream decoupled superscalar architecture that provides dedicated hardware support for resolving control dependencies. The results show that, for the partitioning algorithm used in this thesis, the load balancing is poor and the extra hardware resources are underutilised. For this reason, the majority of the thesis focuses on a 2-stream decoupled architecture.