Instruction-Level Parallelism: Instruction Prefetch
Break up the fetch-execute cycle and do the two in parallel.
This dates to the IBM Stretch (1959).
The prefetch buffer is implemented in the CPU with on-chip registers, either as a single register or as a queue.
The CDC 6600 used a prefetch queue (its instruction stack) of eight words.
Think of the prefetch buffer as containing the IR (Instruction Register).
When the execution of one instruction completes, the next one is already in the buffer and does not need to be fetched.
Naturally, a program branch (loop structure, conditional branch, etc.) invalidates the contents of the prefetch buffer, which must then be reloaded.
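
To make the mechanism concrete, here is a minimal sketch in C, assuming a made-up one-slot buffer and a toy instruction encoding (not modelled on any real machine): the next sequential instruction is loaded while the current one executes, and a taken branch invalidates the buffer so the target must be fetched directly.

    /* Sketch of a one-slot prefetch buffer.  The Instr encoding, the
     * tiny program, and the cycle count are invented for illustration. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MEM_SIZE 16

    typedef struct {
        int opcode;   /* 0 = ordinary instruction, 1 = branch */
        int target;   /* branch target, used when opcode == 1 */
    } Instr;

    int main(void) {
        Instr mem[MEM_SIZE] = {
            {0, 0}, {0, 0}, {1, 6}, {0, 0},   /* instruction 2 branches to 6 */
            {0, 0}, {0, 0}, {0, 0}, {1, 0},   /* instruction 7 branches back to 0 */
        };

        int   pc = 0;
        Instr buffer = {0, 0};     /* the prefetch buffer (one slot) */
        bool  buffer_valid = false;

        for (int cycle = 0; cycle < 10; cycle++) {
            /* Execute stage: use the prefetched instruction if it is valid,
             * otherwise pay the penalty of fetching directly from memory. */
            Instr cur;
            if (buffer_valid) {
                cur = buffer;
                buffer_valid = false;
            } else {
                cur = mem[pc];
                printf("cycle %d: buffer empty, fetching %d from memory\n", cycle, pc);
            }

            /* Fetch stage, in parallel: prefetch the next sequential instruction. */
            int next_pc = pc + 1;
            buffer = mem[next_pc];
            buffer_valid = true;

            /* A taken branch invalidates what was just prefetched. */
            if (cur.opcode == 1) {
                next_pc = cur.target;
                buffer_valid = false;   /* flush the prefetch buffer */
                printf("cycle %d: branch at %d taken to %d, buffer flushed\n",
                       cycle, pc, next_pc);
            } else {
                printf("cycle %d: executed instruction at %d\n", cycle, pc);
            }
            pc = next_pc;
        }
        return 0;
    }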
Instruction-Level Parallelism: Pipelining
Pipelining is better considered as an “assembly line”: each instruction passes through a series of stages, and different instructions occupy different stages at the same time.
Note that throughput is distinct from the time required to execute a single instruction. With a five-stage pipeline, one instruction completes per stage time once the pipeline is full, so the throughput is five times the single-instruction rate.
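
The arithmetic behind that claim can be worked out in a few lines; the stage count, stage time, and instruction count below are illustrative values, not figures for any specific CPU.

    /* Latency vs. throughput for an ideal k-stage pipeline with stage time t. */
    #include <stdio.h>

    int main(void) {
        const int    k = 5;      /* pipeline stages */
        const double t = 2.0;    /* time per stage, ns */
        const long   n = 1000;   /* instructions executed */

        double latency    = k * t;             /* time for one instruction, ns */
        double total      = (k + n - 1) * t;   /* time for n instructions, ns */
        double throughput = n / total;         /* instructions per ns */

        printf("single-instruction latency : %.1f ns\n", latency);
        printf("time for %ld instructions  : %.1f ns\n", n, total);
        printf("throughput                 : %.3f instr/ns (limit 1/t = %.3f)\n",
               throughput, 1.0 / t);
        return 0;
    }

For large n the throughput approaches 1/t, which is k times the 1/(k*t) rate of running one instruction at a time.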
What About Two Pipelines?
Code emitted by a compiler tailored for this architecture can run up to twice as fast as code emitted by a generic compiler.
Some pairs of instructions are not candidates for dual pipelining; in each pair below, the second instruction needs the value of C computed by the first (see the sketch after these examples).

    C = A + B        C = A + B
    D = A + C        C = C / D
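
Here is a minimal sketch of the check that a dual-pipeline issue stage (or a scheduling compiler) must make before pairing two instructions; the Instr fields and single-letter "registers" are invented for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        char dest;        /* register/variable written */
        char src1, src2;  /* registers/variables read */
    } Instr;

    /* True only when the pair is free of data hazards. */
    static bool can_dual_issue(Instr first, Instr second) {
        bool raw = (second.src1 == first.dest) ||
                   (second.src2 == first.dest);   /* read-after-write  */
        bool waw = (second.dest == first.dest);   /* write-after-write */
        return !raw && !waw;
    }

    int main(void) {
        Instr i1 = {'C', 'A', 'B'};   /* C = A + B */
        Instr i2 = {'D', 'A', 'C'};   /* D = A + C  -- reads C */
        Instr i3 = {'C', 'C', 'D'};   /* C = C / D  -- reads and rewrites C */

        printf("pair 1 dual-issue? %s\n", can_dual_issue(i1, i2) ? "yes" : "no"); /* no */
        printf("pair 2 dual-issue? %s\n", can_dual_issue(i1, i3) ? "yes" : "no"); /* no */
        return 0;
    }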
Superscalar Architectures
Having 2, 4, or 8 completely independent pipelines on a CPU is very resource-intensive, and it is not what a careful analysis of the bottlenecks would suggest.
Often, the execution units are the slowest units by a large margin. It is usually a better use of resources to replicate only the execution units, feeding them from a single pipeline front end (see the sketch below).
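
The sketch below illustrates that idea under assumed numbers (three execution units, made-up instruction latencies): a single fetch/decode stage issues to whichever execution unit is free, so long-latency operations overlap without duplicating the whole pipeline.

    #include <stdio.h>

    #define N_UNITS 3

    int main(void) {
        /* busy[u] = cycles until execution unit u is free again */
        int busy[N_UNITS] = {0, 0, 0};
        /* latency of each instruction in the (made-up) stream, in cycles */
        int stream[] = {1, 3, 1, 4, 2, 1, 3, 1};
        int n = sizeof(stream) / sizeof(stream[0]);

        int next = 0;                       /* next instruction to issue */
        for (int cycle = 0; next < n; cycle++) {
            /* The single front end issues at most one instruction per cycle
             * to a free execution unit; busy units keep working in parallel. */
            for (int u = 0; u < N_UNITS; u++) {
                if (busy[u] == 0 && next < n) {
                    busy[u] = stream[next];
                    printf("cycle %2d: issue instr %d (latency %d) to unit %d\n",
                           cycle, next, stream[next], u);
                    next++;
                    break;                  /* one issue per cycle in this sketch */
                }
            }
            for (int u = 0; u < N_UNITS; u++)
                if (busy[u] > 0) busy[u]--;
        }
        return 0;
    }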