CMSC 411 Lecture 16, Pipelining 1


Lecture 16, Pipelining 1

First, a few definitions:

Pipelining : Multiple instructions being executed, each in a different
             stage of their execution. A form of parallelism.

Super Pipelining : Advertising term, just longer pipelines.

Super Scalar : Having multiple ALUs. There may be a mix of some
               integer ALUs and some Floating Point ALUs.

Multiple Issue : Starting a few instructions every clock.
                 The CPI can be a fraction; ideally, 4 issue gives a CPI of 1/4.

Dynamic Pipeline : This may include all of the above and also can
                   reorder instructions, use data forwarding and
                   hazard workarounds.

Pipeline Stages : For our study of the MIPS architecture,
                  IF   Instruction Fetch stage
                  ID   Instruction Decode stage
                  EX   Execute stage
                  MEM  Memory access stage
                  WB   Write Back into register stage

Hyper anything : Generally advertising terminology.

Consider the single cycle machine in the previous lecture.
The goal is to speed up the execution of programs, which are long
sequences of instructions. Keeping the same manufacturing technology,
we can speed up the clock by inserting clocked registers at key points.
Note the placement of the blue registers, which tries to minimize the
gate delay time between any pair of registers, thus allowing a
faster clock.




This is called approximate because some additional design must
be performed, mostly on "control", that must now be distributed.
The next step in the design, for our project, is to pass the
instruction along the pipeline and keep the design of each
stage of the pipeline simple, just driven by the instruction
presently in that stage.
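
The idea of passing the instruction along the pipeline can be sketched
in a few lines of Python. This is only an illustration, not the course's
VHDL implementation: the names run_pipeline and program are hypothetical,
and each stage is modeled as a slot holding an instruction name rather
than real hardware. On every clock, each stage takes the instruction
held by the stage before it, and IF takes the next instruction from the
preloaded instruction memory.

```python
# Sketch only: models pipeline registers as a list that shifts once per clock.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def run_pipeline(program):
    """Return one snapshot per clock: a dict of stage name -> instruction."""
    pipe = [None] * len(STAGES)          # None models an empty stage
    snapshots = []
    fetch_index = 0
    # Enough clocks for the last instruction to reach WB.
    for _ in range(len(program) + len(STAGES) - 1):
        nxt = program[fetch_index] if fetch_index < len(program) else None
        fetch_index += 1
        pipe = [nxt] + pipe[:-1]         # every instruction moves one stage along
        snapshots.append(dict(zip(STAGES, pipe)))
    return snapshots

# Hypothetical preloaded instruction memory (names only, no operands).
program = ["lw", "nop", "nop", "addi"]
trace = run_pipeline(program)
for clock, snap in enumerate(trace, start=1):
    print(clock, [snap[s] or "-" for s in STAGES])
```

Because each stage only looks at the instruction it currently holds,
the control for a stage can be decoded locally from that instruction.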



pipe1.vhdl implementation moves instruction
            note clock and reset generation
            look at register behavioral implementation
            instruction memory is preloaded

pipe1.out just numbers used for demonstration


Pipelined Architecture with distributed control




pipe2.vhdl note additional entities
            equal6 for easy decoding
            data memory behavioral implementation

pipe2.out instructions move through stages

Timing analysis

Consider four instructions being executed, first on the single cycle
architecture, which needs 8ns per instruction.
The time for each part of the circuit is shown.
The clock would be:

 +---------------+               +---------------+               +------
 |               |               |               |               |
-+               +---------------+               +---------------+  

Single cycle execution  125 MHz clock
 0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17ns
 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
 +-------+---+-------+-------+---+
 |IF     |ID |  EX   |  MEM  |WB |
 +-------+---+-------+-------+---+
                                 +-------+---+-------+-------+---+
                                 |IF     |ID |  EX   |  MEM  |WB |
                                 +-------+---+-------+-------+---+
                                                                 +---
                                                                 |IF ... 24ns
                                                                 +---

                                                                      ... 32ns
The four instructions finished in 32ns.
An instruction started every 8ns.
An instruction finished every 8ns.

Now, the pipelined architecture has its clock determined by the slowest
part between clocked registers, typically the ALU. Using the same
2ns ALU time as above, the clock would be:

 +---+   +---+   +---+   +---+   +---+   +---+   +---+   +---+   +---+
 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
-+   +---+   +---+   +---+   +---+   +---+   +---+   +---+   +---+   +-

Pipelined Execution 500 MHz clock  **
 +-------+-------+-------+-------+-------+
 |IF     |ID  reg|  EX   |  MEM  |reg WB |
 +-------+-------+-------+-------+-------+
         +-------+-------+-------+-------+-------+
         |IF     |ID  reg|  EX   |  MEM  |reg WB |
         +-------+-------+-------+-------+-------+
                 +-------+-------+-------+-------+-------+
                 |IF     |ID  reg|  EX   |  MEM  |reg WB |
                 +-------+-------+-------+-------+-------+
                         +-------+-------+-------+-------+-------+
                         |IF     |ID  reg|  EX   |  MEM  |reg WB |
                         +-------+-------+-------+-------+-------+
                                      **
 |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
 0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17ns

The four instructions finished in 16ns, but the speedup is not
32ns/16ns = 2.
An instruction started every 2ns.
An instruction finished every 2ns.

Since an instruction finishes every 2ns for the pipelined architecture and
every 8ns for the single cycle architecture, the speedup is
8ns/2ns = 4. If the total time were used, the speedup would change with
the number of instructions. Thus, the time between the start or end of
adjacent instructions is used in computing speedup.
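
A small Python sketch shows why total time is a poor basis for speedup.
It uses only the numbers from the timing diagrams (8ns single cycle
clock, 2ns pipelined clock, 5 stages); the function names are
hypothetical, and an ideal pipeline with no stalls is assumed.

```python
SINGLE_CYCLE_NS = 8   # one instruction per 8ns clock
STAGE_NS = 2          # pipelined clock period
NUM_STAGES = 5        # IF, ID, EX, MEM, WB

def single_cycle_total(n):
    """Total time in ns for n instructions on the single cycle machine."""
    return n * SINGLE_CYCLE_NS

def pipelined_total(n):
    """Total time in ns for n instructions, assuming no stalls.

    The first instruction needs NUM_STAGES clocks; each later
    instruction finishes one clock after the one before it.
    """
    return (NUM_STAGES + n - 1) * STAGE_NS

for n in (4, 100, 1_000_000):
    ratio = single_cycle_total(n) / pipelined_total(n)
    print(n, single_cycle_total(n), pipelined_total(n), round(ratio, 3))

# Speedup from the time between adjacent instruction completions.
steady_state_speedup = SINGLE_CYCLE_NS / STAGE_NS
print(steady_state_speedup)   # 4.0
```

For 4 instructions the total-time ratio is 32/16 = 2; as the number of
instructions grows, the ratio approaches 4, the value obtained from the
time between adjacent instructions.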

Note the ** above in the pipeline. The first of the four instructions
may load a value into a register. This register write takes place on the
falling edge of the clock, halfway through the WB stage. The fourth
instruction is the earliest instruction that could use the register
loaded by the first instruction. The use of the register comes after the
rising edge of the clock. Thus, use of both halves of the clock cycle is
important to this architecture and to many modern computer architectures.

Remember, every stage of the pipeline must have the same time duration,
because the system clock is used by all pipeline registers.
The slowest stage determines this time duration and thus determines
the maximum clock frequency.
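
As a worked example, the stage delays can be read off the single cycle
timing diagram (IF 2ns, ID 1ns, EX 2ns, MEM 2ns, WB 1ns). The sketch
below, with hypothetical names, assumes the pipeline registers add no
delay of their own.

```python
# Stage delays in ns, read from the single cycle timing diagram.
stage_ns = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}

# Single cycle: the clock must cover all parts in sequence.
single_cycle_period = sum(stage_ns.values())
# Pipelined: the clock must cover only the slowest stage.
pipelined_period = max(stage_ns.values())

def mhz(period_ns):
    """Clock frequency in MHz for a period given in ns."""
    return 1000 / period_ns

print(single_cycle_period, mhz(single_cycle_period))   # 8 125.0
print(pipelined_period, mhz(pipelined_period))         # 2 500.0

# Each instruction now spends 5 full clocks in the pipeline.
pipelined_latency = len(stage_ns) * pipelined_period
print(pipelined_latency)                               # 10
```

Note that a single instruction takes 10ns to pass through the pipeline,
longer than the 8ns of the single cycle machine. The gain is in
throughput, not in the latency of one instruction.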

The worst case delay, which does not happen often because of optimizing
compilers, is a load word, lw, instruction followed by an instruction
that needs the value just loaded. The sequence of instructions, for
this unoptimized architecture, would be:
    lw   $1,val($0) load the 32 bit value at location val into register 1
    nop
    nop
    addi $2,21($1)  register 1 is available, add 21 and put result into reg 2

As can be seen in the pipelined timing below, lw would load register 1
by 9ns and register 1 would be read by addi by 10ns (**). The actual
add would be finished by 12ns and register 2 updated with the sum
by 15ns (***).

             +-------+-------+-------+-------+-------+
lw $1,val($0)|IF     |ID  reg|  EX   |  MEM  |reg WB |
             +-------+-------+-------+-------+-------+
                     +-------+-------+-------+-------+-------+
nop                  |IF     |ID  reg|  EX   |  MEM  |reg WB |
                     +-------+-------+-------+-------+-------+
                             +-------+-------+-------+-------+-------+
nop                          |IF     |ID  reg|  EX   |  MEM  |reg WB |
                             +-------+-------+-------+-------+-------+
                                     +-------+-------+-------+-------+-------+
addi $2,21($1)                       |IF     |ID  reg|  EX   |  MEM  |reg WB |
                                     +-------+-------+-------+-------+-------+
                                                  **                  ***
             |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
             0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
             ns
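
The times quoted for this sequence can be checked with a short Python
sketch. It assumes the model described in the text: 2ns stages, the
register written on the falling edge halfway through WB, and the
register read in the second half of ID. The function names are
hypothetical.

```python
STAGE_NS = 2
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_start(slot, stage):
    """Start time in ns of a stage for the instruction issued in clock slot `slot`."""
    return (slot + STAGES.index(stage)) * STAGE_NS

def reg_write_time(slot):
    """Register is written on the falling edge, halfway through WB."""
    return stage_start(slot, "WB") + STAGE_NS / 2

def reg_read_time(slot):
    """Register is read in the second half of ID, so it is available by the end of ID."""
    return stage_start(slot, "ID") + STAGE_NS

def nops_needed(producer_slot=0):
    """Number of nops so the consumer's read window starts no earlier than the write."""
    write = reg_write_time(producer_slot)
    nops = 0
    while stage_start(producer_slot + 1 + nops, "ID") + STAGE_NS / 2 < write:
        nops += 1
    return nops

print(reg_write_time(0))                    # lw writes register 1: 9.0
print(reg_read_time(3))                     # addi has read register 1: 10
print(stage_start(3, "EX") + STAGE_NS)      # add finished: 12
print(reg_write_time(3))                    # addi writes register 2: 15.0
print(nops_needed())                        # 2
```

Under these assumptions, two nops are the minimum that lets addi read
the value loaded by lw without data forwarding.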

It is interesting to note the similarity between the above design,
which resembles the MIPS R3000 architecture, and an IBM Power PC
that came a few years after the R3000.

IBM Power PC stages and clock usage

new IBM Power PC
Shipped 2012 at 5.5 GHz
