Lecture 19, Pipelining Data Forwarding


  Data forwarding example   CMSC 411 architecture

  Consider the five-stage pipeline architecture:

  IF instruction fetch, PC is the memory address of the instruction being fetched
  ID instruction decode and read of two register values
  EX execute instruction or compute data memory address
  M  data memory access to store or fetch a data word
  WB write back value into general register


         IF       ID          EX        M       WB
    +--+     +--+        +--+     +--+     +--+
    |  |     |  |        | A|-|\  |  |     |  |
    |  |     |  |    /---|  | \ \_|  |     |  |
    |PC|-(I)-|IR|-(R)  = |  | / / |  |-(D)-|  |--+
    |  |     |  |  ^ \---| B|-|/  |  |     |  |  |
    +--+     +--+  |     +--+     +--+     +--+  |
     ^        ^    |      ^   ALU  ^        ^    |
     |        |    |      |        |        |    |
 clk-+--------+-----------+--------+--------+    |
                   |                             |
                   +-----------------------------+

  Now consider the instruction sequence:

  400  lw  $1,100($0)  load general register 1 from memory location 100
  404  lw  $2,104($0)  load general register 2 from memory location 104
  408  nop
  40C  nop             wait for register $2 to get data
  410  add $3,$1,$2    add contents of registers 1 and 2, sum into register 3
  414  nop
  418  nop             wait for register $3 to get data
  41C  add $4,$3,$1    add contents of registers 3 and 1, sum into register 4
  420  nop
  424  nop             wait for register $4 to get data
  428  beq $3,$4,-100  branch to 314 if contents of registers 3 and 4 are equal
  42C  add $4,$4,$4    add ..., this is the "delayed branch slot", always executed


  The pipeline stage table with NO data forwarding is:

  lw   IF ID EX M  WB
  lw      IF ID EX M  WB
  nop        IF ID EX M  WB
  nop           IF ID EX M  WB
  add              IF ID EX M  WB
  nop                 IF ID EX M  WB
  nop                    IF ID EX M  WB
  add                       IF ID EX M  WB
  nop                          IF ID EX M  WB
  nop                             IF ID EX M  WB
  beq                                IF ID EX M  WB
  add                                   IF ID EX M  WB

  time 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16


Adding four multiplexors and the wiring for data forwarding removes
four of the six nops, cutting the sequence from 16 clock cycles to 12.



         IF       ID                  EX          M       WB
    +--+     +--+           +--+          +--+       +--+
    |  |     |  |           | A|-(X)--|\  |  |       |  |
    |  |     |  |    /-(X)--|  | | |  \ \_|  |       |  |
    |PC|-(I)-|IR|-(R)   | = |  | | |  / / |  |-+-(D)-|  |--+
    |  |     |  |  ^ \-(X)--| B|-(X)--|/  |  | |     |  |  |
    +--+     +--+  |    |   +--+ | |      +--+ |     +--+  |
     ^        ^    |    |    ^   | |  ALU  ^   |      ^    |
     |        |    |    |    |   | |       |   |      |    |
 clk-+--------+--------------+-------------+----------+    |
                   |    |        | |           |           |
                   |    +----------+-----------+           |
                   |             |                         |
                   +-------------+-------------------------+

  The pipeline stage table with data forwarding is:

  lw   IF ID EX M  WB
  lw      IF ID EX M  WB
  nop        IF ID EX M  WB                 saved one nop
  add           IF ID EX M  WB              $2 in WB and used in EX
  add              IF ID EX M  WB           saved two nops, $3 used
  nop                 IF ID EX M  WB        saved one nop
  beq                    IF ID EX M  WB     $4 in MEM and used in ID
  add                       IF ID EX M  WB

  time 1  2  3  4  5  6  7  8  9  10 11 12


  Note the required nop from using data immediately after a load.
  Note the required nop for the beq in the ID stage using an ALU result.


In the cs411 schematic, the data forwarding paths and the additional
multiplexors are shown in green. The control is explained below.

The green parts must be added to part2a.vhdl.
The blue parts already exist and are shown for discussion only; do not change them.

To understand the logic better, note that MEM_RD contains the register
destination of the output of the ALU and MEM_addr contains the value
of the output of the ALU for the instruction now in the MEM stage.

If the instruction in the EX stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the A side of the ALU.
(This is the A forward MEM_addr control signal.)

                   EX stage          MEM stage
                 add $4,$3,$1       add $3,$1,$2
                         |               |
                         +---------------+


If the instruction in the EX stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the B side of the ALU.
(This is the B forward MEM_addr control signal.)

                   EX stage          MEM stage
                 add $4,$1,$3       add $3,$1,$2
                            |            |
                            +------------+


To understand the logic better, note that WB_RD contains the register
destination of the output of the ALU or Memory and WB_result contains
the value of the output of the ALU or Memory for the instruction now
in the WB stage.

If the instruction in the EX stage has the WB_RD destination in
bits 25 downto 21, then WB_result must be routed to the A side of the ALU.
(This is the A forward WB_result control signal.)

If the instruction in the EX stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be routed to the B side of the ALU.
(This is the B forward WB_result control signal.)

Note that a beq instruction in the ID stage that needs a value from
the instruction in the WB stage does not need data forwarding.

If a beq instruction in the ID stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the top side of
the equal comparator.
(This is the 1 forward control signal.)

If a beq instruction in the ID stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the bottom side of
the equal comparator.
(This is the 2 forward control signal.)

           ID stage        EX stage        MEM stage
         beq $3,$4,-100      nop         add $4,$3,$1
                 |                            |
                 +----------------------------+
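
The two beq rules above can be sketched in VHDL as follows. This is only
a sketch under assumptions: the entity name, the signals fwd1, fwd2, cmp1,
cmp2 and equal are hypothetical, std_logic_vector stands in for the course
types, and the lw exclusion and the "100011" opcode are taken from the rule
summary and the opcode note later in this lecture (check cs411_opcodes.txt).

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical stand-alone wrapper around the ID-stage comparator inputs.
entity beq_forward_sketch is
  port (ID_IR          : in  std_logic_vector(31 downto 0);
        MEM_IR         : in  std_logic_vector(31 downto 0);
        MEM_RD         : in  std_logic_vector(4 downto 0);
        ID_read_data_1 : in  std_logic_vector(31 downto 0);
        ID_read_data_2 : in  std_logic_vector(31 downto 0);
        MEM_addr       : in  std_logic_vector(31 downto 0);
        equal          : out std_logic);
end entity beq_forward_sketch;

architecture behavior of beq_forward_sketch is
  alias ID_reg1 : std_logic_vector(4 downto 0) is ID_IR(25 downto 21);
  alias ID_reg2 : std_logic_vector(4 downto 0) is ID_IR(20 downto 16);
  alias MEM_OP  : std_logic_vector(5 downto 0) is MEM_IR(31 downto 26);
  -- assumed lw opcode, check this semester's cs411_opcodes.txt
  constant LW_OP : std_logic_vector(5 downto 0) := "100011";
  signal fwd1, fwd2 : std_logic;                      -- "1 forward", "2 forward"
  signal cmp1, cmp2 : std_logic_vector(31 downto 0);  -- top, bottom comparator inputs
begin
  -- forward only when the MEM-stage instruction writes a real register
  -- and its value is already available (not a lw still reading memory)
  fwd1 <= '1' when ID_reg1 = MEM_RD and MEM_RD /= "00000" and MEM_OP /= LW_OP
          else '0';
  fwd2 <= '1' when ID_reg2 = MEM_RD and MEM_RD /= "00000" and MEM_OP /= LW_OP
          else '0';

  -- default is the register file output, otherwise the forwarded ALU result
  cmp1 <= MEM_addr when fwd1 = '1' else ID_read_data_1;
  cmp2 <= MEM_addr when fwd2 = '1' else ID_read_data_2;

  equal <= '1' when cmp1 = cmp2 else '0';
end architecture behavior;
```

In part2a.vhdl the same expressions would be written directly on the
existing signals rather than wrapped in a separate entity.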



If a beq instruction in the ID stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be used by the bottom side of
the equal comparator.
(This happens by magic. Not really, two rules above apply.)

           ID stage        EX stage    MEM stage    WB stage
         beq $3,$4,-100      nop         nop       lw $4,8($3)
                 |                                     |
                 +-------------------------------------+




  The data forwarding rules can be summarized based on the
  cs411 schematic, shown above.

  ID stage beq data forwarding: 

      default with no data forwarding is ID_read_data_1
      1 forward MEM_addr is  ID_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

      default with no data forwarding is ID_read_data_2
      2 forward MEM_addr is  ID_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

  EX stage data forwarding:

      default with no data forwarding is EX_A
      A forward MEM_addr is  EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
      A forward WB_result is  EX_reg1=WB_RD and WB_RD/=0

      default with no data forwarding is EX_B
      B forward MEM_addr is  EX_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
      B forward WB_result is  EX_reg2=WB_RD and WB_RD/=0

      Note: the entity mux32_3 is designed to handle the above.

  ID_RD is 0 for ID_OP= beq, j, sw (nop, all zeros, automatic zero in RD)
           thus EX_RD, MEM_RD,  WB_RD = 0 for these instructions
           Because register zero is always zero, we can use 0 for
           a destination for every instruction that does not
           produce a result in a register. Thus no data forwarding
           will occur for instructions that do not produce a value
           in a register.


  note: ID_reg1 is ID_IR(25 downto 21)
        ID_reg2 is ID_IR(20 downto 16)
        EX_reg1 is EX_IR(25 downto 21)
        EX_reg2 is EX_IR(20 downto 16)
        MEM_OP  is MEM_IR(31 downto 26)
        EX_OP   is EX_IR(31 downto 26)
        ID_OP   is ID_IR(31 downto 26)

        These shorter names can be declared with VHDL alias statements:

        alias  ID_reg1 : word_5 is ID_IR(25 downto 21);
        alias  ID_reg2 : word_5 is ID_IR(20 downto 16);
        alias  EX_reg1 : word_5 is EX_IR(25 downto 21);
        alias  EX_reg2 : word_5 is EX_IR(20 downto 16);
        alias  MEM_OP  : word_6 is MEM_IR(31 downto 26);
        alias  EX_OP   : word_6 is EX_IR(31 downto 26);
        alias  ID_OP   : word_6 is ID_IR(31 downto 26);


Why is the priority mux, mux32_3, needed?
mux32_3.vhdl gives priority to ct1 over ct2.

Answer: Consider MEM_RD with a destination value 3 and
WB_RD with a destination value 3.

What should   add $4,$3,$3 use? MEM_addr or WB_result ?

For this to happen, some program or some person would have
written code such as:

     sub  $3,$12,$11
     add  $3,$1,$2
     add  $4,$3,$3   double the value of $3

Well, rather obviously, the result of the  sub  is never used and
thus the answer to our question is that MEM_addr must be used. This
is the closest prior instruction with the required result. The
correct design is implemented using the priority mux32_3 with the
MEM_addr in the  in1  priority input.
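
To make the priority concrete, here is a minimal sketch of a three-input
priority mux. It is not the course's mux32_3.vhdl: the text confirms only
ct1, ct2 and the in1 priority input, so the entity name and the port names
in0, in2 and result are assumptions.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Sketch only: in0, in2 and result are assumed port names.
entity mux32_3_sketch is
  port (in0    : in  std_logic_vector(31 downto 0);  -- default, e.g. EX_A
        in1    : in  std_logic_vector(31 downto 0);  -- priority input, e.g. MEM_addr
        in2    : in  std_logic_vector(31 downto 0);  -- e.g. WB_result
        ct1    : in  std_logic;                      -- selects in1, wins over ct2
        ct2    : in  std_logic;                      -- selects in2 only when ct1 is '0'
        result : out std_logic_vector(31 downto 0));
end entity mux32_3_sketch;

architecture behavior of mux32_3_sketch is
begin
  -- conditions are tested in order, so ct1 = '1' always selects in1
  result <= in1 when ct1 = '1' else
            in2 when ct2 = '1' else
            in0;
end architecture behavior;
```

With ct1 driven by "A forward MEM_addr" and ct2 by "A forward WB_result",
the add $4,$3,$3 case above raises both controls and still gets MEM_addr,
the result of the closest prior instruction.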


The control signal  A forward MEM_addr  may be implemented in VHDL as:
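
A minimal sketch, built from the rule  EX_reg1=MEM_RD and MEM_RD/=0 and
MEM_OP/=lw  in the summary above. The wrapper entity is hypothetical and
std_logic_vector stands in for the course types word_5 and word_6; the
signal name AFMA is the one used by the debug process in this lecture.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Hypothetical stand-alone wrapper; in part2a.vhdl the same expression
-- would be one concurrent statement on the existing signals.
entity afma_sketch is
  port (EX_IR  : in  std_logic_vector(31 downto 0);
        MEM_IR : in  std_logic_vector(31 downto 0);
        MEM_RD : in  std_logic_vector(4 downto 0);
        AFMA   : out std_logic);   -- A forward MEM_addr
end entity afma_sketch;

architecture behavior of afma_sketch is
  alias EX_reg1 : std_logic_vector(4 downto 0) is EX_IR(25 downto 21);
  alias MEM_OP  : std_logic_vector(5 downto 0) is MEM_IR(31 downto 26);
  -- assumed lw opcode, check this semester's cs411_opcodes.txt
  constant LW_OP : std_logic_vector(5 downto 0) := "100011";
begin
  AFMA <= '1' when EX_reg1 = MEM_RD      -- EX source A is the MEM destination
               and MEM_RD /= "00000"     -- register zero is never forwarded
               and MEM_OP /= LW_OP       -- a lw has no data yet in the MEM stage
          else '0';
end architecture behavior;
```

The B side follows the same pattern with EX_IR(20 downto 16), and the two
WB_result controls compare against WB_RD without the lw test.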



btw: 100011 in any_IR(31 downto 26) is the  lw  opcode in this example,
     be sure to check this semester's cs411_opcodes.txt


Here is where you may want to add a debug process. Replace AFMA
with any signal name of interest:

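   -- Needs  use STD.textio.all;  and, for std_logic signals or hwrite,
   -- use IEEE.std_logic_textio.all;  ahead of the entity or architecture,
   -- if they are not already there.
   prtAFMA: process (AFMA)
             variable my_line : LINE; -- my_line needs to be defined
           begin
             write(my_line, string'("AFMA="));
             write(my_line, AFMA);         -- or hwrite for long signals
             write(my_line, string'(" at="));
             write(my_line, now);         -- "now" is simulation time
             writeline(output, my_line);  -- outputs line
           end process prtAFMA;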


part2a.chk has the _RD signals and values


cs411_opcodes.txt for op code values

Now, to finish part2a.vhdl, the jump and branch instructions must be
implemented. This is shown in green on the upper part of the schematic.



The signal out of the jump address box would be coded in VHDL as:

jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";

The adder symbol is just another instance of your Homework 4, add32.

The "shift left 2" is a simple VHDL statement:

shifted2 <= ID_sign_ext(29 downto 0) & "00";
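
As a quick check of the two concatenations above, here is a small
self-checking testbench sketch. The entity name and the example values for
PCP, ID_IR and ID_sign_ext are made up for illustration; only the two
assignment statements come from this lecture.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

entity jump_addr_sketch is
end entity jump_addr_sketch;

architecture test of jump_addr_sketch is
  -- made-up example values
  signal PCP         : std_logic_vector(31 downto 0) := x"00000404";
  -- example instruction word; only bits 25 downto 0 (0x0000100) matter here
  signal ID_IR       : std_logic_vector(31 downto 0) := x"08000100";
  -- sign extended branch offset of -4 words
  signal ID_sign_ext : std_logic_vector(31 downto 0) := x"FFFFFFFC";
  signal jump_addr   : std_logic_vector(31 downto 0);
  signal shifted2    : std_logic_vector(31 downto 0);
begin
  jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";
  shifted2  <= ID_sign_ext(29 downto 0) & "00";

  check: process
  begin
    wait for 1 ns;  -- let the concurrent assignments settle
    -- word address 0x100 becomes byte address 0x400
    assert jump_addr = x"00000400"
      report "unexpected jump_addr" severity error;
    -- -4 words becomes -16 bytes
    assert shifted2 = x"FFFFFFF0"
      report "unexpected shifted2" severity error;
    wait;           -- stop this process
  end process check;
end architecture test;
```

Both statements multiply a word count by four by appending two zero bits;
jump_addr also keeps the top four bits of PCP.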

The project writeup:  part2a

For more debugging, uncomment the print process and diff against:
part2a_print.chk
part2a_print.chkg