Data forwarding example
CMSC 411 architecture

Consider the five stage pipeline architecture:

  IF  instruction fetch, PC is address into memory fetching instruction
  ID  instruction decode and register read out of two values
  EX  execute instruction or compute data memory address
  M   data memory access to store or fetch a data word
  WB  write back value into general register

    IF       ID          EX       M        WB
   +--+     +--+        +--+     +--+     +--+
   |  |     |  |        | A|-|\  |  |     |  |
   |  |     |  |    /---|  | \ \_|  |     |  |
   |PC|-(I)-|IR|-(R) =  |  | / / |  |-(D)-|  |--+
   |  |     |  |  ^ \---| B|-|/  |  |     |  |  |
   +--+     +--+  |     +--+     +--+     +--+  |
    ^        ^    |      ^  ALU   ^        ^    |
    |        |    |      |        |        |    |
clk-+--------+-----------+--------+--------+    |
                  |                             |
                  +-----------------------------+

Now consider the instruction sequence:

  400  lw  $1,100($0)   load general register 1 from memory location 100
  404  lw  $2,104($0)   load general register 2 from memory location 104
  408  nop
  40C  nop              wait for register $2 to get data
  410  add $3,$1,$2     add contents of registers 1 and 2, sum into register 3
  414  nop
  418  nop              wait for register $3 to get data
  41C  add $4,$3,$1     add contents of registers 3 and 1, sum into register 4
  420  nop
  424  nop              wait for register $4 to get data
  428  beq $3,$4,-100   branch if contents of register 3 and 4 are equal to 314
  42C  add $4,$4,$4     add ..., this is the "delayed branch slot" always exec.

The pipeline stage table with NO data forwarding is:

  lw    IF ID EX M  WB
  lw       IF ID EX M  WB
  nop         IF ID EX M  WB
  nop            IF ID EX M  WB
  add               IF ID EX M  WB
  nop                  IF ID EX M  WB
  nop                     IF ID EX M  WB
  add                        IF ID EX M  WB
  nop                           IF ID EX M  WB
  nop                              IF ID EX M  WB
  beq                                 IF ID EX M  WB
  add                                    IF ID EX M  WB
  time  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

This can be significantly improved with the addition of four
multiplexors and wiring.
    IF       ID             EX            M          WB
   +--+     +--+           +--+          +--+       +--+
   |  |     |  |           | A|-(X)--|\  |  |       |  |
   |  |     |  |    /-(X)--|  | | |  \ \_|  |       |  |
   |PC|-(I)-|IR|-(R)   | = |  | | |  / / |  |-+-(D)-|  |--+
   |  |     |  |  ^ \-(X)--| B|-(X)--|/  |  | |     |  |  |
   +--+     +--+  |    |   +--+ | |      +--+ |     +--+  |
    ^        ^    |    |    ^   | | ALU   ^   |      ^    |
    |        |    |    |    |   | |       |   |      |    |
clk-+--------+--------------+-------------+----------+    |
                  |    |        | |           |           |
                  |    +----------+-----------+           |
                  |             |                         |
                  +-------------+-------------------------+

The pipeline stage table with data forwarding is:

  lw    IF ID EX M  WB
  lw       IF ID EX M  WB
  nop         IF ID EX M  WB                   saved one nop
  add            IF ID EX M  WB                $2 in WB and used in EX
  add               IF ID EX M  WB             saved two nop's $3 used
  nop                  IF ID EX M  WB          saved one nop
  beq                     IF ID EX M  WB       $4 in MEM and used in ID
  add                        IF ID EX M  WB
  time  1  2  3  4  5  6  7  8  9  10 11 12

Note the required nop from using data immediately after a load.
Note the required nop for the beq in the ID stage using an ALU result.

The data forwarding paths are shown in green with the additional
multiplexors. The control is explained below.
Green must be added to part2a.vhdl.
Blue already exists, used for discussion, do not change.

To understand the logic better, note that MEM_RD contains the register
destination of the output of the ALU and MEM_addr contains the value
of the output of the ALU for the instruction now in the MEM stage.

If the instruction in the EX stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the A side of the ALU.
(This is the A forward MEM_addr control signal.)

    EX stage           MEM stage
    add $4,$3,$1       add $3,$1,$2
            |               |
            +---------------+

If the instruction in the EX stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the B side of the ALU.
(This is the B forward MEM_addr control signal.)
    EX stage           MEM stage
    add $4,$1,$3       add $3,$1,$2
               |            |
               +------------+

To understand the logic better, note that WB_RD contains the register
destination of the output of the ALU or Memory and WB_result contains
the value of the output of the ALU or Memory for the instruction now
in the WB stage.

If the instruction in the EX stage has the WB_RD destination in
bits 25 downto 21, then WB_result must be routed to the A side of the ALU.
(This is the A forward WB_result control signal.)

If the instruction in the EX stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be routed to the B side of the ALU.
(This is the B forward WB_result control signal.)

Note that a beq instruction in the ID stage that needs a value from
the instruction in the WB stage does not need data forwarding.

If a beq instruction in the ID stage has the MEM_RD destination in
bits 25 downto 21, then MEM_addr must be routed to the top side of the
equal comparator. (This is the 1 forward control signal.)

If a beq instruction in the ID stage has the MEM_RD destination in
bits 20 downto 16, then MEM_addr must be routed to the bottom side of the
equal comparator. (This is the 2 forward control signal.)

    ID stage         EX stage       MEM stage
    beq $3,$4,-100   nop            add $4,$3,$1
            |                            |
            +----------------------------+

If a beq instruction in the ID stage has the WB_RD destination in
bits 20 downto 16, then WB_result must be used by the bottom side of the
equal comparator. (This happens by magic. Not really, two rules above apply.)

    ID stage         EX stage    MEM stage    WB stage
    beq $3,$4,-100   nop         nop          lw $4,8($3)
            |                                     |
            +-------------------------------------+

The data forwarding rules can be summarized based on the cs411
schematic, shown above.
ID stage beq data forwarding:
  default with no data forwarding is ID_read_data_1
  1 forward MEM_addr  is ID_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

  default with no data forwarding is ID_read_data_2
  2 forward MEM_addr  is ID_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

EX stage data forwarding:
  default with no data forwarding is EX_A
  A forward MEM_addr  is EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
  A forward WB_result is EX_reg1=WB_RD  and WB_RD/=0

  default with no data forwarding is EX_B
  B forward MEM_addr  is EX_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
  B forward WB_result is EX_reg2=WB_RD  and WB_RD/=0

Note: the entity mux32_3 is designed to handle the above.

  ID_RD is 0 for ID_OP = beq, j, sw (nop, all zeros, automatic zero in RD)
  thus EX_RD, MEM_RD, WB_RD = 0 for these instructions

Because register zero is always zero, we can use 0 for a destination
for every instruction that does not produce a result in a register.
Thus no data forwarding will occur for instructions that do not
produce a value in a register.

note: ID_reg1 is ID_IR(25 downto 21)
      ID_reg2 is ID_IR(20 downto 16)
      EX_reg1 is EX_IR(25 downto 21)
      EX_reg2 is EX_IR(20 downto 16)
      MEM_OP  is MEM_IR(31 downto 26)
      EX_OP   is EX_IR(31 downto 26)
      ID_OP   is ID_IR(31 downto 26)

These shorter names can be used with VHDL alias statements:

  alias ID_reg1 : word_5 is ID_IR(25 downto 21);
  alias ID_reg2 : word_5 is ID_IR(20 downto 16);
  alias EX_reg1 : word_5 is EX_IR(25 downto 21);
  alias EX_reg2 : word_5 is EX_IR(20 downto 16);
  alias MEM_OP  : word_6 is MEM_IR(31 downto 26);
  alias EX_OP   : word_6 is EX_IR(31 downto 26);
  alias ID_OP   : word_6 is ID_IR(31 downto 26);

Why is the priority mux, mux32_3, needed?
mux32_3.vhdl gives priority to ct1 over ct2.

Answer: Consider MEM_RD with a destination value 3 and
WB_RD with a destination value 3.
What should add $4,$3,$3 use? MEM_addr or WB_result?
For this to happen, some program or some person would have written
code such as:

  sub $3,$12,$11
  add $3,$1,$2
  add $4,$3,$3     double the value of $3

Well, rather obviously, the result of the sub is never used and
thus the answer to our question is that MEM_addr must be used.
This is the closest prior instruction with the required result.

The correct design is implemented using the priority mux32_3
with the MEM_addr in the in1 priority input.

The control signal A forward MEM_addr may be implemented in VHDL as:

btw: 100011 in any_IR(31 downto 26) is the lw opcode in this example,
be sure to check this semester's cs411_opcodes.txt

Here is where you may want to add a debug process.
Replace AFMA with any signal name of interest:

  -- write/hwrite on std_logic signals typically need
  -- use STD.textio.all; and use IEEE.std_logic_textio.all;
  prtAFMA: process (AFMA)
             variable my_line : LINE;     -- my_line needs to be defined
           begin
             write(my_line, string'("AFMA="));
             write(my_line, AFMA);        -- or hwrite for long signals
             write(my_line, string'(" at="));
             write(my_line, now);         -- "now" is simulation time
             writeline(output, my_line);  -- outputs line
           end process prtAFMA;

part2a.chk has the _RD signals and values
cs411_opcodes.txt for op code values

Now, to finish part2a.vhdl, the jump and branch instructions
must be implemented. This is shown in green on the upper part of
the schematic.

The signal out of the jump address box would be coded in VHDL as:

  jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";

The adder symbol is just another instance of your Homework 4, add32.

The "shift left 2" is a simple VHDL statement:

  shifted2 <= ID_sign_ext(29 downto 0) & "00";

The project writeup: part2a

For more debugging, uncomment the print process and diff against:
part2a_print.chk
part2a_print.chkg
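
For reference, the EX stage A-side rules summarized earlier (A forward MEM_addr, A forward WB_result, default EX_A) can be sketched as one small stand-alone entity. This is only an illustrative sketch under stated assumptions: the entity name fwd_a_sketch, its port list, the signal name AFWB, and the use of std_logic_vector in place of word_5/word_6 are all made up for this example and are not taken from part2a.vhdl or mux32_3.vhdl. The lw opcode "100011" is the value quoted in these notes and should be checked against cs411_opcodes.txt.

```vhdl
-- Sketch only: names and ports below are assumptions for illustration,
-- not the contents of part2a.vhdl or mux32_3.vhdl.
library IEEE;
use IEEE.std_logic_1164.all;

entity fwd_a_sketch is
  port (EX_IR     : in  std_logic_vector(31 downto 0);  -- instruction in EX
        MEM_IR    : in  std_logic_vector(31 downto 0);  -- instruction in MEM
        MEM_RD    : in  std_logic_vector(4 downto 0);   -- destination in MEM
        WB_RD     : in  std_logic_vector(4 downto 0);   -- destination in WB
        EX_A      : in  std_logic_vector(31 downto 0);  -- default, no forwarding
        MEM_addr  : in  std_logic_vector(31 downto 0);  -- ALU output now in MEM
        WB_result : in  std_logic_vector(31 downto 0);  -- value now in WB
        ALU_A     : out std_logic_vector(31 downto 0)); -- A side of the ALU
end entity fwd_a_sketch;

architecture behavior of fwd_a_sketch is
  alias EX_reg1 : std_logic_vector(4 downto 0) is EX_IR(25 downto 21);
  alias MEM_OP  : std_logic_vector(5 downto 0) is MEM_IR(31 downto 26);
  -- lw opcode as quoted in the notes; check cs411_opcodes.txt
  constant lw_op : std_logic_vector(5 downto 0) := "100011";
  signal AFMA : std_logic;  -- A forward MEM_addr
  signal AFWB : std_logic;  -- A forward WB_result (hypothetical name)
begin
  -- A forward MEM_addr is EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
  AFMA <= '1' when EX_reg1 = MEM_RD and MEM_RD /= "00000"
                   and MEM_OP /= lw_op
          else '0';

  -- A forward WB_result is EX_reg1=WB_RD and WB_RD/=0
  AFWB <= '1' when EX_reg1 = WB_RD and WB_RD /= "00000"
          else '0';

  -- Priority selection: MEM_addr wins over WB_result, because the
  -- instruction in MEM is the closest prior producer of the register.
  ALU_A <= MEM_addr  when AFMA = '1' else
           WB_result when AFWB = '1' else
           EX_A;
end architecture behavior;
```

With the sequence sub $3,$12,$11 / add $3,$1,$2 / add $4,$3,$3, both AFMA and AFWB would be '1' while the last add is in EX, and the conditional assignment picks MEM_addr first, which mirrors the ct1 over ct2 priority described for mux32_3. In the project itself the two control signals would presumably drive the existing mux32_3 instance rather than a conditional assignment like this one.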