Our design goal is to eliminate the need for nop instructions.
The design method is to detect the need for a nop and stall the IF and ID
stages of the pipeline, inserting a nop into the execution stage
instruction register, EX_IR.

The initial instruction sequence was:

  400 lw  $1,100($0)   load general register 1 from memory location 100
  404 lw  $2,104($0)   load general register 2 from memory location 104
  408 nop
  40C nop              wait for register $2 to get data
  410 add $3,$1,$2     add contents of registers 1 and 2, sum into register 3
  414 nop
  418 nop              wait for register $3 to get data
  41C add $4,$3,$1     add contents of registers 3 and 1, sum into register 4
  420 nop
  424 nop              wait for register $4 to get data
  428 beq $3,$4,-100   branch to 314 if contents of registers 3 and 4 are equal
  42C add $4,$4,$4     add ..., this is the "delayed branch slot", always executed

The pipeline stage table with data forwarding and automatic
hazard elimination reduces to:

  400 lw  $1,100($0)    IF ID EX M  WB
  404 lw  $2,104($0)       IF ID EX M  WB
  408 add $3,$1,$2            IF ID ID EX M  WB  --
  40C add $4,$3,$1               IF IF ID EX M  WB
  410 beq $3,$4,-100                   IF ID ID EX M  WB
  414 add $4,$4,$4                        IF IF ID EX M  WB
                time    1  2  3  4  5  6  7  8  9  10 11 12   (actually clock count)

On any clock there can be only one instruction in each pipeline stage,
so no column may have more than one instruction in each stage.
Empty stages do not need to be shown; any unlisted stage has an
inserted nop. (useful for Homework 8)

Note that the -- indicates that the IF stage and ID stage have stalled.
The -- also indicates that a nop instruction has automatically been
inserted into the EX stage.
A new instruction cannot move into the ID stage when an instruction is stalled there.
A new instruction cannot move into the IF stage when an instruction is stalled there.

The compiler may now generate compressed code for the computer
architecture, saving on memory bandwidth because nop instructions
are not needed in the executable memory image.
(Except a rare nop instruction after a branch or jump instruction.)

The primary task will be the implementation of a "stall" signal for
the project part2b.vhdl. The "stall" signal will then be used to prevent
clocking of the instruction fetch (IF) stage and the instruction decode (ID)
stage by using a new clock signal, "sclk". The explanation for generating
"sclk" is presented below.

Note that when the nop instruction is muxed into EX_IR, EX_RD must also
be set to zero, as is already done for beq, sw and jump.

The changes in part2b.vhdl are in the IF and ID stages.
The parts shown in green must be added.
The signal "stall" is computed from the information presented below.

A "hazard" is a condition in the pipeline when a stage of the pipeline
would not perform the correct processing with the available data.
To decide whether something is a hazard, the action of data forwarding,
covered in the previous lecture, must be taken into account.

Some cases where hazards would occur are:

  lw  $1,100($0)
  add $2,$1,$1

    EX stage          MEM stage
    add $2,$1,$1      lw $1,100($0)     hazard! value for $1 not available

  Thus hold add $2,$1,$1 in the ID stage and insert a nop in EX; this is a stall.

    ID stage          EX stage          MEM stage
    add $2,$1,$1      nop               lw $1,100($0)     no hazard

    ID stage          EX stage          MEM stage         WB stage
                      add $2,$1,$1      nop               lw $1,100($0)     no hazard
                            |  |                             |
                            +--+-----------------------------+  data forwarding


  add $4,$3,$1
  beq $3,$4,-100

    ID stage          EX stage
    beq $3,$4,-100    add $4,$3,$1      hazard! value for $4 not available

    ID stage          EX stage          MEM stage
    beq $3,$4,-100    nop               add $4,$3,$1      no hazard
           |                                |
           +--------------------------------+  data forwarding


  lw  $5,40($1)
  beq $5,$4,L2

    ID stage          EX stage
    beq $5,$4,L2      lw $5,40($1)      hazard! value for $5 not available

    ID stage          EX stage          MEM stage
    beq $5,$4,L2      nop               lw $5,40($1)      hazard! value for $5 not available

    ID stage          EX stage          MEM stage         WB stage
    beq $5,$4,L2      nop               nop               lw $5,40($1)      no hazard
        |                                                    |
        +----------------------------------------------------+  normal lw

Cases for stall hazards (taking into account data forwarding)
based on the cs411 schematic.
This is NOT VHDL, just definitions.

  Note: ( OP    stands for opcode, bits (31 downto 26)
          lw    stands for load word opcode "100011"
          addi  stands for add immediate opcode "001100" etc.
          rr_op stands for OP = "000000" )

  lw $a, ...
  op $b, $a, $a     where op is rr_op, beq, sw

    stall_lw is  EX_OP=lw and EX_RD/=0 and
                 (ID_reg1=EX_RD or ID_reg2=EX_RD) and
                 ID_OP/=lw and ID_OP/=addi and ID_OP/=j

    (note: the above handles the special cases where
     sw needs both registers. sll, srl, cmpl have a zero
     in unused register. no stall can occur based on
     EX_RD, MEM_RD or WB_RD = 0)

  lw $a, ...
  lw $b,addr($a)    or   addi $b,addr($a)

    stall_lwlw is  EX_OP=lw and EX_RD/=0 and
                   (ID_OP=lw or ID_OP=addi) and
                   ID_reg1=EX_RD

  lw  $a ...
  beq $a,$a, ...

    stall_mem is  ID_OP=beq and MEM_RD/=0 and MEM_OP=lw and
                  (ID_reg1=MEM_RD or ID_reg2=MEM_RD)

  op  $a, ...       where op is rr_op and addi
  beq $a,$a, ...

    stall_beq is  ID_OP=beq and EX_RD/=0 and
                  (ID_reg1=EX_RD or ID_reg2=EX_RD)

  ID_RD is 0 for ID_OP= beq, j, sw, stall (nop automatic zero)
  thus EX_RD, MEM_RD, WB_RD = 0 for these instructions

  rr_op is "000000" for add, sub, cmpl, sll, srl, and, mul, ...

  stall is  stall_lw or stall_lwlw or stall_mem or stall_beq

Be sure to use this semester's cs411_opcodes.txt, it changes every semester.
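As a rough illustration, the four definitions above could be written as
VHDL concurrent statements along the following lines. This is only a
sketch under stated assumptions, not the part2b.vhdl solution: the entity
name stall_sketch, the port list, and the beqop and jop values (standard
MIPS encodings) are assumptions made for this example. The real opcodes
must come from this semester's cs411_opcodes.txt. The last statement forms
sclk as "clk or stall", the rule these notes give for rising edge registers.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

-- Sketch only. Entity name and ports are hypothetical, chosen to match
-- the signal names used in the definitions (ID_OP, EX_RD, MEM_RD, ...).
entity stall_sketch is
  port (clk     : in  std_logic;
        ID_OP   : in  std_logic_vector(5 downto 0);
        ID_reg1 : in  std_logic_vector(4 downto 0);
        ID_reg2 : in  std_logic_vector(4 downto 0);
        EX_OP   : in  std_logic_vector(5 downto 0);
        EX_RD   : in  std_logic_vector(4 downto 0);
        MEM_OP  : in  std_logic_vector(5 downto 0);
        MEM_RD  : in  std_logic_vector(4 downto 0);
        stall   : out std_logic;
        sclk    : out std_logic);
end entity stall_sketch;

architecture behavior of stall_sketch is
  constant lwop   : std_logic_vector(5 downto 0) := "100011";
  constant addiop : std_logic_vector(5 downto 0) := "001100";
  constant beqop  : std_logic_vector(5 downto 0) := "000100"; -- ASSUMED, check opcodes file
  constant jop    : std_logic_vector(5 downto 0) := "000010"; -- ASSUMED, check opcodes file
  constant zero5  : std_logic_vector(4 downto 0) := "00000";
  signal stall_lw, stall_lwlw, stall_mem, stall_beq : std_logic;
  signal stall_i  : std_logic;  -- internal copy, an out port cannot be read
begin
  -- lw in EX, dependent instruction (not lw, addi, j) in ID
  stall_lw   <= '1' when EX_OP = lwop and EX_RD /= zero5 and
                         (ID_reg1 = EX_RD or ID_reg2 = EX_RD) and
                         ID_OP /= lwop and ID_OP /= addiop and ID_OP /= jop
                else '0';

  -- lw in EX, lw or addi in ID using the loaded register as reg1
  stall_lwlw <= '1' when EX_OP = lwop and EX_RD /= zero5 and
                         (ID_OP = lwop or ID_OP = addiop) and
                         ID_reg1 = EX_RD
                else '0';

  -- lw in MEM, dependent beq in ID
  stall_mem  <= '1' when ID_OP = beqop and MEM_RD /= zero5 and
                         MEM_OP = lwop and
                         (ID_reg1 = MEM_RD or ID_reg2 = MEM_RD)
                else '0';

  -- result-producing instruction in EX, dependent beq in ID
  stall_beq  <= '1' when ID_OP = beqop and EX_RD /= zero5 and
                         (ID_reg1 = EX_RD or ID_reg2 = EX_RD)
                else '0';

  stall_i <= stall_lw or stall_lwlw or stall_mem or stall_beq;
  stall   <= stall_i;
  sclk    <= clk or stall_i;  -- holds IF and ID registers while stalled
end architecture behavior;
```

In the project these statements would sit inside the existing part2b.vhdl
architecture rather than in a separate entity; the entity wrapper is used
here only so the sketch can be analyzed on its own.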
cs411_opcodes.txt for op codes

A partial implementation of stall_lw is:

  (figure not reproduced here)

To get slw5, use "001100" for addiop per cs411_opcodes.txt.

To check on the "stall" signal, you may need to add:

  -- needs "use STD.textio.all;" and "use IEEE.std_logic_textio.all;"
  -- at the top of the file
  prtstall: process (stall)
              variable my_line : LINE;     -- my_line needs to be defined
            begin
              write(my_line, string'("stall="));
              write(my_line, stall);       -- or hwrite for long signals
              write(my_line, string'(" at="));
              write(my_line, now);         -- "now" is simulation time
              writeline(output, my_line);  -- outputs line
            end process prtstall;

The stall clock, sclk, is:

  for rising edge registers    clk or stall    (our circuit)

For checking your results:

  part2b.chk   look for inserted nop's
  part2b.jpg   complete schematic as jpeg image
  part2b.ps    complete schematic as postscript image

Project writeup part2b

Why is eliminating nop from the load image important?
Answer: memory bandwidth.

RAM has always been slower than the CPU, often by a factor of 10.
Thus, the path from RAM into the CPU has been made wide.
A 64 bit wide memory bus is considered small today.
128 bit and 256 bit memory input to the CPU is common.

Many articles have been written that say "adding more RAM to
your computer will give more performance improvement than
adding a faster CPU." This is often true because of the
complex interaction of the operating system, application
software, computer architecture and peripheral equipment.

Adding RAM to most computers is easy and can be done by
non-experts. The important step in adding more RAM is
to get the correct Dual Inline Memory Modules, DIMMs.
There are speed considerations, voltage considerations,
number of pins and possible pairing considerations.
The problem is that there are many choices.
The following table indicates some of the choices yet does
not include RAM size.
  Type  Memory   Symbol     Module      DIMM  Nominal   Memory
        Bus                 Bandwidth   Pins  Voltage   clock

  DDR4  1700MHz  PC4-2133   25.6GB/sec  288   1.2 volt
  DDR3  1600MHz  PC3-12800  12.8GB/sec  240   1.6 volt  200MHz   38.4GB/sec, may be triple channel
  DDR3  1333MHz  PC3-10600  10.7GB/sec  240   1.6 volt  166MHz
  DDR3  1066MHz  PC3-8500    8.5GB/sec  240   1.6 volt  133MHz
  DDR3   800MHz  PC3-6400    6.4GB/sec  240   1.6 volt  100MHz   (10ns)
  DDR2  1066MHz  PC2-8500   17.0GB/sec  240   2.2 volt           two channel
  DDR2  1000MHz  PC2-8000   16.0GB/sec  240   2.2 volt
  DDR2   900MHz  PC2-7200   14.4GB/sec  240   2.2 volt
  DDR2   800MHz  PC2-6400   12.8GB/sec  240   2.2 volt
  DDR2   667MHz  PC2-5300   10.6GB/sec  240   2.2 volt
  DDR2   533MHz  PC2-4200    8.5GB/sec  240   2.2 volt
  DDR2   400MHz  PC2-3200    6.4GB/sec  240   2.2 volt
  DDR    556MHz  PC-4500     9.0GB/sec  184   2.6 volt
  DDR    533MHz  PC-4200     8.4GB/sec  184   2.6 volt
  DDR    500MHz  PC-4000     8.0GB/sec  184   2.6 volt
  DDR    466MHz  PC-3700     7.4GB/sec  184   2.6 volt
  DDR    433MHz  PC-3500     7.0GB/sec  184   2.6 volt
  DDR    400MHz  PC-3200     6.4GB/sec  184   2.6 volt
  DDR    366MHz  PC-3000     5.8GB/sec  184   2.6 volt
  DDR    333MHz  PC-2700     5.3GB/sec  184   2.6 volt
  DDR    266MHz  PC-2100     4.2GB/sec  184   2.6 volt
  DDR    200MHz  PC-1600     3.2GB/sec  184   2.6 volt

Pre-DDR machines had 168 pin, 3.3 volt DIMMs.
Older machines had 72 pin RAM.

Then, there is the size of the DIMM in bytes.
(may need 2 DDR2 or 3 DDR3 in parallel, minimum 6GB DDR3)

   128MB
   256MB
   512MB
  1024MB  1GB
  2048MB  2GB
  4096MB  4GB

Then, there is a choice of NON-ECC or ECC, Error Correcting Code,
which may be desired in commercial systems.

Then, possibly a choice of buffered or unbuffered.

Then, a choice of response time: CL3, CL4, CL5 clock waits.
(in detail, this may be given in 7-7-7-20 notation)

Then, shop by price or by the manufacturer's history of reliability.

Some systems require that DIMMs of the same size and speed be
installed in pairs. Read your computer's manual or check for
information on web sites.

I have used the following sites to get information and
purchase more RAM.

  www.crucial.com
    You may search by your computer's make and model,
    or by DDR2 and see specifications to find what is available.
  www.kingston.com
  www.kingston.com KHX8500
  www.valueram.com/datsheets/KHX8500D2_1G.pdf

Now, how can an architecture best make use of the combination
of pipelines and memory?
The IBM Cell Processor uses an architecture of a general purpose CPU
on chip with eight additional pipeline processors.

  Cell-tutorial.pdf

HW8 is assigned
part2b is assigned

For more debugging, uncomment the print process and diff against:
  part2b_print.chk