The most important item on all homework is YOUR NAME, printed! No readable name, no credit. If you submit by EMail, put your name inside the attachment. Staple or clip pages together when turning in paper.
Homework must be submitted when due, or soon after. If I cannot read or understand your homework, you do not get credit. Type or print if your handwriting is bad.
OK to submit .doc, .docx, .pdf, .png, etc. You may use a word processor or other software tools.
All parties involved in copying get a zero or worse on that assignment.
1) From the diagram included below:

   a) What is the speedup of the pipelined execution (bottom) over the
      single cycle execution (top) for the three instructions?
      (This is an unusual question; b) is the normal speedup.)

   b) For a large number of instructions we consider how often an
      instruction can be completed, or started. From the figure you can
      see the pipelined execution starts an instruction every 4ns while
      the single cycle execution starts an instruction every 14ns.
      What is the speedup when a large number of instructions are
      executed?

   c) Make a change to both executions. Make the ALU take 5ns rather
      than the 4ns shown in the figure. Neatly redraw the diagram with
      the new ALU time. Remember every time block on the pipelined
      execution must be the same. 80 column neat attachment or print
      it yourself.

   d) What is the speedup for c) when a large number of instructions
      are executed?

Diagram:

Single cycle execution
0   2   4   6   8   10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42ns
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
+---+---+-------+-------+---+
|IF |reg|  ALU  | DATA  |reg|
+---+---+-------+-------+---+
                            +---+---+-------+-------+---+
                            |IF |reg|  ALU  | DATA  |reg|
                            +---+---+-------+-------+---+
                                                        +---+---+-------+-------+---+
                                                        |IF |reg|  ALU  | DATA  |reg|
                                                        +---+---+-------+-------+---+

Pipelined Execution
+-------+-------+-------+-------+-------+
|IF     |    reg|  ALU  | DATA  |reg    |
+-------+-------+-------+-------+-------+
        +-------+-------+-------+-------+-------+
        |IF     |    reg|  ALU  | DATA  |reg    |
        +-------+-------+-------+-------+-------+
                +-------+-------+-------+-------+-------+
                |IF     |    reg|  ALU  | DATA  |reg    |
                +-------+-------+-------+-------+-------+
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
0   2   4   6   8   10  12  14  16  18  20  22  24  26  28  30  32  34  36  38ns

2) Show the register pipeline dependency by circling the register where
   a value is computed and circling the register where that value is
   used, connecting the circles by a line.
   (Spread the code out for clarity, as needed.)
       add $5, $3, $4
       add $4, $3, $5
       add $3, $4, $5
       add $4, $3, $4

3) The following four lines of code can be reduced to exactly three
   lines of code that produce the same output for all possible initial
   register values. Basically, you are reorganizing the code to put a
   useful instruction in the delayed branch slot. Every line must be
   correct to get credit for this part. You do not need to draw a
   pipeline diagram. Assume a correct computer that does not need any
   nop except after a branch or jump.

   Loop: lw   $2, 64($3)
         addi $3, $3, 4
         beq  $3, $4, Loop
         nop

   Be sure to walk through this code and your code for initial
   conditions: $3 has 4, $4 has 8, memory location 68 has 16,
   memory location 72 has 32, memory location 76 has 64.
   The results must be the same for the given code and your code.
   Then check $3 has 12, $4 has 8, same memory.
   The results must be the same for the given code and your code.

   The "C" code could be something like:

   loop: r2=mem[64+r3];        /* r2, r3, r4 convert from variables to registers */
         r3+=4;
         if(r3==r4) goto loop; /* cases are r3==r4-4 at start, else any r3 */
                               /* can only go to loop once or zero times */

   This exercise is demonstrating that the "delayed branch slot" does
   not have to contain a nop. Your assembly code will not have a nop
   but will reorder and change the instructions to have some other
   instruction after the beq.

submit cs411 HW7 your.file1 your.file2 ...
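The speedup arithmetic behind question 1 can be sketched in a few lines of Python. This is only an illustration, assuming an ideal pipeline with no stalls in which every stage is stretched to the slowest stage. The function names are made up for this note, and the stage times are hypothetical, not the ones in the figure.

```python
def single_cycle_time(stage_ns, n):
    """Total time for n instructions when each runs start to finish alone."""
    return n * sum(stage_ns)


def pipelined_time(stage_ns, n):
    """Total time for n instructions in an ideal pipeline (no stalls).

    Every stage is stretched to the slowest stage. The first instruction
    needs len(stage_ns) clocks and each later one finishes one clock later.
    """
    clock = max(stage_ns)
    return (len(stage_ns) + n - 1) * clock


def speedup(stage_ns, n):
    """Single cycle time divided by pipelined time for n instructions."""
    return single_cycle_time(stage_ns, n) / pipelined_time(stage_ns, n)


# Hypothetical stage times in ns (IF, reg, ALU, DATA, reg), not the figure's.
stages = [1, 1, 3, 2, 1]
for n in (3, 1_000_000):
    print(n, round(speedup(stages, n), 3))
```

For a large number of instructions the ratio approaches sum(stages) / max(stages), which is the same as comparing how often each design can start an instruction.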
1) Note: If a data hazard is NOT prevented by data forwarding, then the
   pipeline stalls. Assume data forwarding is working. Does the
   following code stall? "Stall" means it needs an extra nop
   instruction inserted. If the code stalls, list the labels of the
   instruction(s) that cause a stall.
   (Assume you have completed part2a and part2b of the project.)
   This question is asking if the stall signal is set to '1', four cases.
   See lecture 20 for the four cases and sample pipeline stages.

   L1:  lw  $3, 50($3)
   L2:  add $2, $3, $4
   L3:  sw  $2, 20($2)
   L4:  lw  $5, 30($2)
   L5:  lw  $6, 40($5)
   L6:  or  $5, $6, $5

2) Draw the pipeline diagram. You may use a style similar to Lecture 19
   and 20, or similar to HW7 simplified to use one clock at one ns for
   every stage. There are no nop instructions to be fetched, yet show
   the effect of hazard prevention by repeating the stage name when an
   instruction is stalled. We have full data forwarding and hazard
   prevention that automatically inserts nop as needed into the pipeline.
   (Assume you have completed part2a and part2b of the project.)

        add $5, $6, $7
        lw  $6, 60($5)
        sub $7, $6, $5
        or  $8, $4, $6

3) How many total clock cycles, start to completion, are required by
   your expanded code in part 2)?

4) Assuming all instructions are in the cache: If the nop instruction
   was actually in the code, rather than being inserted into the
   pipeline, would the total clock cycles be the same as in question 3)?

submit cs411 HW8 your.file1 your.file2 ...
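As a rough aid for thinking about question 1, the classic load-use check can be sketched in Python. This is an assumption-laden sketch, not the project's stall logic: it only models the textbook case where a lw is immediately followed by an instruction that reads the loaded register, and lecture 20 lists four cases that may include more. The helper names and the sample sequence are hypothetical.

```python
import re


def parse(line):
    """Return (op, dest, sources) for a small subset of MIPS instructions."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\d+", rest)
    if op in ("sw", "beq"):          # these read registers and write none
        return op, None, regs
    return op, regs[0], regs[1:]     # lw, add, sub, or, and, addi, ...


def load_use_stalls(program):
    """Indices of instructions that need one stall cycle.

    Assumption: with full forwarding the only stall is a lw immediately
    followed by an instruction that reads the register being loaded.
    """
    stalls = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = parse(program[i - 1])
        _, _, sources = parse(program[i])
        if prev_op == "lw" and prev_dest in sources:
            stalls.append(i)
    return stalls


# Hypothetical sequence, not the homework code.
code = [
    "lw  $1, 0($2)",
    "add $3, $1, $4",   # reads $1 right after the load
    "lw  $5, 8($3)",
    "sub $6, $7, $8",   # independent of $5
    "or  $9, $5, $6",   # $5 was loaded two instructions earlier
]
print(load_use_stalls(code))
```

Only the instruction directly after a load is checked, because a value loaded two or more instructions earlier can be forwarded in time under this model.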
See Lecture 21 for method and sample.

Given the following sequence of word addresses, in decimal:

   1, 4, 2, 7, 25, 4, 27, 3, 5, 24, 19, 18, 57, 1, 27, 3, 11

(No modification needed, just convert to binary: 1 is 000001,
 56 is 111000, etc. These are memory addresses.)

1) Simulate an 8 word cache with one word per block, direct mapped.
   a) For each address, list the six bit binary and indicate
      H for hit, M for miss.
   b) Draw the cache showing cache binary address and cache contents
      after all addresses have been processed. Use (1) for the contents
      of memory address 1, (4) for the contents of memory address 4,
      etc. Show valid bit and tag.

2) Simulate a 16 word cache with four words per block, direct mapped.
   a) For each address, list the six bit binary and indicate
      H for hit, M for miss.
   b) Draw the cache showing cache binary address and cache contents
      after all addresses have been processed. Use (1) for the contents
      of memory address 1, (4) for the contents of memory address 4,
      etc. Show valid bit and tag.

Typical format for partial data at end of sequence.

    eight word cache             sixteen word cache
    one word per block           four words per block

                                              data
         v tag data              v tag  00   01   10   11
        +-+---+----+            +-+---+----+----+----+----+
    000 | |   |    |         00 | |   |    |    |    |    |
        +-+---+----+            +-+---+----+----+----+----+
    001 | |   |    |         01 | |   |    |    |    |    |
        +-+---+----+            +-+---+----+----+----+----+
    010 | |   |    |         10 |1| 00|( 8)|( 9)|(10)|(11)|
        +-+---+----+            +-+---+----+----+----+----+
    011 |1|001|(11)|         11 | |   |    |    |    |    |
        +-+---+----+            +-+---+----+----+----+----+
    100 | |   |    |
        +-+---+----+
    101 | |   |    |
        +-+---+----+
    110 | |   |    |
        +-+---+----+
    111 | |   |    |
        +-+---+----+

Make a list of addresses in the proper format, indicating H for hit or
M for miss. The last line of each list:

    addr  tag ix             addr  tag ix word
     11   001 011  M          11    00 10  11   M

The last address, decimal 11, has its contents shown in the cache.
(You do not get points for this entry.)

submit cs411 HW9 your.file
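The hit and miss bookkeeping for a direct mapped cache can be sketched in Python as a way to check a hand simulation. This is a minimal sketch, assuming word addresses and a power-of-two geometry. The function name and the sample address stream are hypothetical and are not the homework sequence.

```python
def simulate_direct_mapped(addresses, num_blocks, words_per_block):
    """Return a list of (address, tag, index, 'H' or 'M') for word addresses."""
    tags = [None] * num_blocks       # None means the valid bit is 0
    trace = []
    for addr in addresses:
        block_number = addr // words_per_block   # drop the word-in-block bits
        index = block_number % num_blocks        # low bits select the cache block
        tag = block_number // num_blocks         # remaining high bits
        if tags[index] == tag:
            outcome = "H"
        else:
            outcome = "M"
            tags[index] = tag        # load the block, replacing the old one
        trace.append((addr, tag, index, outcome))
    return trace


# Hypothetical address stream: 4 blocks of 2 words each (an 8 word cache).
for addr, tag, index, outcome in simulate_direct_mapped(
        [0, 1, 8, 9, 0, 6, 7], num_blocks=4, words_per_block=2):
    print(f"{addr:06b}  tag={tag:b}  index={index:02b}  {outcome}")
```

Calling it with num_blocks=8, words_per_block=1 or num_blocks=4, words_per_block=4 gives the two geometries described above.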
Given three memory organizations below, for each, compute the time in
cycles to load the cache and compute the average memory latency.

      a)                 b)                          c)
   +-----+            +-----+                     +-----+
   | CPU |            | CPU |                     | CPU |
   +-----+            +-----+                     +-----+
      |                  |                           |
   +-----+       +---+---+---+---+                +-----+
   |cache|       |     cache     |                |cache|
   +-----+       +---+---+---+---+                +-----+
    /   \         /             \                  /   \
   | bus |       |      bus      |                | bus |
    \   /         \             /                  \   /
   +-----+       +---+---+---+---+      +---+ +---+ +---+ +---+
   |     |       |   |   |   |   |      |   | |   | |   | |   |  registers
   +-----+       +---+---+---+---+      +---+ +---+ +---+ +---+
   |     |       |               |      |   | |   | |   | |   |
   | mem |       |      mem      |      |m0 | |m1 | |m2 | |m3 |
   |     |       |               |      |   | |   | |   | |   |
   +-----+       +---+---+---+---+      +---+ +---+ +---+ +---+

 cache 16 words      cache 16 words          cache 16 words
 per block           per block               per block
 bus one word wide   bus four words wide     bus one word wide
 memory one word     memory four words       memory, four independent
 wide                wide                    one word wide memories

The bus requires one clock to pass data. Only one thing at a time can
be on the bus. The bus four words wide passes four words in one cycle.
The bus one word wide can pass only one word per cycle.

One cycle, the bus time, is required to send the address from the CPU
to the memory for all three memory organizations. The address will
always be on a 16 word boundary and the memories know they are to send
16 words (the cache block size is 16 words).

Every memory takes 5 cycles from the time an address is applied until
the data is fetched from the memory. During this time the address can
not be changed. The data fetched from memory then takes one cycle on
the bus to get into the cache. The next address can be applied to the
memory at the start of this bus transfer cycle. This is known as
fetching overlapped with bus transfer. It requires a register to hold
the bits the memory has fetched.

Not shown is the address incrementer inside the memory unit. For
 a) each address is one greater and 16 fetches occur.
 b) each address is four greater and 4 fetches occur.
 c) m0 gets addresses 0, 4,  8, 12, four fetches occur
    m1 gets addresses 1, 5,  9, 13, four fetches occur
    m2 gets addresses 2, 6, 10, 14, four fetches occur
    m3 gets addresses 3, 7, 11, 15, four fetches occur
    fetching is overlapped, concurrent, in m0, m1, m2, m3.

List each clock cycle (or range of clock cycles) and show what is
happening, or give a formula for this specific case that you derive
from looking at the clock cycles. "W0" stands for the word at the base
address. The quote marks have the meaning 'ditto', which means same as
above.

         a)               b)                  c)
  1  address on bus   address on bus      address on bus
  2  fetching W0      fetching W0-W3      m0 fetching W0
  3  fetching W0      fetching W0-W3       "  m1 fetching W1
  4  fetching W0      fetching W0-W3       "   "  m2 fetching W2
  5  fetching W0      fetching W0-W3       "   "   "  m3 fetching W3
  6  fetching W0      fetching W0-W3       "   "   "   "
  7  word 0 on bus    words 0-3 on bus    word 0 on bus
     fetching W1      fetching W4-W7      m0 fetching W4 plus  -   "   "   "
  8  fetching W1      fetching W4-W7      word 1 on bus
                                          m1 fetching W5 plus  "   -   "   "
  9  fetching W1      fetching W4-W7      word 2 on bus
                                          m2 fetching W6 plus  "   "   -   "
 10  fetching W1      fetching W4-W7      word 3 on bus
                                          m3 fetching W7 plus  "   "   "   -
 11  fetching W1      fetching W4-W7       "   "   "   "
 12  word 1 on bus    words 4-7 on bus    word 4 on bus
     etc. (please ignore the ditto marks if you don't understand them)
 ??  word 15 on bus   words 12-15 on bus  word 15 on bus

     ?? is the miss penalty, ??/16 is the average memory latency

You may write out the full sequence or figure out a formula that works
for this case. Include the formula if you use one.

For each memory organization give the total clock cycles to load the
cache. The last word of the cache must be loaded, thus count the last
bus cycle. This number is called the "miss penalty".

The miss penalty would be divided by 16, the cache block size, to get
the average increase in CPI for a cache miss, assuming instructions are
executed sequentially. The miss penalty divided by 16 is called the
average memory latency. Note that this is less than the 6 cycles for a
single memory fetch for cases b) and c).
Why can't we let the CPU execute an instruction when the first word is in the cache? Well, it might be the third word in the block that the CPU needs. Why can't we let the CPU execute an instruction when the word it needs is in the cache? Well, what if the CPU instruction used that word from the cache and computed a result that went into the last word in the cache block! The CPU would take 5 clocks to compute the value and put it into the cache but the cache may take 10 to 20 clocks before the last word is fetched from memory and put into the last word of the cache block, over-writing the computed value. So, the CPU pipeline is stalled while the cache is being loaded. This applies to part 3 of the project. Note that a "cycle" means a clock cycle. submit cs411 HW10 your.file
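One way to turn the cycle-by-cycle listing in this assignment into a formula is sketched below in Python. It is a sketch under stated assumptions, not the required answer format: it assumes one cycle to send the address, that each new fetch starts in the same cycle the previous result goes onto the bus, and that the bus needs no more cycles per fetch than the memory does. The function names and the example numbers (an 8 word block and a 3 cycle memory) are hypothetical.

```python
def miss_penalty(num_fetches, fetch_cycles, bus_cycles_per_fetch, address_cycles=1):
    """Cycles from sending the address until the last word is in the cache.

    num_fetches          fetches made by each memory (or each bank)
    fetch_cycles         cycles a memory needs for one fetch
    bus_cycles_per_fetch bus cycles needed to move what one round of
                         fetches produced (1 for a single memory, the
                         number of banks for interleaved one word memories)
    """
    if bus_cycles_per_fetch > fetch_cycles:
        raise ValueError("bus would be the bottleneck; this model does not apply")
    # Fetches run back to back; only the last bus transfer is not overlapped.
    return address_cycles + num_fetches * fetch_cycles + bus_cycles_per_fetch


def average_latency(penalty, block_words):
    """Miss penalty spread over the words of one cache block."""
    return penalty / block_words


# Hypothetical system: 8 word block, memory takes 3 cycles per fetch.
block = 8
for name, fetches, bus in (("one word wide", 8, 1),
                           ("two words wide", 4, 1),
                           ("two interleaved banks", 4, 2)):
    p = miss_penalty(fetches, fetch_cycles=3, bus_cycles_per_fetch=bus)
    print(f"{name}: penalty {p} cycles, average {average_latency(p, block):.2f}")
```

Writing out the first dozen cycles by hand, as the table does, is a good check that the overlap assumption in the formula matches the memory being modeled.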
See middle of Lecture 23 for computing size in bits.

1) Given a virtual memory system with:
      virtual address   37 bits
      physical address  32 bits
      32KB pages (15 bit page offset)
   Each page table entry has bits for valid, execute, read and dirty
   (4 bits total) and bits for a physical page number.

   a) How many bits in the page table? (do not answer in bytes!)
      Three digit accuracy is good enough.
      The exponent may be either a power of 2 or a power of 10.

   b) The virtual address is extended to 38 bits, all else stays the
      same. How many bits in the page table? (do not answer in bytes!)
      Three digit accuracy is good enough.
      The exponent may be either a power of 2 or a power of 10.

      Note: There will be a page table for every process that is
      running, yet the page tables are typically not completely
      allocated. Only the sections of the page table being used are
      typically populated.

   c) A fully associative TLB that has 32 blocks, 1 entry per block,
      is needed for a page table like a) with VA=36, PA=32, PO=15.
      The TLB must hold a page table entry and a tag in each block.
      How many bits in the TLB? (do not answer in bytes!)

   d) Draw a two way associative TLB that has 4 blocks, 8 total PPN's,
      for a page table like a) with virtual address 35 bits, physical
      address 32 bits, offset 15 bits, 4 bits V,E,R,D.
      See lecture 21 for a 4 way associative cache, you only use 2.
      The top will be the virtual address with the virtual page number
      and virtual page offset. The bottom will be the physical address
      with the physical page number and physical page offset.
      Show the detail of all fields, connections, mux, comparators.
      Label the width of all fields and signals.
      Refer to the textbook or class lecture notes for sample TLB's.
      If you send EMail, make sure it prints in 80 column fixed width
      font.
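The size calculations in question 1 can be sketched in Python as follows. This is a sketch of one common way to count the bits, assuming a flat, fully populated, single-level page table and a TLB whose tag is the whole virtual page number; the method in Lecture 23 may count fields differently (for example a separate TLB valid bit). The function names and the example system are hypothetical.

```python
def page_table_bits(va_bits, pa_bits, offset_bits, flag_bits):
    """Size in bits of a flat, fully populated, single-level page table."""
    entries = 2 ** (va_bits - offset_bits)             # one entry per virtual page
    entry_bits = flag_bits + (pa_bits - offset_bits)   # flags + physical page number
    return entries * entry_bits


def fully_associative_tlb_bits(va_bits, pa_bits, offset_bits, flag_bits, blocks):
    """Size in bits of a fully associative TLB with one entry per block.

    Assumes the tag is the whole virtual page number and that the page
    table entry's valid bit doubles as the TLB valid bit.
    """
    tag_bits = va_bits - offset_bits
    entry_bits = flag_bits + (pa_bits - offset_bits)
    return blocks * (tag_bits + entry_bits)


# Hypothetical system: 30 bit VA, 24 bit PA, 4KB pages (12 bit offset), 3 flag bits.
bits = page_table_bits(30, 24, 12, 3)
print(f"page table: {bits} bits = {bits:.3g} bits")
print("TLB:", fully_associative_tlb_bits(30, 24, 12, 3, blocks=16), "bits")
```

Adding one bit to the virtual address doubles the number of entries while leaving the entry width unchanged, which is why a flat table grows so quickly.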
2) Compute file read time for a 1MB file for a typical hard drive and
   a solid state drive, SSD, in milliseconds.

   a) Hard drive:
         published average seek time   3.0ms
         rotation speed                10,000rpm
         overhead                      2.0ms
         transfer rate                 80MB/s

   b) SSD:
         overhead                      1.5ms
         transfer rate                 80MB/s

   c) Speedup of the SSD drive.

submit cs411 HW11 your.file
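The file read time in question 2 is a sum of a few terms, sketched here in Python. The sketch assumes the usual textbook model: average rotational latency is half a rotation, and the file size and transfer rate use the same definition of MB. The function names and the drive parameters are hypothetical, not the ones given above.

```python
def hdd_read_ms(file_mb, seek_ms, rpm, overhead_ms, rate_mb_per_s):
    """Average time in ms to read one file from a hard drive."""
    rotation_ms = 0.5 * 60_000 / rpm              # half a rotation, on average
    transfer_ms = 1000 * file_mb / rate_mb_per_s
    return seek_ms + rotation_ms + overhead_ms + transfer_ms


def ssd_read_ms(file_mb, overhead_ms, rate_mb_per_s):
    """Time in ms to read one file from an SSD (no seek, no rotation)."""
    return overhead_ms + 1000 * file_mb / rate_mb_per_s


# Hypothetical drives reading a 2MB file.
hdd = hdd_read_ms(2, seek_ms=4.0, rpm=7200, overhead_ms=1.0, rate_mb_per_s=100)
ssd = ssd_read_ms(2, overhead_ms=0.5, rate_mb_per_s=200)
print(f"HDD {hdd:.2f} ms, SSD {ssd:.2f} ms, speedup {hdd / ssd:.2f}")
```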
Given a memory and an I/O device on a bus as shown below:

          64 bit wide bus, synchronous, 50MHz clock
==============================================================
            |                                 |
address and | two 32-bit data     address and | two 32-bit data
word count  | words sent          word count  | words received
received in | in one clock        sent in one | in one clock
one clock   |                     clock       |
            |                                 |
+-----+-----+-----+-----+               +-----+-----+
|     |     |     |     |Registers      | I/O device|
+-----+-----+-----+-----+               | receiving |
| RAM memory that is    |               | word-count|
| four 32-bit words wide|               | words of  |
| and has output        |               | data      |
| registers for four    |               +-----------+
| words                 |
+-----------------------+

The system operates by the I/O device sending an address and word count
to the memory. The memory uses the address to start a memory access
that brings four 32-bit words into the memory output registers.
In parallel, [the memory starts the access of the next four words] and
[sends two words on the bus, then two more words on the bus, then a bus
idle, then another bus idle]. Thus, the bus actions are overlapped with
the next memory access. Note that the bus may be used by other devices
between these four clock operations on the bus.

The memory access is non uniform. Upon receiving an address, the first
memory access takes longer than the following memory accesses. This
happens in some memories due to the extra time to charge word or block
select lines. Once started with an address and word count, the memory
puts data on the bus until the word count is satisfied.

For this exercise the first memory access requires 5 clocks and each
additional memory access in the same transaction requires 4 clocks.

A bus transaction starts with the sending of an address and word count.
The transaction ends when the last word and two idles are received by
the I/O device. The transaction time does not include the one clock to
send the address and word count.

Bandwidth is measured in megabytes per second.
The address and word count are not included in the byte count and not
included in the time.

The I/O device gets 512 words using 4 as the word count
(512 words is 2048 bytes):
   a) Compute the total time from after receipt of the address to the
      end of the last transaction.
   b) Compute the transactions per second.
   c) Compute the bandwidth.

The I/O device gets 512 words using 32 as the word count:
   d) Compute the total time from after receipt of the address to the
      end of the last transaction.
   e) Compute the transactions per second.
   f) Compute the bandwidth.

Show your work as formulas or as tables.
Be consistent. Use either clock counts or nanoseconds.
A 50MHz clock uses 20 nanoseconds per clock.
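The bookkeeping for one bus transaction can be sketched in Python. This is one reading of the timing rules, not an official solution: it assumes every 4 clock bus group (two data clocks plus two idles) fully overlaps the next memory access, so only the last group adds time, and it takes a megabyte as 10^6 bytes. The function names are hypothetical, the 5 and 4 clock defaults come from the exercise, and the example transfer (256 words, word count 8, 100MHz) is made up.

```python
def transaction_clocks(word_count, first_access=5, next_access=4,
                       words_per_access=4, bus_group_clocks=4):
    """Clocks for one transaction, not counting the address clock.

    Assumes each bus group (2 data clocks + 2 idles) overlaps the next
    memory access, so only the final group is added at the end.
    """
    accesses = -(-word_count // words_per_access)    # ceiling division
    return first_access + (accesses - 1) * next_access + bus_group_clocks


def transfer_stats(total_words, word_count, clock_hz, bytes_per_word=4):
    """Total clocks, transactions per second and MB/s for a whole transfer."""
    transactions = -(-total_words // word_count)
    clocks = transactions * transaction_clocks(word_count)
    seconds = clocks / clock_hz
    return {
        "clocks": clocks,
        "transactions_per_second": transactions / seconds,
        "megabytes_per_second": total_words * bytes_per_word / seconds / 1e6,
    }


# Hypothetical transfer: 256 words, word count 8, 100MHz clock.
stats = transfer_stats(256, 8, 100e6)
for key, value in stats.items():
    print(key, round(value, 2))
```

A larger word count spreads the longer first access over more words, which is the effect the two halves of the exercise compare.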
Covered in WEB: Read over course WEB pages. (some have been updated) Work all homeworks. (some similar problems on exam) Do project at least through part2b. (some questions on exam)
Last updated 8/17/2020