The most important item on all homework is YOUR NAME!
No name, no credit. ALSO, put last 4 digits of SS#.
Staple or clip pages together.
Homework must be submitted when due. You loose 10%, one grade, the first day homework is late. Then 10% each week thereafter. Max 50% off. A zero really hurts your average! Paper or EMail to squire@cs.umbc.edu is acceptable. If I can not read or understand your homework, you do not get credit. Type or print if your handwriting is bad. Homework is always due on a scheduled class day within 15 minutes after the start of the class. If class is canceled then homework is due the next time the class meets.
EMail only plain text! No word processor formats.
You may use a word processor or other software tools and
print the results and turn in paper.
Put CS611 and HW number in subject line.
The "submit" facility only works on the "irix.gl.umbc.edu"
and linux.gl.umbc.edu machines.
The student commands are:
submit cs611 HW6 file puts your "file" into cs611 HW6
submitrm cs611 HW6 file removes your "file" from cs611 HW6
submitls cs611 HW6 lists your files in cs611 HW6
Note: "HW" is upper case
a) you must have your userid registered for "submit"
send mail from a gl machine to squire if your submit fails
b) you have to be logged onto a gl machine, SSH or telnet are OK
c) everything is case sensitive, remember the uppercase HW.
You must show your work, not just the answer. Book Page 60, Exercise 1.2 Book Page 62, Exercise 1.7
You must show your work, not just the answer. Book Page 121, Exercise 2.11 Book Page 122, Exercise 2.12
You must show your work, not just the answer. Book Page 214, Exercise 3.1 Book Page 219, Exercise 3.12
1. Does a DLX sequence of instructions exists that must
stall in a "scoreboard" machine, Figure 4.3, Page 244,
yet the same sequence will not stall in a Tomasulo machine,
Figure 4.8, Page 253, ? (yes or no)
2. Does a DLX sequence of instructions exists that must
stall in a Tomasulo machine, Figure 4.8, Page 253,
yet the same sequence will not stall in a "scoreboard"
machine, Figure 4.3, Page 244, ? (yes or no)
3. Using the paper "Combining Branch Predictors" by Scott
McFarling, use the instruction trace given below and
update a Bimodal Predictor, Figure 1, for the case:
two PC bits, four entries in the count vector, two
bit counts as described in the paper.
Initialize all counts to 10 base 2 (different from paper)
a) show the four count values at the end of the trace.
b) keep a count of correct predictions as you update
the counts in order to give the percent predicted
correctly.
The following trace, as decimal integers, represent the
sequence of PC (low order bits) of conditional branches
and the letter T following the PC value indicates the
branch was taken.
1T, 1T, 1T, 1, 3T, 2, 1T, 1T, 1T, 1, 3, 2, 1T, 1T, 1T, 1,
0T, 0T, 0T, 0, 3T, 2, 1T, 1T, 1T, 1, 2, 3
(Sample output with dummy answers should look like:
0 00 <-- initially 10 in all cases
1 11
2 01
3 11
50%)
Draw the diagram and compute values for the cache system
described below. The diagram can be drawn free hand yet
needs to be neat enough to be read. Use similar level of detail
as was on the handout in class on caches. Show at least the
tag comparators and "and" gate with the valid bit. Show all
four rows of the L1 cache, and about 8 rows of the 65,536 rows
of the L2 cache. Use this diagram to hand simulate the caches
action when running the address sequence below.
L1 instruction cache for a DLX machine,
2-way associative, block size is 4 words (16 bytes),
index field in PC is two bits (e.g. 4 blocks long)
LRU (Least recently used) replacement policy.
Thus PC bits are +--------------+-------+---------+----------+
| tag | index | word | byte |
| | | select | select |
+--------------+-------+---------+----------+
bit number 31 6 5 4 3 2 1 0
Timing: the instruction is delivered in 1 ns for a hit,
a miss requires that a block be filled from the L2 cache.
(The 1 ns is still used by the L1 cache, even on a miss.)
L2 general cache, direct mapped, block size is 4 words (16 bytes)
index field is 16 bits (i.e. 65,536 blocks long)
Timing: four words are delivered to L1 in 8 ns for a hit,
a miss requires that a block be filled from RAM.
Assume the 8 ns includes time to get the address,
put the four words on the bus into the L1 cache and
raise the "L2 hit" signal. (The data from RAM flows
through the L2 cache on the way to the L1 cache,
thus the 8 ns is used by the L2 cache, even on a miss.)
RAM, 128 bit bus, (16 bytes) (4 words) delivered to L2 in 20 ns
Assume the 20 ns includes time to get the address, fetch
the data, put the data on the bus and raise the "data_ready"
signal for the L2 cache..
All "valid" bits are initially zero.
From the above, the first instruction takes 1 + 8 + 20 = 29 ns.
Other facts: The memory to L2 cache bus is 128 bits wide.
The L2 to L1 bus is 128 bits wide, thus no word
select multiplexer on the L2 cache.
Given the sequence of PC addresses below,
1) What is the total time to deliver all instructions. (ns)
2) What is the average time to deliver all instructions. (ns)
3) What is the L1 cache miss rate. (fraction)
4) What is the L2 cache miss rate looking at only the L2 cache.
(fraction)
5) Assume one clock per nanosecond (ns) What is the average CPI.
(xx.xx clocks per instruction)
6) Show hit or miss on each cache for each PC.
(Do this first, of course!)
L1 L2
PC: 00000000 <--- show H for hit, M for miss
00000010 blank for unused.
00000020
00000030 for each address for L1 and L2
00000040
00000004
00000008 * change, was 6
00000080
00000044
00000008
440001F4
110003B8 * change, 8 was 6
00000038
Write the VHDL code to perform an IEEE 754 Floating point add. You are given two 32-bit floating point numbers that are to be added to produce a third floating point number. Simplifications you may use include: Input numbers are normalized. No overflow or underflow or denormalization will occur. No rounding is necessary (either use truncation or round toward zero.) Use VHDL add, subtract, shift and other operators as needed. You do not have to go to the gate level. Use fp_add_test.vhdl as a start. Fill in the architecture behavior of fp_add to do the IEEE floating point add. Choose some reasonable test data for "a" and "b" in the test bench. The handout in class shows the commands needed on sunserver1.cs.umbc.edu A previous handout shows commands for linux.gl.umbc.edu but you will have to delete a few words to make VHDL-87 rather than VHDL-93 Look at VHDL help for more information. Compile and run. When reasonably correct, on gl.umbc.edu do a submit submit cs611 HW6 fp_add_test.vhdl
Closed book. Short answer, Numeric problems and some Multiple choice.
Numerical problems will be on CPI, Amdahl's Law, Pipelining,
Branch Prediction, Cache and IEEE Floating Point.
Exam covers book: 1.5, 1.6,
2.3, 2.8,
3.1-3.5, 3.7, 3.9,
4.2, 4.4
5.1-5.5
lectures: 1 through 14 excluding Introduction and VHDL
homework: 1 through 5
papers: McFarling "Combining Branch predictors"
Based on textbook and lecture answer the following:
Show your work.
Q1. Given a PCI bus running at 66MHz, 64 bits wide,
what is the maximum bandwidth in MHz?
(I could have asked for Mb/sec, same number, yet MB/sec is wrong!)
Q2. Given a Ultra SCSI 160MB/s controller and disk drive that
spins at 10,000 rpm and has an average seek of 6ms, 160MB/s
transfer rate, defragmented.
How long does it take, in seconds, to transfer 3.2MB where
each disk transfer is a 32KB block?
a) assuming 1/3 average seek time for first block and average
rotational delay for all blocks.
b) like a) but assuming a 4MB internal disk buffer so that only
the first block has a rotational delay penalty.
All numbers in decimal, K=1,000, M=1,000,000
Defragmented disks will typically only pay the seek penalty
on the first block, the internal disk buffer can prevent
any rotational delay penalty assuming there is room for
read-a-head. Assume ideal timing.
Q3. How many raw bytes must be stored to have one hour of
music played at 44.1 KHz with 16 bits coming from each of
two channels?
Q4. What bandwidth in MHz (one bit per clock) is needed to continuously
read 4.7GB of digital data from a 12X DVD in 10 minutes?
(ignore seek and rotational delay, G=1,000,000,000)
Given a queue/server model M/M/4
Given an average-arrival-rate 20 tasks per second
Q1. Given a single-server-utilization of 80%
a) What is the average-single-server-rate in tasks per second ?
b) What is the average-time-in-queue for a task ?
c) What is the average-time-in-system for a task ?
d) What is the average-tasks-in-queue ?
Q2. What is the maximum-tasks-in-queue ?
Possible short answers:
average-tasks-in-queue / single-server-utilization
about ten times the average-tasks-in-queue
unbounded
Q3. Given that we want the average-tasks-in-queue to be 10 tasks,
(Still M/M/4 and average-arrival-rate of 20 tasks per second)
What single-server-utilization is needed?
(Answer as a percentage within 2% gets full credit.)
Comments: None should be needed and these may be redundant,
yet, in order to prevent long lines asking questions:
M/M/4 technically stands for
First M means memory-less random distribution of arrivals
Second M means memory-less random distribution of service times
4 means the single queue is feeding four servers
For a server problem it is reasonable to assume an
exponential probability distribution for each M.
It is reasonable to assume the four servers equally
divide the workload of one server that is four times as fast.
It is reasonable to assume all servers have the same utilization.
The equations in the textbook on page 509 and 510 apply.
The equations given in class represent the same equations
that are in the textbook.
As always, do not plug numbers into randomly selected equations.
Try to understand what equations apply to the problem
and check if your answers are intuitively reasonable.
Not assigned in Fall 2000
Comprehensive, about 1/3 pre midterm, 2/3 post midterm
True/False, multiple choice, short answer
In range 25 to 50 questions.
The exam covers:
Lectures 3 - 13, 16 - 28
Homework 1 - 8
McFarling Paper through bimodal
IEEE 754 paper, floating point add, sub, mul, div
Textbook 1.5
3.2 - 3.7
4.1 , 4.2 , 4.4 , 4.8
5.2 - 5.7
6.2 - 6.5
7.2 , 7.5
8.2 - 8.6
A.3 - A.5
B.3
E.1
Last updated 12/12/00