CS411 Selected Lecture Notes

This is one big WEB page, used for printing

 These are not intended to be complete lecture notes.
 Complicated figures or tables or formulas are included here
 in case they were not clear or not copied correctly in class.
 Computer commands, directory names and file names are included.
 Specific help may be included here yet not presented in class.
 Source code may be included in line or by a link.

 Lecture numbers correspond to the syllabus numbering.

Contents

  • Lecture 1, Introduction, terminology
  • Lecture 2, Benchmarks
  • Lecture 3, Performance
  • Lecture 4, CPU Operation
  • Lecture 5, Instructions and Registers
  • Lecture 6, VHDL introduction
  • Lecture 7, Arithmetic
  • Lecture 8, ALU
  • Lecture 9, Multiply
  • Lecture 10, Divide
  • Lecture 11, Floating Point
  • Lecture 12, VHDL - circuits and debugging
  • Lecture 13, Microprogramming - Review
  • Lecture 14, mid-term exam
  • Lecture 15, Control Unit
  • Lecture 16, Pipelining 1
  • Lecture 17, Pipelining 2
  • Lecture 18, Project outline and VHDL
  • Lecture 19, Pipelining Data Forwarding
  • Lecture 20, Hazards and Stalls
  • Lecture 21, Cache
  • Lecture 22, Cache Performance
  • Lecture 23, Virtual Memory 1
  • Lecture 24, Virtual Memory 2
  • Lecture 25, I/O types and performance
  • Lecture 26, DVR, DVD-RW, CDR, CD-RW
  • Lecture 27, Busses, I/O-processor connection
  • Lecture 28, Multiprocessors
  • Lecture 29, Review
  • Lecture 30, Final Exam
  • Other Links
    Lecture 1, Introduction, terminology

    Introduction: Hello, my name is Jon Squire and I have been
    programming computers since 1959. I have served my time in corporate
    management for 25 years. This course covers a little history of
    computer architecture through some of the latest advances, plus
    practical information you may use in buying, upgrading or building
    your own computer. After this course, you can say that you have
    performed "modeling and simulation", possibly a valuable asset in
    finding a job. You will be skilled in converting graphical and
    schematic information to textual information, and the reverse.

    Some Brief History: The ISA card slots were replaced by PCI card
    slots, which are being replaced by external USB devices. The serial
    port for RS232 devices is replaced by the USB port. Floppy disks are
    disappearing, along with that connector on the motherboard. RAM
    still uses DIMMs, and the slots have grown to handle 4, 8 and 16
    gigabytes of memory. ATA hard drives are replaced by SATA hard
    drives, 5 TB and more available. Some rotating hard drives are being
    replaced by SSD, solid state drives. The printer port will be going,
    as will the AGP graphics connector. HDMI and now DP. That expensive
    graphics card you bought will probably not work in your new
    computer. I have been saving architecture news.

    Overview: This course will present detailed information on the
    internal working of the CPU, cache, memory, busses and peripheral
    devices such as disk drives and DVDs. The course's five-part project
    will have each student simulate a small computer using the VHDL
    digital simulation language, either Cadence VHDL or free GHDL.

    Read the syllabus. All of the lectures are covered in these WEB
    pages. Lecture notes are often updated. (You may ask questions.)
    And, sometimes corrected after questions. Some information is still
    presented on the blackboard/whiteboard. Check UMBC "Blackboard" for
    announcements and grades.

    The Top 500 multiprocessor systems are evaluated about every six
    months.
    These are not your typical home computers. The Top 10 are shown at
    www.top500.org/lists/2020/06  As many as 10 million cores! (How many
    in your computer?) More Lecture 1, pdf format

    The free market system, and the resulting competition, provide
    better and more economical products to consumers. Expect flip-flops
    between vendors for the best or most economical products. A standard
    engineering statement is: Fast, Cheap, Reliable - pick any two.
    Monopolies: Ford Motor Company, Standard Oil of New Jersey, IBM,
    AT&T, ... Microsoft.

    Computer Architecture Development:
      System Architecture
      Logic Design
      Circuit Design
      Device Physics

    For the inverter above, a chip cross section is: N type and P type
    impurities are diffused into the silicon substrate through a mask,
    typically in a high temperature vacuum process.

    Oh! Oh! It is now predicted that Moore's Law, "the gate width of
    transistors will halve every 18 months," will end in 2021. Prior
    estimates said 2028. Never fear, monolithic 3D is here.

    Mask Making and Processing: The black would be a metalization mask,
    here showing the transistor input connection. Other masks are for
    P+, N+, N well and via (the etch through the SiO2 to allow
    electrical connection to metal). The large round wafer, after
    processing with all the masks, is broken up into many rectangular
    dies. Each die is placed in a package and the input and output pads
    on the die are connected to the pins on the package. The die in the
    package is called a chip or IC chip or Integrated Circuit Chip.

    "Feature size" is the smallest dimension of metal width, gate width,
    metal spacing, etc. The coming 12 nanometers is 0.000 000 012 meter,
    or less than 1 millionth of an inch. This gets smaller every year
    or so.

    BYOC

    Build your own computer

    I have built several computers, buying a case, motherboard, CPU,
    RAM, drives, video and audio. My older desktop, an AMD FX 8-core,
    is a Cybertron G1244A with 16 GB RAM and a 1/2 TB SSD. (Replacing
    my old 12-core AMD that is acting up.) Now a new Dell Precision
    7920 Tower with 16 cores. You want DDR3, SATA3, SSD; we will cover
    these in future lectures. Look at Homework 1, it is assigned today.

    Lecture 2, Benchmarks

    
    

    The best method of measuring a computer's performance is to use
    benchmarks. Some suggestions from my personal experience preparing
    a benchmark suite, several updates, and personal benchmark
    experience are presented in pdf format. Lecture 2

    Smaller time is better, higher clock frequency is better.

      time = 1 / frequency      T = 1/F  and  F = 1/T
      1 picosecond  = 1 / 1 THz
      1 nanosecond  = 1 / 1 GHz
      1 microsecond = 1 / 1 MHz
      1 millisecond = 1 / 1 KHz

      kilohertz KHz = 10^3  cycles per second clock
      megahertz MHz = 10^6  cycles per second clock
      gigahertz GHz = 10^9  cycles per second clock
      terahertz THz = 10^12 cycles per second clock

    Definitions:
      CPI    Clocks Per Instruction
      MHz    Megahertz, millions of cycles per second
      MIPS   Millions of Instructions Per Second = MHz / CPI
      MOPS   Millions of Operations Per Second
      MFLOPS Millions of Floating point Operations Per Second
      MIOPS  Millions of Integer Operations Per Second

    Do not trust your computer's clock or the software that reads and
    processes the time.

    First: Test the wall clock time against your watch.
      time_test.c  time_test.java  time_test.py  time_test.f90
    The program displays 0, 5, 10, 15 ... at 0 seconds, 5 seconds,
    10 seconds etc.

    demonstrate time_test if possible

    Note the use of <time.h> and 'time()'. Beware, midnight is zero
    seconds. Then 60 sec/min * 60 min/hr * 24 hr/day = 86,400 sec/day,
    so just before midnight is 86,399 seconds. Running a benchmark
    across midnight may give a negative time.

    Then: Test CPU time. This should be just the time used by the
    program that is running. With only this program running, checking
    against your watch should work.
      time_cpu.c  time_cpu.java  time_cpu.py
    The program displays 0, 5, 10, 15 ... at 0 seconds, 5 seconds,
    10 seconds etc. Note the use of <time.h> and
    '(double)clock()/(double)CLOCKS_PER_SEC'. I have found one machine
    with the constant CLOCKS_PER_SECOND completely wrong and another
    machine with a value 64 that should have been 100. A computer used
    for real time applications could have a value of 1,000,000 or more.

    More graphs of FFT benchmarks. The source code, C language, for the
    FFT benchmarks: Note the check run to be sure the code works. Note
    the non-uniform data to avoid special cases.
      fft_time.c  main program
      fftc.h      header file
    FFT and inverse FFT for various numbers of complex data points.
    The same source code was used for all benchmark measurements. These
    were optimized for embedded computer use where all constants were
    burned into ROM.
      fft16.c    ifft16.c    fft32.c    ifft32.c
      fft64.c    ifft64.c    fft128.c   ifft128.c
      fft256.c   ifft256.c   fft512.c   ifft512.c
      fft1024.c  ifft1024.c  fft2048.c  ifft2048.c
      fft4096.c  ifft4096.c
    Some of the result files:
      P1-166MHz    P1-166MHz -O2    P2-266MHz    P2-266MHz -O2
      Celeron-500MHz    P3-450MHz MS    P3-450MHz Linux
      PPC-2.2GHz   PPC-2.5GHz   P4-2.53GHz XP   Alpha-533MHz XP
      Xeon-2.8GHz  Athlon-1.4GHz MS  Athlon-1.4GHz XP
      Athlon-1.4GHz SuSe   Laptop Win7   Laptop Ubuntu

    What if you are benchmarking a multiprocessor?
    For example, on a two core or quad core machine, use both CPU time
    and wall time to get the average processor loading:
      time_mp2.c   for two cores
      time_mp4.c   for quad cores
      time_mp8.c   for two quad cores
      time_mp12.c  for two six cores
    The output from two cores:     time_mp2_c.out  for two core Xeon
    The output from four cores:    time_mp4_c.out  for Mac quad G5
    The output from eight cores:   time_mp8_c.out  for AMD 12-core
    The output from twelve cores:  time_mp12_c.out for AMD 12-core
    End of the time_mp12_c.out file:
      total CPU time is 342.970000 seconds
      wall time is 29.000000 seconds
      average number of processors used = 11.826552
      time_mp12.c exiting
    Similar tests in Java:
      time_test.java   time_cpu.java
      time_mp4.java    for quad cores
      time_mp8.java    for eight cores
      time_mp8.java    for eight and twelve cores
      time_mp4_java.out      for quad Xeon G5
      time_mp8_java.out      for 8 thread Xeon G5
      time_mp8_java_fx.out   for 8 core AMD FX
      time_mp12_java.out     for 8 thread Xeon G5
      time_mp12_12_java.out  for 12 core AMD
      matmul_thread4.java    matmul_thread4_java.out
    Time_test and threads in Python:
      time_test.py   time_cpu.py
      parallel_matmul.py   parallel_matmul_py.out

    OK, since these were old and I did not want to change them, they
    give some indications of performance on various machines with
    various operating systems and compiler options.

    To measure very short times, a higher quality, double-difference
    method is needed. The following program measures the time to do a
    double precision floating point add. This may be a time smaller
    than 1 ns, 10^-9 seconds. A test harness is needed to calibrate the
    loops and make sure dead code elimination can not be used by the
    compiler. Then the item to be tested is placed in a copy of the
    test harness to make the measurement. The time of the test harness
    is the stop minus start time in seconds. The time for the
    measurement is the stop minus start time in seconds. The
    difference, thus "double difference", between the harness and the
    measurement is the time for the item being measured.
    Here A = A + B, with B not known to be a constant by the compiler,
    is reasonably expected to be a single instruction to add B to a
    register. If not, we have timed the full statement. The double
    difference time must be divided by the total number of iterations
    from the nested loops to get the time for the computer to execute
    the item once. An attempt is made to get a very stable time
    measurement. Doubling the number of iterations should double the
    time.

    Summary of double difference:
      t1 saved
      run test harness
      t2 saved
      t3 saved
      run measurement, test harness with item to be timed
      t4 saved
      tdiff  = (t4-t3) - (t2-t1)
      t_item = tdiff / number of iterations
      check against previous time, if not close, double iterations

    The source code is: time_fadd.c
      fadd on P4 2.53GHz   fadd on Xeon 2.66GHz   fadd on Mac 2.5GHz
    End of Mac output:
      time_fadd.c ...
      rep=16384, t measured=0.814363
      rep=32768, t measured=1.62344
      rep=65536, t measured=3.28666
      tmeas=3.28666, t_prev=0, rep=65536
      rep=65536, t measured=3.28829
      tmeas=3.28829, t_prev=3.28666, rep=65536
      time measured=3.28829, under minimum
      raw time=3.28829, fadd time=5.01629e-10, rep=65536, stable=0.000497342

    Some extra information for students wanting to explore their
    computer: Windows OS, Linux OS

    What is in my computer?

    Windows:  start, control panel, system, device manager, processor, etc.
    Linux:    cd /proc
              cat cpuinfo

    What processes are running in my computer?

    Windows:  ctrl-alt-del, process tab
    Linux:    ps -el
              top

    How do I easily time a program?

    At a command prompt:
      time prog < input > output
      time prog < input > output
      time

    The time available through normal software calls may be updated
    anywhere from less than 30 times per second to more than a million
    times per second. A general rule of thumb is to have the time being
    measured be 10 seconds or more. This will give a reasonably
    accurate time measurement on all computers. Just repeat what is
    being measured if it does not run 10 seconds.

    Some history about computer time reporting. There were time sharing
    systems where you bought time on the computer by the CPU second.
    There is the CPU time your program requires, usually called your
    process time. There is also operating system CPU time. When there
    are multiple processes running, the operating system time slices,
    running each job for a short time called a quanta. The operating
    system must manage memory, devices, scheduling and related tasks.
    In the past we had to keep a very close eye on how CPU time was
    charged to the user's process versus the system's processes, and
    whether "dead time", the idle process, was charged to either. From
    a user's point of view, the user did not request to be swapped out,
    thus the user does not want any of the operating system time for
    stopping and restarting the user's process to be charged to the
    user.

    Another historic tidbit: some Unix systems would add one
    microsecond to the time reported on each system request for the
    time, never allowing the same time to be reported twice even if
    the clock had not updated. This was to ensure that all disk file
    times were unique and thus programs such as 'make' would be
    reliable.

    For more recent SPEC benchmarks, 2006 is the suite date, with runs
    in 2015, 2016, 2017, 2018, 2019, 2020. See the CPU integer
    benchmarks, SPECint, and floating point benchmarks, SPECfp, at
    www.spec.org/cpu2006/Docs/

    Sometimes you just have to buy the top of the line and forget
    benchmarks. Now find a display with 2,560 by 2,048 resolution!
    (other than the NASA display) Newegg has an Acer 22 inch HDMI
    1920 by 1080 for under $100 in 2013. HDMI replaces the VGA
    connection from computer to display.

    Lecture 3, Performance

    Repeating some definitions:

      CPI    Clocks Per Instruction
      MHz    megahertz, millions of cycles per second
      MIPS   Millions of Instructions Per Second = MHz / CPI
      MOPS   Millions of Operations Per Second
      MFLOPS Millions of Floating point Operations Per Second
      MIOPS  Millions of Integer Operations Per Second
      (Classical, old, terms. Today would be billions.)

    Amdahl's Law (many forms, understand the concept)

                 the part of time improved
      new time = -------------------------  +  the part of time not improved
                    factor improved by

      old time = the part of time improved + the part of time not improved

                old time
      speedup = --------    (always bigger over smaller when faster)
                new time

    Given: on some program, the CPU takes 9 sec and the disk I/O takes
           1 sec. What is the speedup using a CPU 9 times faster?

                        9 sec
    Answer:  new time = -----  +  1 sec  =  2 sec
                          9
             old time = 9 + 1 = 10 sec
             speedup  = 10 / 2 = 5    a pure number
    --------------------------------------------------------------------

    Amdahl's Law (many forms, understand the concept)

                new performance
      speedup = ---------------
                old performance

    Given: Performance of M1 is 100 MFLOPS and 200 MIOPS.
           Performance of M2 is  50 MFLOPS and 250 MIOPS.
           On a program using 10% floating point and 90% integer,
           which is faster? What is the speedup?

    Answer:  .1 * 100 + .9 * 200 = 190 MOPS
             .1 * 50  + .9 * 250 = 230 MOPS   (M2 is faster)
             speedup = 230/190 = 1.21
    --------------------------------------------------------------------

                              old performance
      new performance = ---------------------------------------------------
                        fraction of old improved
                        ------------------------  +  fraction of old unimproved
                          improvement factor

    Given: half of a 100 MIPS machine is speeded up by a factor of 3.
           What is the speedup relative to the original machine?
                               1                      1
    Answer:  new performance = ---------- * 100 MIPS = ---- * 100 MIPS = 150 MIPS
                               0.5                    .666
                               --- + 0.5
                                3

             speedup = 150 / 100 = 1.5

                                            1
             (same as  ------------------------------------------- )
                        fraction improved
                       -------------------  +  fraction unimproved
                       improvement factor

    speedup is a pure number, no units. The units must cancel.
    --------------------------------------------------------------------

    SPEC Benchmarks

    The benchmarks change infrequently, for example 2006 - 2016 same.
    The speed seems to increase every year.
      SPEC Int2006,  9 in C, 3 in C++
      SPEC Flt2006, 17 in assorted Fortran, C, C++
      SPEC many rules to follow
      recent int results
      recent flt results
    Note the number of cores available; results seem to be using just
    one core.
    --------------------------------------------------------------------

    CPI  is average Clocks Per Instruction.            units: clock/inst
    MHz  is frequency, we use millions of clocks per second.
                                                       units: clock/sec
    MIPS is millions of instructions per second.       units: inst/sec

    Note: MIPS = MHz / CPI  because
          (clock/sec) / (clock/inst) = 10^6 inst/sec

    ( 5/4 of people do not understand fractions. )

    Computing average CPI, Clocks Per Instruction

      -----given-----     -------compute-------
      type clocks %use
      RR     3     25%     3 * 25 =  75
      RM     4     50%     4 * 50 = 200
      MM     5     25%     5 * 25 = 125
                  ____               ___
                  100%               400  sum

      400/100 = 4  average CPI

      -----given----------------     -------compute-------
      type clocks instructions
      RR     3      25,000           3 * 25,000 =  75,000
      RM     4      50,000           4 * 50,000 = 200,000
      MM     5      25,000           5 * 25,000 = 125,000
                   _______                        _______
                   100,000                        400,000  sum

      400,000/100,000 = 4  average CPI

    Find the faster sequence of instructions, Prog1 vs Prog2

      -----given-----
      type clocks
      A      1
      B      2
      C      3

      instruction counts for   A  B  C
      Prog1                    2  1  2
      Prog2                    4  1  1

      -------compute-------
      Prog1   A  1  2    1 * 2 = 2
              B  2  1    2 * 1 = 2
              C  3  2    3 * 2 = 6
                                __
                         sum    10 clocks

      Prog2   A  1  4    1 * 4 = 4
              B  2  1    2 * 1 = 2
              C  3  1    3 * 1 = 3
                                __
                         sum     9 clocks   more instructions yet faster

      speedup = 10 clocks / 9 clocks = 1.111   a number (no units)

    cs411_opcodes.txt, different from Computer Organization and Design,
    1/8/2020

    rd is register destination, the result, general register 1 through 31
    rs is the first register, A, source, general register 0 through 31
    rt is the second register, B, source, general register 0 through 31
    --val---- generally a 16 bit number that gets sign extended
    --adr---- a 16 bit address, gets sign extended and added to (rx)
    "i" is generally immediate, operand value is in the instruction

    Opcode Operands     Machine code format
                        6  5  5  5  5   6    number of bits in field
    nop             RR  00 0  0  0  0   00
    add  rd,rs,rt   RR  00 rs rt rd 0   32
    sub  rd,rs,rt   RR  00 rs rt rd 0   34
    mul  rd,rs,rt   RR  00 rs rt rd 0   27
    div  rd,rs,rt   RR  00 rs rt rd 0   24
    and  rd,rs,rt   RR  00 rs rt rd 0   13
    or   rd,rs,rt   RR  00 rs rt rd 0   15
    srl  rd,rt,shf  RR  00 0  rt rd shf 03
    sll  rd,rt,shf  RR  00 0  rt rd shf 02
    cmpl rd,rt      RR  00 0  rt rd 0   11
    j    jadr       J   02 ------jadr--------
    lwim rd,rs,val  M   15 rs rd ---val----
    addi rd,rs,val  M   12 rs rd ---val----
    beq  rs,rt,adr  M   29 rs rt ---adr----
    lw   rd,adr(rx) M   35 rx rd ---adr----
    sw   rt,adr(rx) M   43 rx rt ---adr----

    instruction bits (binary of 6 5 5 5 5 6 format above)

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
    1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
               |         |         |         |         |
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  nop
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 1 0 0 0 0 0   add  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 1 0 0 0 1 0   sub  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 1 1 0 1 1   mul  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 1 1 0 0 0   div  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 0 1 1 0 1   and  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 0 1 1 1 1   or   r,a,b
    0 0 0 0 0 0 0 0 0 0 0 b b b b b r r r r r s s s s s 0 0 0 0 1 1   srl  r,b,s
    0 0 0 0 0 0 0 0 0 0 0 b b b b b r r r r r s s s s s 0 0 0 0 1 0   sll  r,b,s
    0 0 0 0 0 0 0 0 0 0 0 b b b b b r r r r r -ignored- 0 0 1 0 1 1   cmpl r,b
    0 0 0 0 1 0 -----address to bits (27:2) of PC------------------   j    adr
    0 0 1 1 1 1 x x x x x r r r r r ---2's complement value--------   lwim r,val(x)
    0 0 1 1 0 0 x x x x x r r r r r ---2's complement value--------   addi r,val(x)
    0 1 1 1 0 1 a a a a a b b b b b ---2's complement address------   beq  a,b,adr
    1 0 0 0 1 1 x x x x x r r r r r ---2's complement address------   lw   r,adr(x)
    1 0 1 0 1 1 x x x x x b b b b b ---2's complement address------   sw   b,adr(x)

    Definitions:

    nop            no operation, no programmer visible registers or
                   memory are changed, except PC gets PC+4
    j    adr       bits 0 through 25 of the instruction are inserted
                   into PC(27:2); probably should zero bits PC(1:0),
                   but they should be zero already
    lw   r,adr(x)  load word into register r from memory location
                   (register x plus sign extended adr field)
    sw   b,adr(x)  store word from register b into memory location
                   (register x plus sign extended adr field)
    beq  a,b,adr   branch on equal: if the contents of register a are
                   equal to the contents of register b, add the,
                   shifted by two, sign extended adr to the PC
                   (The PC will have 4 added by then)
    lwim r,val(x)  load immediate, the contents of register x is added
                   to the sign extended value and the result put into
                   register r
    addi r,val(x)  add immediate, the contents of register x is added
                   to the sign extended value and the result added to
                   register r
    add  r,a,b     add register a to register b, result into register r
    sub  r,a,b     subtract register b from register a, result into register r
    mul  r,a,b     multiply register a by register b, result into register r
    div  r,a,b     divide register a by register b, result into register r
    and  r,a,b     and register a with register b, result into register r
    or   r,a,b     or register a with register b, result into register r
    srl  r,b,s     shift the contents of register b by s places right,
                   result into register r
    sll  r,b,s     shift the contents of register b by s places left,
                   result into register r
    cmpl r,b       one's complement of register b goes into register r

    Also: no instructions are to have side effects or additional
    "features".

    General register list (applies to MIPS ISA and project)
    (note: project op codes may differ from MIPS/SGI)

    Register  notes
     0  $0    zero value, not writable
     1  $1
     2  $2    $v0  return values (convention, not constrained by hardware)
     3  $3    $v1
     4  $4    $a0  arguments (convention, not constrained by hardware)
     5  $5    $a1
     6  $6    $a2
     7  $7    $a3
     8  $8    $t0  temporaries (not saved by software convention over calls)
     9  $9    $t1
    10  $10   $t2
    11  $11   $t3
    12  $12   $t4
    13  $13   $t5
    14  $14   $t6
    15  $15   $t7
    16  $16   $s0  saved by software convention over calls
    17  $17   $s1
    18  $18   $s2
    19  $19   $s3
    20  $20   $s4
    21  $21   $s5
    22  $22   $s6
    23  $23   $s7
    24  $24   $t8  more temporaries
    25  $25   $t9
    26  $26
    27  $27
    28  $28   $gp  global pointer (more designations by software convention)
    29  $29   $sp  stack pointer
    30  $30   $fp  frame pointer
    31  $31   $ra  return address

    Remember: From a hardware view, registers 1 through 31 are general
    purpose and identical. The above table is just software
    conventions. Register zero is always zero!
    Basic digital logic
    IA-64 Itanium
    We will cover multicore and parallel processors later. Amdahl's Law
    applies to them also.
    HW2 assignment

    Lecture 4, CPU Operation

    
    We now look at instructions in memory, how they got there and
    how they execute:
    
    1. Start by using an editor to enter compiler language statements.
       The editor writes your source code to a disk file.
    
    2. A compiler reads the source code disk file and produces
       assembly language instructions for a specific ISA that
       will perform your compiler language statements. The assembly
       language is written to a disk file.
    
    3. An assembler reads the assembly language disk file and produces
       a relocatable binary version of your program and writes it to
       a disk file. This may be a main program or just a function or
       subroutine. Typical file name extension is  .o  or  .obj
    
    4. A linkage editor or binder or loader combines the relocatable
       binary files into an executable file. Addresses are relocated
       and typically all instructions are put sequentially in a code
       segment, all constant data in another segment, variables and
       arrays in another segment and possibly making other segments.
       The addresses in all executable files for a specific computer
       start at the same address. These are virtual addresses and the
       operating system will place the segments into RAM at other
       real memory addresses. Windows file extension  .exe
    
    5. A program is executed by having the operating system load the
       executable file into RAM and set the program counter to the
       address of the first instruction that is to be executed in
       the program. All programs might have the same starting address,
       yet the operating system has set up the TLB to translate the
       virtual instruction and data addresses to physical memory addresses.
       The physical addresses are not available to the program or to a
       debugger. This is part of the security an operating system
       provides to prevent one person's program from affecting another
       person's program.
    
    A simple example:
    
      Compiler input        int a, b=4, c=7; 
                            a = b + c;
    
      Assembly language fragment (not unique)
               lw	   $2,12($fp)	  b at 12 offset from frame pointer
    	   lw	   $3,16($fp)	  c at 16 offset from frame pointer
    	   add	   $2,$2,$3	  R format instruction
    	   sw	   $2,8($fp)	  a at 8  offset from frame pointer
    
      Memory addresses in bytes, integer typically 4 bytes, 32 bits.
    
      Loaded in machine
        virtual address   content 32-bits  8-hexadecimal digits
    
        00000000	      8FC2000C  lw $2,12($fp)
        00000004	      8FC30010  lw $3,16($fp)
        00000008	      00000000  nop inserted for pipeline
        0000000C	      00431020  add $2,$2,$3
    00000010	      AFC20008  sw  $2,8($fp)
    
        $fp has 10000000  (data frame)
        10000000          00000000
        10000004          00000001  
        10000008          00000000?  a  after execution
        1000000C          00000004   b
        10000010          00000007   c
    
    
      Instruction field format for  add $2,$2,$3
        0000 0000 0100 0011 0001 0000 0010 0000  binary for 00431020 hex
    
        vvvv vvss ssst tttt dddd dhhh hhvv vvvv  6,5,5,5,5,6 bit fields
           0   |  2  |   3  |  2  | 0   |  32    decimal values of fields
    
    
      Instruction field format for  lw $2,12($fp)     $fp is register 30
        1000 1111 1100 0010 0000 0000 0000 1100  binary for 8FC2000C hex
    
        vvvv vvxx xxxd dddd aaaa aaaa aaaa aaaa  6,5,5,16 bit fields
          35   | 30  |   2  |        12          decimal values of fields
    
    
    
    The person writing the assembler chose the format of an assembly
    language line. The person designing the ISA chose the format of
    the instruction. Why would you expect them to be in the same order?
    
    
    
    
    A very simplified data flow of the add instruction. From the
    registers to the ALU and back to the registers.
    
    
    
    The VHDL to use the ALU will be given to you as:
    
      ALU: entity WORK.alu_32 port map(inA    => EX_A,
                                       inB    => EX_aluB,
                                       inst   => EX_IR,
                                       result => EX_result);
    
    We will call the upper input "A" and the lower input "B"
    and the output "result".
    The extra input, EX_IR, not shown on the diagram above
    is the instruction the ALU is to perform, add, sub, etc.
    
    
    The instructions we will use in this course are specifically:
    
      cs411_opcodes.txt
    
    Each student needs to understand what the instructions are
    and the use of each field in each instruction.
    (Note: a few have bit patterns different from the book and
     different from previous semesters in order to prevent copying.)
    
    Our MIPS architecture computer uses five clocks to execute
    a load word instruction.
    
     1 0 0 0 1 1 x x x x x r r r r r ---2's complement address------ lw r,adr(x)
    
      1. Fetch the instruction from memory
      2. Decode the instruction and read the value of register xxxxx
      3. Compute the memory address by adding the sign extended bottom
         16 bits of the instruction to the contents of register xxxxx.
      4. Fetch the data word from the memory address.
      5. Write the data word from memory into the register rrrrr.
    
    
    
    When we cover "pipelining" you will see why five clocks are
    used for every instruction, even though some instructions
    need less than five.
    
    
    Computer languages come in many varieties.
    The information above applies to languages such
    as C, C++, Fortran, Ada and others.
    
    Many languages abstract the concept of binary relocatable
    code, in what was originally called "crunch code".
    These languages use their own form of intermediate files.
    For example Pascal, Java, Python and others.
    
    Other languages directly interpret the user's source
    files, possibly with some preprocessing.
    For example SML, Haskell, Lisp, MatLab, Mathematica and
    others.
    
    With a completely new computer architecture, the first
    "language" would be an assembly language. From this,
    a primitive operating system would be built. Then,
    typically an existing C compiler would be modified
    for the new computer architecture. An alternative is
    to build a cross compiler from C and gas, to
    bootstrap existing code to the new architecture.
    From then on, "reuse" goes into full effect and
    millions of lines of existing software can be
    running on the new computer architecture. 
    
    
    For Homework 3
      The computer irix.gl.umbc.edu  is no longer available.
      This was a MIPS architecture using the same instructions
      as we are using. The MIPS architecture is studied because
      it is a much simpler and easier to understand architecture
      than the Intel X86, IA-32.
    
      Thus, to see the instructions in RAM, we will use the  gdb
      debugger on an Intel X86.
    
    HW3 information
    
    The information in hex.out will have lines similar to:
    
    
    (gdb) disassemble
    Dump of assembler code for function main:
    
     RAM addr    offset    op code  address and register
    0x08048384 <main+0>:	lea    0x4(%esp),%ecx
    0x08048388 <main+4>:	and    $0xfffffff0,%esp
    0x0804838b <main+7>:	pushl  0xfffffffc(%ecx)
    
    End of assembler dump.
    (gdb) x/60x main
                         Note: 16 bytes per line, 4  32-bit words
                         but, these are X86 instructions, not MIPS !
    0x8048384 <main>:    0x04244c8d 0xfff0e483 0x8955fc71 0x535657e5
    0x8048394 <main+16>: 0x58ec8351 0x4589e089 0xe445c7cc 0x00000064
                                 ##                               ##
                    <main+19>----|                    <main+31>---|
                    0x8048397                         0x80483A3
    
    Because the MIPS architecture we are studying is a big endian
    machine, we will count bytes from left to right for homework 3.
    
    In hexadecimal, 0x12345678 is stored big end first     12
                                                           34
                                                           56
                                                           78
    
    Little endian   0x12345678 is stored little end first  78
                                                           56
                                                           34
                                                           12
    Each byte, 8 bits, is two hex digits.
    
    

    Lecture 5, Instructions and Registers

    Get paper handout, fill in values for registers and memory
    as we discuss the instructions in this lecture.
    The program starts with PC set to address zero.
    The instructions are defined in cs411_opcodes.txt
    
    part1.asm
    part1.abs
    
    part1.abs
    address  instruction    assembly language
    
    00000000 8C010074 	lw   $1,w1($0)
    00000004 8C020078 	lw   $2,w2($0)
    00000008 8C03007C 	lw   $3,w3($0)
    0000000C 00000000 	nop
    00000010 00000000 	nop
    00000014 00232020 	add  $4,$1,$3
    00000018 00222822 	sub  $5,$1,$2
    0000001C 000133C2 	sll  $6,$1,15
    00000020 00023C03 	srl  $7,$2,16
    00000024 0003400B 	cmpl $8,$3
    00000028 0022480D 	or   $9,$1,$2
    0000002C 0023500F 	and  $10,$1,$3
    00000030 00435818 	div  $11,$2,$3
    00000034 0062601B 	mul  $12,$3,$2
    00000038 AC010080 	sw   $1,w4($0)
    0000003C 300D0074 	addi $13,w1
    00000040 00000000 	nop
    00000044 00000000 	nop
    00000048 8DAE0004 	lw   $14,4($13)
    0000004C 31AF0008 	addi $15,8($13)
    00000050 3C100010 	lwim $16,16
    00000054 00000000 	nop
    00000058 00000000 	nop
    0000005C ADE30008 	sw   $3,8($15) 
    00000060 00000000 	nop
    00000064 00000000 	nop
    00000068 00000000 	nop
    0000006C 00000000 	nop
    00000070 00000000 	nop
    00000074 11111111 w1:	word 0x11111111
    00000078 22222222 w2:	word 0x22222222
    0000007C 33333333 w3:	word 0x33333333
    00000080 44444444 w4:	word 0x44444444
    
    
    
    After the CPU has executed the first instruction:
    General Registers                          RAM memory
                                                initial    final
     $0   00000000
          --------
     $1   11111111                     00000074  11111111
          --------                               --------  ____________
     $2                                00000078  22222222
         ______________                          --------  ____________
     $3                                0000007c  33333333
         ______________                          --------  ____________
     $4                                00000080  44444444
         ______________                          --------  ____________
     $5                                00000084  xxxxxxxx
         ______________                          --------  ____________
     $6
         ______________
     $7
         ______________
     $8
         ______________
     $9
         ______________
    $10
         ______________
    $11
         ______________
    $12
         ______________
    
    
    This is part of your project: part1.abs
    and the result of running that small program part1.chk: 
    
    part1.chk
    
    Note the large amount of information printed each clock time.
    Note that it takes 5 clock cycles to finish an instruction.
    
    Basic MUX Truth Table and Schematic
    
    
    
    How MUXes are used to route data
    
    
    
    You can see much of the code for the above in the
    starter code for Proj1:
    part1_start.vhdl
    
    There are basic design principles for computer architecture
    and many apply to broader applications.
    
    Design Principle 1: 
    
           Simplicity is best achieved through regularity.
    
           A few building blocks, used systematically, will have
           fewer errors, be available sooner and sell for less.
           A uniform instruction set allows better compilers.
    
    Design Principle 2:
    
           Smaller is faster:
    
           Smaller feature size means signals can move faster.
           Shorter paths, fewer stages, allow completion sooner.
    
    Design Principle 3:
    
           Good design requires good compromises.
    
           There are no perfect architectures. There are kluges.
    
    Design Principle 4:
    
           Make the common case fast.
           Amdahl's law: be sure you are maximizing overall speedup.
    
    
    Pentium 4 Hyper threading
    
    Intel Core Duo
    
    AMD quad core, one core shown
    
    $329 for just 1/2 quad core processor
    
    a 4 CPU, 8GB RAM configuration
    Now 12-core 16GB RAM, 3 hard drives, 2 DVD writers
    
    Practice safe computing!
    
    Beware Malware, Spyware and Adware:
    
     Do everything you can to keep malware from infecting your systems;
     malware authors do all they can to keep their work from being
     detected and removed. By looking at the methods that malware uses
     to protect itself, you can better root it out and remove it
     before the damage is done. Downloading attachments is a primary
     way malware gets into your system.
    
    
    HW3 assigned
    
    

    Lecture 6, VHDL introduction

    VHDL is used for structural and functional modeling of digital
    circuits. The geometric modeling is handled by other Cadence
    programs.

    First, simple VHDL statements for logic gates:
    logic gates and corresponding VHDL statements

      VHDL comments start with --  acting like C++ and Java //
      VHDL, like C++ and Java, ends statements with a semicolon ;
      VHDL uses "library" and "use" where C++ uses #include and
      Java uses import.
      VHDL uses ".all" where Java uses ".*"
      VHDL names are similar to Pascal and are case insensitive:
      var is the same as Var and VAR.

    VHDL has a two part basic structure for each circuit that is more
    than one gate: the "entity" and the "architecture".
    There needs to be a "library" and "use" for features that are used.
    The word "port" is used to mean interface.
    The type "std_logic" is used for one bit.
    The type "std_logic_vector" is used for more than one bit.
    The time from an input changing to when the output may change is
    optional. "after 1 ps" indicates 1 picosecond.
    "after 2 ns" indicates 2 nanoseconds.

    This circuit is coded as a full adder component in VHDL:

      library IEEE;
      use IEEE.std_logic_1164.all;
      entity fadd is               -- full adder stage, interface
        port(a    : in  std_logic;
             b    : in  std_logic;
             cin  : in  std_logic;
             s    : out std_logic;
             cout : out std_logic);
      end entity fadd;

      architecture circuits of fadd is  -- full adder stage, body
      begin  -- circuits of fadd
        s    <= a xor b xor cin after 1 ps;
        cout <= (a and b) or (a and cin) or (b and cin) after 1 ps;
      end architecture circuits;  -- of fadd

    Notice that  entity fadd is ... end entity fadd;  is a statement.
    Notice that  architecture circuits of fadd is ... end architecture
    circuits;  is a statement. The "of fadd" connects the architecture
    to the entity.
    The arbitrary signal names a, b, cin, s, cout were required to be
    assigned a type, std_logic in this case, before being used.
    Typical for many programming languages.

    Now, use a loop to combine 32 fadd into a 32 bit adder.
    Note: to use fadd, a long statement must be used:

      a0: entity WORK.fadd port map(a(0), b(0), cin, sum(0), c(0));

    A unique label a0 is followed by a colon : .
    Then entity WORK.fadd names the entity to be used in the WORK library.
    Then port map( with actual signals for a, b, cin, s, cout ).
    Note subscripts for bit numbers are in parentheses, not [] .
    The first and last stage are slightly different from the 30 stages
    in the loop.

    add32.vhdl using the fadd above

    Another variation of an adder, propagate generate:
    add32pg_start.vhdl for HW4

    A "main" entity to use the component add32 with test data.
    Note: just the structure of "entity" then a big architecture.

      entity tadd32 is  -- test bench for add32.vhdl
      end tadd32;       -- no requirement to use "main"

      architecture circuits of tadd32 is
      ...

    tadd32.vhdl for main entity for HW4

    The additional file tadd32.run was needed to tell the VHDL
    simulator how long to run:
    tadd32.run used to stop simulation output of cadence simulation

    The cadence output from the write statements in tadd32.vhdl is:
    tadd32.chk output of tadd32.vhdl

    The GHDL output from the write statements in tadd32.vhdl is:
    tadd32.chkg output of tadd32.vhdl

    The command line commands for using cadence are:

      run_ncvhdl.bash -v93 -messages -linedebug
          -cdslib ~/cs411/vhdl2/cds.lib -hdlvar ~/cs411/vhdl2/hdl.var
          -smartorder add32.vhdl tadd32.vhdl
      run_ncelab.bash -v93 -messages -access rwc
          -cdslib ~/cs411/vhdl2/cds.lib -hdlvar ~/cs411/vhdl2/hdl.var tadd32
      run_ncsim.bash -input tadd32.run -batch -logfile tadd32.out -messages
          -cdslib ~/cs411/vhdl2/cds.lib -hdlvar ~/cs411/vhdl2/hdl.var tadd32

    Or use

      make -f Makefile_411 tadd32.out
      diff -iw tadd32.out tadd32.chk

    The command line commands for using GHDL are:

      ghdl -a --ieee=synopsys add32.vhdl
      ghdl -a --ieee=synopsys tadd32.vhdl
      ghdl -e --ieee=synopsys tadd32
      ghdl -r --ieee=synopsys tadd32 --stop-time=65ns > tadd32.gout
      diff -iw tadd32.gout tadd32.chkg

    Or use

      make -f Makefile_ghdl tadd32.gout

    Use a Makefile for sets of commands. You will be running more than
    once to get homework and projects correct. I provide Makefile_411
    for cadence and Makefile_ghdl for GHDL. Browse them and use them as
    a reference for HW4, HW6, and the Project.
    You must do the setup exactly as stated in HW4.

    Sample designs and corresponding VHDL code
    VHDL Language Compact Summary

    The setup for HW4, HW6 and the Project will be covered in the next
    lecture. You will be using command lines in a terminal window on
    linux.gl.umbc.edu
    You are given a cs411.tar file that creates the needed directories
    for Cadence. Makefile_ghdl sets up the Makefile for GHDL.
    You will be modifying a Makefile for HW4, HW6, and Project parts.
    The basic VHDL commands are shown in the Makefiles:
      Makefile_411 for Cadence
      Makefile_ghdl for GHDL

    Lecture 7, Arithmetic

    The number systems of interest in computer architecture are:
      Sign Magnitude - binary magnitude with sign bit
      Ones Complement - negative numbers have all bits inverted
      Twos Complement - Ones Complement with one added to lsb
    
      All number systems have the sign bit 0 for positive and
      1 for negative. The msb is the sign bit and thus the
      word length is important.
    
     Number systems, using 4-bit words
    
     Hex   Binary  Sign       Ones        Twos
     Digit Bits    Magnitude  Complement  Complement
    
      0    0000     0          0           0
      1    0001     1          1           1
      2    0010     2          2           2
      3    0011     3          3           3
      4    0100     4          4           4
      5    0101     5          5           5
      6    0110     6          6           6
      7    0111     7          7           7
      8    1000    -0         -7          -8  difference starts here
      9    1001    -1         -6          -7
      A    1010    -2         -5          -6
      B    1011    -3         -4          -5
      C    1100    -4         -3          -4
      D    1101    -5         -2          -3
      E    1110    -6         -1          -2
      F    1111    -7         -0          -1
    
     to negate:    invert     invert      invert all bits
                   sign       all bits    and add one
    
     math -(-N)=N   OK         OK          -(-8)=-8 YUK!
    
    
     Addition      Sign       Ones        Twos
                   Magnitude  Complement  Complement
    
        2          0010       0010        0010
       +3          0011       0011        0011
      ___          ----       ----        ----
       +5          0101       0101        0101
                   OK
    
        4          0100       0100        0100
       +5          0101       0101        0101
      ---          ----       ----        ----
        9          1001       1001        1001
                    -1         -6          -7
                   overflow gives wrong answer on
                   fixed length, computer, numbers
    
     Subtraction: negate second operand and add
    
        4          0100       0100        0100
       -5          1101       1010        1011
      ---          ----       ----        ----
       -1          1001       1110        1111
                    -1         -1          -1
                   works, using correct definition of negate
    
    
          Sign Magnitude bigger minus smaller, fix sign 
          Twos Complement, just add. Most computers today
          Ones Complement, just add. e.g. Univac computers
    
     It was discovered that the "add one" was almost
     zero cost; thus most integer arithmetic today is
     twos complement.
    
     The hardware adder has a carry-in input that implements
     the "add one" by making this input a "1".
    
    Basic one bit adder, called a full adder.
    
    
    
    Combining four full adders to make a 4-bit adder.
    
    
    
    Combining eight 4-bit adders to make a 32-bit adder.
    
    
    
    A quick look at VHDL that implements the above diagrams,
    with some optimization, is an add32
    
    
    Using a multiplexor with 32-bit adder for subtraction.
    "sub" is '1' for subtract, '0' for add.
    (NC is no connection, use  open  in VHDL)
    
    
    
    
    There are many types of adders. "Bit slice" will be covered in the
    next lecture on the ALU. First, related to Homework 4 is the
    "propagate generate" adder, then the "Square root N" adder for
    Computer Engineering majors.
    
    The "Propagate Generate" PG adder has a propagation time
    proportional to log_2 N for N bits.
    
    
    
    
    The "add4pg" unit has four full adders and extra circuits,
    defined by equations rather than logic gates:
    -- add4pg.vhdl     entity and architecture
    --                 for 4 bits of a propagate-generate, pg, adder
    library IEEE;
    use IEEE.std_logic_1164.all;
    entity add4pg is
      port(a    : in  std_logic_vector(3 downto 0);
           b    : in  std_logic_vector(3 downto 0);
           cin  : in  std_logic; 
           sum  : out std_logic_vector(3 downto 0);
           p    : out std_logic;
           g    : out std_logic );
    end entity add4pg ;
    
    architecture circuits of add4pg is
      signal c : std_logic_vector(2 downto 0);
    begin  -- circuits of add4pg
      sum(0) <= a(0) xor b(0) xor cin after 2 ps;
      c(0)   <= (a(0) and b(0)) or (a(0) and cin) or (b(0) and cin) after 2 ps;
      sum(1) <= a(1) xor b(1) xor c(0) after 2 ps;
      c(1)   <= (a(1) and b(1)) or
                (a(1) and a(0) and b(0)) or
                (a(1) and a(0) and cin)  or
                (a(1) and b(0) and cin)  or
                (b(1) and a(0) and b(0)) or
                (b(1) and a(0) and cin)  or
                (b(1) and b(0) and cin) after 2 ps;
      sum(2) <= a(2) xor b(2) xor c(1) after 2 ps;
      c(2)   <= (a(2) and b(2)) or (a(2) and c(1)) or (b(2) and c(1)) after 2 ps;
      sum(3) <= a(3) xor b(3) xor c(2) after 2 ps;
      p      <= (a(0) or b(0)) and (a(1) or b(1)) and
                (a(2) or b(2)) and (a(3) or b(3)) after 2 ps;
      g      <= (a(3) and b(3)) or ((a(3) or b(3)) and
                ((a(2) and b(2)) or ((a(2) or b(2)) and
                ((a(1) and b(1)) or ((a(1) or b(1)) and
                ((a(0) and b(0)))))))) after 2 ps;
    end architecture circuits;  -- of add4pg
    
    
    
    The "PG4" box is defined by equations and thus no schematic:
    -- pg4.vhdl    entity and architecture  Carry-Lookahead unit
    --             pg4 is driven by four add4pg entities 
    library IEEE;
    use IEEE.std_logic_1164.all;
    entity pg4 is 
      port(p0   : in  std_logic;
           p1   : in  std_logic;
           p2   : in  std_logic; 
           p3   : in  std_logic;
           g0   : in  std_logic;
           g1   : in  std_logic;
           g2   : in  std_logic; 
           g3   : in  std_logic;
           cin  : in  std_logic;
           c1   : out std_logic;
           c2   : out std_logic;
           c3   : out std_logic;
           c4   : out std_logic);
    end entity pg4 ;
    
    architecture circuits of pg4 is
    begin  -- circuits of pg4
      c1   <= g0 or (p0 and cin) after 2 ps;
      c2   <= g1 or (p1 and g0) or (p1 and p0 and cin) after 2 ps;
      c3   <= g2 or (p2 and g1) or (p2 and p1 and g0) or
              (p2 and p1 and p0 and cin) after 2 ps;
      c4   <= g3 or
              (p3 and g2) or
              (p3 and p2 and g1) or
              (p3 and p2 and p1 and g0) or
              (p3 and p2 and p1 and p0 and cin) after 2 ps;
    end architecture circuits;  -- of pg4
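
    The pg4 equations are the expansion of one recurrence. As a hedged
    C illustration (not course code; the function name is made up), the
    same carries can be computed from per-bit propagate and generate
    terms with c_{i+1} = g_i or (p_i and c_i):

    ```c
    /* Carry-lookahead terms for a 4-bit add: p_i = a_i|b_i (propagate),
       g_i = a_i&b_i (generate), and the recurrence
       c_{i+1} = g_i | (p_i & c_i), which pg4 expands into two-level
       logic for c1..c4. Bit i of the result holds c_{i+1}. */
    unsigned cla_carries4(unsigned a, unsigned b, unsigned cin) {
        unsigned p = a | b;              /* per-bit propagate */
        unsigned g = a & b;              /* per-bit generate  */
        unsigned c = cin & 1, out = 0;
        for (int i = 0; i < 4; i++) {
            c = ((g >> i) & 1) | (((p >> i) & 1) & c);
            out |= c << i;               /* record c_{i+1}    */
        }
        return out;                      /* bit 3 is c4, the carry out */
    }
    ```

    The hardware computes all four carries at once from the expanded
    equations; the loop here only expresses the same recurrence serially.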
    
    
    
    
    The "Carry Select" CS, adder gets increased speed from computing
    the possible output with carry in to that stage being both
    '0' and '1'. The "Carry Select" adder has a propagation time
    proportional to sqrt(N) for N bits.
    
    
    
    
    
    
    
    The above diagram has only 10 bits drawn.
    You need 32 bits. Thus you need additional group of 5,
    group of 6, group of 7, and a final group of 4.
    1+2+3+4+5+6+7+4=32
    
    If N = 64,  log2 N = 6,  sqrt(N) = 8  speedup vs complexity (size)
    
    Behavioral VHDL for our add32:
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    entity add32 is
      port(a    : in  std_logic_vector(31 downto 0);
           b    : in  std_logic_vector(31 downto 0);
           cin  : in  std_logic; 
           sum  : out std_logic_vector(31 downto 0);
           cout : out std_logic);
    end entity add32; -- same for all implementations
    
    library IEEE;
    use IEEE.std_logic_arith.all;
    architecture behavior of add32 is
      signal temp : std_logic_vector(32 downto 0);
      signal vcin : std_logic_vector(32 downto 0) := X"00000000"&'0';
      signal va   : std_logic_vector(32 downto 0) := X"00000000"&'0';
      signal vb   : std_logic_vector(32 downto 0) := X"00000000"&'0';
      -- 33 bits (32 downto 0) needed to compute cout
    begin  -- circuits of add32
      vcin(0) <= cin;
      va(31 downto 0) <= a;
      vb(31 downto 0) <= b;
      temp <= unsigned(va) + unsigned(vb) + unsigned(vcin); -- 33 bit add
      cout <= temp(32) after 6 ps;
      sum  <= temp(31 downto 0) after 6 ps;
    end architecture behavior;  -- of add32
    
      
    
    Now go to Homework 4 and the setup commands.
    
    Expect errors. Nobody's perfect.
         For many errors after typing 'make'
         touch add32.vhdl
         make |& more   # hit space for next page, enter for next line
         make >& add32.prt   # results, including error go to a file
                             # use editor to read file, you can search
    
         FIX THE FIRST ERROR !!!!
         Yes, you can fix other errors also, but one error can cause
         a cascading effect and produce many errors.
    
         Don't panic when there was only one error, you fixed that,
         then the next run you get 37 errors. The compiler has stages,
         it stops on a stage if there is an error. Fixing that error
         lets the compiler move to the next stage and check for other
         types of errors.
    
         Don't give up. Don't make wild guesses. Do experiment with
         one change at a time. You may actually have to read some
         of the handouts :)
    
         Cadence VHDL error message. (actually an extra semicolon)
    
    ncvhdl: 05.40-s011: (c) Copyright 1995-2005 Cadence Design Systems, Inc.
           OUTT : out std_logic;);
                                |
    ncvhdl_p: *E,PORNKW (error.vhdl,10|28): identifier expected.
           OUTT : out std_logic;);
    
    Then to VHDL resource.
    
    

    Lecture 8, ALU

    The Arithmetic Logic Unit is the section of the CPU that actually
    performs add, subtract, multiply, divide, and, or, floating point and
    other operations. The choice of which operations are implemented is
    determined by the Instruction Set Architecture, ISA. Most modern
    computers separate the integer unit from the floating point unit.
    Many modern architectures have simple integer, complex integer, and
    an assortment of floating point units.
    
    
    
    
    The ALU gets inputs from registers reg_use.jpg
    
    Where did numbers such as 100010 for subop and  000010 for sllop
    come from ? cs411_opcodes.txt
    
    
    -- alu_start.vhdl
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    
    entity alu_32 is
      port(inA    : in  std_logic_vector (31 downto 0);
           inB    : in  std_logic_vector (31 downto 0);
           inst   : in  std_logic_vector (31 downto 0);
           result : out std_logic_vector (31 downto 0));
    end entity alu_32;
    
    
    architecture schematic of alu_32 is 
      signal cin     : std_logic := '0';
      signal cout    : std_logic;
    begin  -- schematic
      --
      --   REPLACE THIS SECTION FOR PROJECT PART 1
      --   (add the signals you need above the "begin"
      --    add logic below the "begin")
      
      adder: entity WORK.add32 port map(a    => inA,
                                        b    => inB,     -- change
                                        cin  => cin,     -- change
                                        sum  => result,  -- change
                                        cout => cout);
    
    -- examples of entity instantiations:
      
    -- bsh: entity WORK.bshift port map (left    => sllop,
    --                                   logical => '1',
    --                                   shift   => inst(10 downto 6),
    --                                   input   => inB,
    --                                   output  => bresult);
    
    -- r1: entity WORK.equal6  port map (inst  => inst(31 downto 26),
    --                                   test  => "000000",
    --                                   equal => rrop);
    
    -- s1: entity WORK.equal6  port map (inst  => inst(5 downto 0),
    --                                   test  => "100010",         -- 34
    --                                   equal => subop1);
    -- s1a: subop <= subop1 and rrop;
    
    
    --      S_sel <= sllop_or_srlop; -- for mux32_6
    
    -- much more
       
    end architecture schematic;  -- of alu_32
    
    There are many variations of  subop, subop1, subop_and, subopa .
    Your starter part1ce_start.vhdl uses subopa, short for subop_and.
    part1ce_start.vhdl
    
    
    
    
    
    -- mux32_3.vhdl
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    entity mux32_3 is
      port(in0    : in  std_logic_vector (31 downto 0);
           in1    : in  std_logic_vector (31 downto 0);
           in2    : in  std_logic_vector (31 downto 0);
           ct1    : in  std_logic;          -- pass in1(has priority)
           ct2    : in  std_logic;          -- pass in2
           result : out std_logic_vector (31 downto 0));
    end entity mux32_3;
    
    architecture behavior of mux32_3 is 
    begin  -- behavior -- no process needed with concurrent statements
      result <= in1 when ct1='1' else in2 when ct2='1' else in0 after 50 ps;
    end architecture behavior;  -- of mux32_3
    
    -- mux_32_6.vhdl  have only zero or one  ctl  ='1'
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    
    entity mux_32_6 is
      port(in0    : in  std_logic_vector (31 downto 0);
           in1    : in  std_logic_vector (31 downto 0);
           in2    : in  std_logic_vector (31 downto 0);
           in3    : in  std_logic_vector (31 downto 0);
           in4    : in  std_logic_vector (31 downto 0);
           in5    : in  std_logic_vector (31 downto 0);
           ctl1   : in  std_logic;
           ctl2   : in  std_logic;
           ctl3   : in  std_logic;
           ctl4   : in  std_logic;
           ctl5   : in  std_logic;
           result : out std_logic_vector (31 downto 0));
    end entity mux_32_6;
    
    architecture behavior of mux_32_6 is 
    begin  -- behavior -- no process needed with concurrent statements
      result <= in1 when ctl1='1' else in2 when ctl2='1' else
                in3 when ctl3='1' else in4 when ctl4='1' else
                in5 when ctl5='1' else in0 after 10 ps;
    end architecture behavior;  -- of mux_32_6
    
    
    
    
    
    
    Note that bshift.vhdl contains two different architectures
    for the same entity. A behavioral architecture using sequential
    programming and a circuits architecture using digital logic
    components.
    
    bshift.vhdl
    
    
    An 8-bit version of shift right logical, using single bit signals,
    three bit shift count, is:
    
    
    
    
    
    
    There are many ways to build an ALU. Often the choice is based
    on mask making and requires a repeated pattern. The "bit slice"
    method uses the same structure for every bit. One example is:
    
    
    
    Note that 'Operation' is two bits, 0 for logical and, 1 for logical or,
    2 for add or subtract, and 3 for an operation called set used for
    comparison.
    'Binvert' and 'CarryIn' would be set to '1' for subtract.
    'Binvert' and 'a' set to '0' would be complement.
    The overflow detection is in every stage yet only used in the
    last stage.
    
    The bit slices are wired together to form a simple ALU:
    
    
    
    The 'set' operation would give non zero if 'a' < 'b' and
    zero otherwise. A possible condition status or register
    value for a "beq" instruction.
    
    
    If overflow was to be detected, the circuit below uses the
    sign bit of the A and B inputs and the sign bit of the
    result to detect overflow on twos complement addition.
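
    The same sign-bit test can be written in C. This is an illustrative
    sketch, not course code; the function name is made up:

    ```c
    #include <stdint.h>

    /* Twos complement addition overflows exactly when A and B have the
       same sign but the sum's sign differs. ~(a^b) has its top bit set
       when the operand signs agree; (a^sum) has its top bit set when
       the result sign differs from a. */
    int add_overflows(uint32_t a, uint32_t b) {
        uint32_t sum = a + b;                 /* wraps mod 2^32 */
        return (int)((~(a ^ b) & (a ^ sum)) >> 31);
    }
    ```

    For example, 0x7FFFFFFF + 1 overflows (positive + positive gives a
    negative result), while 0xFFFFFFFF + 1, which is -1 + 1 = 0, does not.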
    
    
     
    
    
    
    The ALU fits into the machine architecture as shown below:
    
    
    
    
    
    32-bit and 64-bit  ALU  architectures are available.
    
    A 64-bit architecture, by definition, has 64-bit integer registers.
    Many computers have had 64-bit IEEE floating point for many years.
    64-bit machines such as the Alpha and PowerPC have been around for
    a while, yet 64-bit computing became popular on the desktop with
    the Intel and AMD 64-bit machines.
    
    
    
    Software has been dragging well behind computer architecture.
    The chaos started in 1979 with the following "choices."
    
    
    
    The full whitepaper www.unix.org/whitepapers/64bit.html
    
    My desire is to have the compiler, linker and operating system be ILP64.
    All my code would work fine. I make no assumptions about word length.
    I use sizeof(int), sizeof(size_t), etc. when absolutely needed.
    On my 8GB computer I use a single array of over 4GB, thus the subscripts
    must be 64-bit. The only option I know of for gcc is -m64, and that
    just gives LP64. Yuk! I have to change my source code and use "long"
    everywhere in place of "int". If you get the idea that I am angry with
    the compiler vendors, you are correct!
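
    A minimal inline version of the data-model check (an illustrative
    sketch in the spirit of big.c, not the actual file; the function
    name is made up):

    ```c
    #include <stdio.h>

    /* Print the sizes that define the data model. ILP64 would give
       int = long = pointer = 8 bytes; LP64 (what gcc -m64 gives on
       most 64-bit Unix systems) keeps int at 4 bytes. */
    void print_data_model(void) {
        printf("int=%zu long=%zu long long=%zu size_t=%zu void*=%zu\n",
               sizeof(int), sizeof(long), sizeof(long long),
               sizeof(size_t), sizeof(void *));
    }
    ```

    On an LP64 system this prints int=4 with long, size_t and void*
    all 8, which is exactly why "int" subscripts cannot index a 4GB array.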
    
    Here are sample programs and output to test for 64-bit capability in gcc:
    
    Get sizeof on types and variables big.c
    
    output from  gcc -m64 big.c  big.out
    
    malloc more than 4GB  big_malloc.c
    
    output from  big_malloc_mac.out
    
    Newer Operating Systems and compilers (note 'sizeof' changed to long)
    Get sizeof on types and variables big12.c
    
    output from  gcc big12.c  big12.out
    
    
    
    
    The early 64-bit computers were:
    
    DEC Alpha
    
    DEC Alpha
    
    IBM PowerPC
    
    
    Some history of 64-bit computers:
    
    
    
    
    
    Java for 64-bit, source compatible
    
    Then to VHDL resource, FPGA.
    get free GHDL
    
    

    Lecture 9, Multiply

    
    Standard decimal and binary multiplication could look like:
    
                234          01010             multiplicand
              x 121        x 00011           x   multiplier
             ------       --------           --------------
                234          01010                  product
               468          01010
              234          00000
             ------       00000
             028314      00000
             |          ----------
             |          0000011110  5 bits times 5 bits gives a 10-bit product;
             |                      in a computer the leading zeros are kept.
             |
             3-digits times 3-digits gives a 6-digit product, yet in
             decimal, we do not write the leading zeros.
    
    We have covered how computer adders work and how they are built.
    Exactly two numbers are added to produce one sum, thus the binary
    multiply above needs to be rewritten as:
    
                            01010
                          x 00011
                       ----------
                           001010 -- multiplier LSB anded with multiplicand
                         + 01010  -- multiplier bit-1 anded with multiplicand
                           -----
                          0011110 -- partial sum, bottom bit passed down
                        + 00000   -- multiplier bit-2 anded with multiplicand
                          -----
                         00011110 -- partial sum, bottom two bits passed down
                       + 00000    -- multiplier bit-3 anded with multiplicand
                         -----
                        000011110 -- partial sum, bottom three bits passed down
                      + 00000     -- multiplier bit-4 anded with multiplicand
                        -----
                       0000011110 -- final product, four bits passed down
    
    Thus, by this simple method, with a 5-bit unsigned multiplier, there
    are four additions needed. A circuit that uses one adder and performs
    serial multiplication follows directly. This design chose to use a
    multiplexor rather than an 'and' operation to select the multiplicand
    or zero. 
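
    The hi/lo register scheme can be sketched in C. This is only an
    illustration of the introductory method, not course code; the
    function name is made up:

    ```c
    #include <stdint.h>

    /* Serial shift-add multiply: one adder, one multiplier bit per
       clock. hi accumulates the running sum (kept wide so the carry
       out is not lost); lo holds the multiplier and collects product
       bits as they shift out of hi. */
    uint64_t shift_add_mul(uint32_t multiplicand, uint32_t multiplier) {
        uint64_t hi = 0;                   /* 33 bits: sum plus cout */
        uint32_t lo = multiplier;
        for (int i = 0; i < 32; i++) {
            if (lo & 1)                    /* mux: multiplicand or zero */
                hi += multiplicand;
            lo = (lo >> 1) | (uint32_t)((hi & 1) << 31);  /* shift pair */
            hi >>= 1;
        }
        return (hi << 32) | lo;            /* 64-bit product */
    }
    ```

    The shift of the {hi, lo} pair right by one each cycle is the
    software analog of the wire routing in the circuit.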
    
    How a register works
    
    
    
    The VHDL code that represents the above circuit is:
    
      mula  <= hi;
      mulb  <= md when (lo(0)='1') else x"00000000" after 50 ps;
      adder:entity WORK.add32 port map(mula, mulb, '0', muls, cout);
      hi <= cout & muls(31 downto 1) when mulclk'event and mulclk='1';
      lo <= muls(0) & lo(31 downto 1) when mulclk'event and mulclk='1';
    
    The signal "mulclk" runs for the number of clock cycles that
    there are bits in the multiplier, 32 for this example. For
    simplicity of design, zero is added in the first step. Note that
    "cout" is used when loading the "hi" register. The shifting is
    accomplished by wire routing. 
    
    The VHDL test source code is mul_ser.vhdl
    
    The output from the test is mul_ser.out
    
    P.S. The above was an introduction; never use that method or circuit.
    
    A serial multiplier can be built using only half as many clock cycles.
    We use the technique developed by Mr. Booth. Two multiplier bits are
    used each clock cycle. Only one add operation is needed each cycle,
    yet the augend has several possible values as shown by the
    multiplexor in the schematic and the table in the VHDL source code.
    
    
    
    
    The VHDL test source code is bmul_ser.vhdl
    
    The output from the test is bmul_ser.out
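
    Booth's recoding can also be sketched in software. The radix-4
    digit table is standard; this C function is an illustrative
    assumption, not the bmul_ser.vhdl design, and the function name is
    made up:

    ```c
    #include <stdint.h>

    /* Radix-4 (modified) Booth multiply: two multiplier bits per step,
       so 32 bits need only 16 add/subtract steps. Each step looks at
       bits b[i+1] b[i] b[i-1] and recodes them to a digit in
       {-2,-1,0,+1,+2}; the digit selects the value fed to the adder,
       the several possible augend values in the multiplexor. */
    int64_t booth_mul(int32_t multiplicand, int32_t multiplier) {
        static const int digit[8] = {0, 1, 1, 2, -2, -1, -1, 0};
        uint32_t mr = (uint32_t)multiplier;
        int64_t prod = 0;
        int prev = 0;                          /* b[-1] = 0 */
        for (int i = 0; i < 32; i += 2) {
            int triple = (int)((mr >> i) & 3) * 2 + prev; /* b[i+1]b[i]b[i-1] */
            prod += ((int64_t)digit[triple] * multiplicand) << i;
            prev = (mr >> (i + 1)) & 1;
        }
        return prod;                           /* signed 64-bit product */
    }
    ```

    Because the recoding of the top bit pair accounts for the sign,
    signed operands work with no separate fix-up step.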
    
    
    Next, parallel multiplication with a carry-save design.
    Note there is no carry propagation except in the last stage.
    
    
    
    
    
    
    
    
    Some fancy VHDL using double subscripting and "generate".
    pmul4.vhdl
    
    
    A 32 bit design using an add32csa entity is:
    
    
    
    
    
    
    The VHDL entity for the carry-save multiplier is mul32c.vhdl
    The VHDL test source code is mul32c_test.vhdl
    The output from the test is mul32c_test.out
    
    
    We can now apply the Booth multiplication technique to cut the
    number of stages in half, still using the parallel carry-save multiply.
    The VHDL was written without a diagram, thus no schematic exists, yet.
    
    The VHDL entity for the Booth carry-save multiplier is bmul32.vhdl
    The VHDL test source code is bmul32_test.vhdl
    The output from the test is bmul32_test.out
    
    Homework 5 is assigned
    
    

    Lecture 10, Divide

    Hopefully you understand decimal division:
    
                       49  quotient
                    ______
      divisor   47 / 2345  dividend
                     188
                     ---
                      465
                      423
                      ---
                       42  remainder
    
    
    And check division by multiplication:
    
                    49  multiplicand is the quotient above
                 x  47  multiplier is the divisor above
                  ----
                  2303
                +   42  add the remainder above
                  ----
                  2345  final sum is the dividend above
    
    
    A smaller case that is used below in binary:
    
                       12  quotient
                      ___
          divisor  7 / 85  dividend
                       7
                       --
                       15
                       14
                       --
                        1  remainder
    
    
    Binary divide:  the conventional (restoring) method and the non-restoring method
    
      These examples are shown in a form that can be directly
      implemented in a computer architecture.
    
      The divisor, quotient and remainder are each one word.
      The dividend is two words.
      The equations   dividend = quotient * divisor + remainder
      and             |remainder| < |divisor|
      must be satisfied.
      When a choice is possible, choose the sign of the remainder to
      be the same as the sign of the dividend.
    
      Save the sign bits of the dividend and divisor; if necessary,
      negate the dividend and divisor to make them positive.
      Fix up the signs of the quotient and remainder after dividing.
    
      Example:  dividend = 85 ,  divisor = 7
    
      Decimal divide  85 / 7 = quotient 12 , remainder 1     
    
    
    Restoring (conventional) binary divide, twos complement 4-bit numbers
    
                                    1 1 0 0   quotient
                           ________________
                 0 1 1 1  / 0 1 0 1 0 1 0 1
                           -0 1 1 1      may subtract by adding twos complement
                            _______          - 0 1 1 1   is   1 0 0 1
       5 - 7 = -2           1 1 1 0
       negative, add 7     +0 1 1 1
       restored             _______
       next bit               1 0 1 0
                             -0 1 1 1
                              _______
       10 - 7 = 3               0 1 1 1
       quotient=1, next bit    -0 1 1 1
                                _______
       7 - 7 = 0                0 0 0 0 0
       quotient=1, next bit      -0 1 1 1
                                  _______
       0 - 7 = -7                 1 0 0 1
       negative, add 7           +0 1 1 1
       quotient=0                 _______
       restored, next bit           0 0 0 1
                                   -0 1 1 1
                                    _______
       1 - 7 = -6                   1 0 1 0
       negative, add 7             +0 1 1 1
       quotient=0                   _______
       restored, finished           0 0 0 1   final remainder
       (8 cycles using adder)
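    The restoring procedure above can be sketched in C; restoring_div is an
    illustrative name, and a 64-bit dividend stands in for the two-word
    dividend of the hardware.

```c
#include <stdint.h>

/* Restoring division sketch: one trial subtract per quotient bit.
   If the trial result is negative the old remainder is kept (the
   "restore"); assumes no overflow (high word of dividend < divisor). */
void restoring_div(uint64_t dividend, uint32_t divisor,
                   uint32_t *quotient, uint32_t *remainder)
{
    uint64_t rem = dividend;
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        uint64_t trial = rem - ((uint64_t)divisor << i); /* subtract */
        if ((int64_t)trial >= 0) {     /* non-negative: quotient bit 1 */
            rem = trial;
            q |= 1u << i;
        }                              /* negative: restore (keep rem) */
    }
    *quotient = q;
    *remainder = (uint32_t)rem;
}
```

    Dividing 85 by 7 this way gives quotient 12 and remainder 1, matching
    the worked example.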
    
    
    Clock cycles can be saved by not performing the "restore" operation.
    
      non-restoring binary divide, twos complement 4-bit numbers
      note: 7 = 0 1 1 1     -7 = 1 0 0 1
    
    
                                    1 1 0 0   quotient
                           ________________
                 0 1 1 1  / 0 1 0 1 0 1 0 1
       pre shift             +1 0 0 1         adding twos complement of divisor
                              _______
       10 - 7 = 3             0 0 1 1 1
       quotient=1              +1 0 0 1
       next bit subtract        _______
       7 - 7 = 0                0 0 0 0 0
       quotient=1                +1 0 0 1
       next bit subtract          _______
       0 - 7 = -7                 1 0 0 1 1
       quotient=0                  +0 1 1 1    adding divisor
       next bit add                 _______
   3 + 7 = 10 = -6              1 0 1 0
       quotient=0                  +0 1 1 1
       correction add               _______
       final remainder              0 0 0 1    remainder
       (5 cycles using adder)
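    The same non-restoring steps can be sketched in C (nonrestoring_div is a
    name chosen here). Note the final correction add, matching the last step
    of the worked example.

```c
#include <stdint.h>

/* Non-restoring division sketch: subtract when the partial remainder
   is non-negative, add when it is negative; the quotient bit is 1
   exactly when the new partial remainder is non-negative. One
   correction add at the end fixes a negative remainder.
   Assumes no overflow (high word of dividend < divisor). */
void nonrestoring_div(uint64_t dividend, uint32_t divisor,
                      uint32_t *quotient, uint32_t *remainder)
{
    int64_t rem = (int64_t)dividend;
    uint32_t q = 0;
    for (int i = 31; i >= 0; i--) {
        if (rem >= 0)
            rem -= (int64_t)divisor << i;  /* subtract shifted divisor */
        else
            rem += (int64_t)divisor << i;  /* add shifted divisor */
        if (rem >= 0)
            q |= 1u << i;                  /* quotient bit from sign */
    }
    if (rem < 0)
        rem += divisor;                    /* final remainder correction */
    *quotient = q;
    *remainder = (uint32_t)rem;
}
```

    Every step uses the adder exactly once, which is where the cycle
    savings over the restoring method comes from.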
    
    
    Correcting signs:
          dividend  divisor |  quotient  remainder
          ------------------+--------------------
             +        +     |      +        +      +85 / +7 = +12  R +1
             +        -     |      -        +      +85 / -7 = -12  R +1
             -        +     |      -        -      -85 / +7 = -12  R -1
             -        -     |      +        -      -85 / -7 = +12  R -1
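    The sign table translates directly to code; signed_div is a hypothetical
    wrapper that divides magnitudes (C's unsigned divide stands in for the
    hardware divider) and then applies the sign rules above.

```c
#include <stdint.h>

/* Sign fix-up sketch: divide magnitudes, then apply the table above.
   Quotient is negative when the operand signs differ; the remainder
   takes the sign of the dividend. */
void signed_div(int32_t dividend, int32_t divisor,
                int32_t *quotient, int32_t *remainder)
{
    uint32_t ud = dividend < 0 ? -(uint32_t)dividend : (uint32_t)dividend;
    uint32_t uv = divisor  < 0 ? -(uint32_t)divisor  : (uint32_t)divisor;
    uint32_t q = ud / uv;          /* stands in for the hardware divider */
    uint32_t r = ud % uv;
    *quotient  = ((dividend < 0) != (divisor < 0)) ? -(int32_t)q : (int32_t)q;
    *remainder = (dividend < 0) ? -(int32_t)r : (int32_t)r;
}
```
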
    
    
    Humans, not the computer, keep track of the binary point.
    
              Integers             Fractions           (fixed point)
    
                   qqqq.                 .qqqq                 q.qqq
              __________            __________            __________
       ssss. / dddddddd.     .ssss / .dddddddd     ss.ss / ddd.ddddd
                   _____                 _____               _______
                   rrrr.             .0000rrrr                .0rrrr
    
    
    
                   qqqq.                 .qqqq                q.qqq
                 * ssss.              *  .ssss          *     ss.ss
               _________              ________            _________
               tttttttt.             .tttttttt            ttt.ttttt
             +     rrrr.          +  .0000rrrr        +      .0rrrr
               _________             _________            _________
               dddddddd.             .dddddddd            ddd.ddddd
    
      for multiply, counting positions from the right, the binary point
      of the product is at the sum of the positions of the multiplicand
      and multiplier.
    
      for divide, counting positions from the right, the binary point
      of the quotient is at the difference of the positions of the
      dividend and divisor. The binary point of the remainder is in
      the same position as the binary point of the dividend.
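    The multiply rule (product point position = sum of the operand positions)
    can be checked with a small fixed-point sketch; the Q4.4 and Q8.8 names
    (4 and 8 fraction bits) are notation introduced here, not the lecture's.

```c
#include <stdint.h>

/* Fixed-point sketch: a Q4.4 value has 4 fraction bits, so 1.5 is
   stored as 24 (1.5 * 16). Multiplying two Q4.4 values yields a Q8.8
   product: the binary point lands at 4 + 4 = 8, the sum of positions. */
int32_t q44_mul(int16_t a, int16_t b)
{
    return (int32_t)a * b;   /* plain hardware multiply; no shift needed */
}
```
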
    
    Overflow occurs when the top half of the dividend is greater than or
    equal to the divisor; thus division by zero always overflows.
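    The overflow rule is a one-line test (div_overflows is a name chosen
    here for this sketch):

```c
#include <stdint.h>

/* Divide-overflow check sketch: with a two-word dividend and a one-word
   divisor, the quotient cannot fit in one word exactly when the high
   word of the dividend is >= the divisor. Divide by zero always trips
   this, since dividend_hi >= 0 holds for any unsigned value. */
int div_overflows(uint32_t dividend_hi, uint32_t divisor)
{
    return dividend_hi >= divisor;
}
```
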
    
    
    No schematic or VHDL is provided for restoring division because
    it is never used in practice. The serial non-restoring division is:
    
    
    A possible design for a serial divide (it does not include remainder correction):
    
    diva    <= hi(30 downto 0) & lo(31) after 50 ps; -- shift
    divb    <= not md when sub_add='1' else md after 50 ps; -- subtract or add
    adder:entity WORK.add32 port map(diva, divb, sub_add, divs, cout);  
    quo     <= not divs(31) after 50 ps; -- quotient bit
    hi      <= divs                  when divclk'event and divclk='1';
    lo      <= lo(30 downto 0) & quo when divclk'event and divclk='1';
    sub_add <= quo                   when divclk'event and divclk='1';
    
    
    
    The full VHDL code is div_ser.vhdl
    with output div_ser.out
    
    Note that the remainder is not corrected by this circuit.
    The  FFFFFFFA should have the divisor 00000007 added to it,
    making the remainder  00000001
    
    
    Now that you understand how binary division works and how
    multiplication can be sped up using parallel circuits,
    we show a parallel division circuit and its simulation.
    
    
    
    divcas4_test.vhdl
    
    divcas4_test.out
    
    Note that the output includes the time.
    Observe in the first few lines of printout how 'U' (undefined,
    meaning not yet computed) is replaced by zeros and ones. Unfortunately,
    when VHDL prints hexadecimal, any state other than '1' is printed as zero.
    
    For the part1 project you are given divcas16.vhdl.
    This divides a 32-bit number by a 16-bit number and
    produces a 16-bit quotient and a 16-bit remainder.
    
    divcas16.vhdl
    
    It would be nice if I could have a 4-bit radix 2 or radix 4 SRT
    division schematic here. Parallel circuits that perform division
    may use (-2, -1, 0, 1, 2) values for intermediate signals.
    Two or more bits of the quotient may be computed at each stage,
    based on a table and a few bits of the divisor and partial
    remainder.
    
    SRT Divide, click on slide show .pdf
    SRT Divide .pdf local
    
    freepatentsonline.com/5272660.html
    
    Software can be copyrighted. Merely creating a physical embodiment makes
    you the owner of the copyright. Add  Copyright year name  to the
    document or computer file. If you want your copyright to stand up
    in a court of law, you need to register the copyright. Get the latest
    information; at one time there was a $40.00 filing fee and the
    copyright was good for 28 years, renewable for 67 more years, for
    a total of 95 years.
    
    There is a "fair use" clause that allows personal use of parts
    of a copyrighted document.
    
    Software and hardware and processes may be patented. A utility
    patent is good for 20 years, a design patent is good for 14 years.
    The cost of completing the process of getting a patent is variable.
    20 years ago the average cost was $5,000.00 and today the average
    cost is about $15,000.00. There are companies that can help you,
    do-it-yourself, with advertised cost starting from about $1,500.00.
    (There may be additional maintenance fees at 3 1/2 years etc.)
    ((It may take a year or more to get a patent.))
    One version of the process to get a patent is:
    
    
    
    There is no "fair use" clause on patents.
    
    

    Lecture 11, Floating Point

    
    Almost all numerical computation is performed using the
    IEEE 754-1985 Standard for Binary Floating-Point Arithmetic.
    The two formats that we deal with in practice are the 32 bit and
    64 bit formats. You need to know how to get the format you desire
    in the language you are programming. Complex numbers use two such values.
    
                                              older
            C       Java    Fortran 95        Fortran    Ada 95         MATLAB
            ------  ------  ----------------  -------    ----------     -------
    32 bit  float   float   real              real       float          N/A
    64 bit  double  double  double precision  real*8     long_float     'default'
    
    complex
    32 bit  'none'  'none'  complex           complex     complex       N/A
    64 bit  'none'  'none'  double complex    complex*16  long_complex  'default'
    
    'none' means not provided by the language (may be available as a library)
    N/A means not available, you get the default.
    
    IEEE Floating-Point numbers are stored as follows:
    The single format 32 bit has
        1 bit for sign,  8 bits for exponent, 23 bits for fraction
    The double format 64 bit has
        1 bit for sign, 11 bits for exponent, 52 bits for fraction
    
    There is actually an implied '1' bit (the 24th or 53rd bit) to the left
    of the fraction that is not stored. The fraction including
    the non-stored bit is called the significand.
    
    The exponent is stored as a biased value, not a signed value.
    The 8-bit exponent has 127 added; the 11-bit exponent has 1023 added.
    A few values of the exponent are "stolen" for
    special values: +/- infinity, not a number, etc.
    
    Floating point numbers are sign magnitude. Invert the sign bit to negate.
    
    Some example numbers and their bit patterns:
    
       decimal
    stored hexadecimal sign exponent  fraction                 significand 
                       bit                                     in binary
                                     The "1" is not stored 
                                     |                                   biased    
                        31  30....23  22....................0            exponent
       1.0
    3F 80 00 00          0  01111111  00000000000000000000000  1.0   * 2^(127-127) 
    
       0.5
    3F 00 00 00          0  01111110  00000000000000000000000  1.0   * 2^(126-127)
    
       0.75
    3F 40 00 00          0  01111110  10000000000000000000000  1.1   * 2^(126-127)
    
       0.9999995
    3F 7F FF FF          0  01111110  11111111111111111111111  1.1111* 2^(126-127)
    
       0.1
    3D CC CC CD          0  01111011  10011001100110011001101  1.1001* 2^(123-127)
     
    
                              63  62...... 52  51 .....  0
       1.0
    3F F0 00 00 00 00 00 00    0  01111111111  000 ... 000  1.0    * 2^(1023-1023)
    
       0.5
    3F E0 00 00 00 00 00 00    0  01111111110  000 ... 000  1.0    * 2^(1022-1023)
    
       0.75
    3F E8 00 00 00 00 00 00    0  01111111110  100 ... 000  1.1    * 2^(1022-1023)
    
       0.9999999999999995
    3F EF FF FF FF FF FF FF    0  01111111110  111 ...      1.11111* 2^(1022-1023)
    
       0.1
    3F B9 99 99 99 99 99 9A    0  01111111011  10011..1010  1.10011* 2^(1019-1023)
                                                                               |
                            sign   exponent      fraction                      |
                                                    before storing subtract bias
    
    Note that any integer of magnitude up to 2^24 may be represented exactly
    in the 32 bit format (up to 2^53 in the 64 bit format).
    Any power of two in the range -126 to +127 times such an integer may also
    be represented exactly. Numbers such as 0.1, 0.3, 1.0/5.0, 1.0/9.0 are
    represented approximately. 0.75 is 3/4, which is exact.
    Some languages are careful to represent approximated numbers
    accurately to plus or minus the least significant bit.
    Other languages may be less accurate.
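    The bit patterns in the tables above can be reproduced in C; float_bits
    and the field helpers are names chosen for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE 754 bit pattern of a 32-bit float.
   memcpy is the well-defined way to view the bits in C. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

uint32_t float_sign(uint32_t bits)     { return bits >> 31; }
uint32_t float_exponent(uint32_t bits) { return (bits >> 23) & 0xFF; } /* biased by 127 */
uint32_t float_fraction(uint32_t bits) { return bits & 0x7FFFFF; }     /* hidden 1 not stored */
```

    For example, float_bits(0.75f) gives 0x3F400000, matching the table:
    sign 0, biased exponent 126, fraction 100...0.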
    
    /* flt.c  just to look at .o file with hdump */
    void flt()  /* look at IEEE floating point */
    {
      float x1 = 1.0f;
      float x2 = 0.5f;
      float x3 = 0.75f;
      float x4 = 0.99999f;
      float x5 = 0.1f;
    
      double d1 = 1.0;
      double d2 = 0.5;
      double d3 = 0.75;
      double d4 = 0.99999999;
      double d5 = 0.1;
    }
                                                             The "1" not stored
                                                                     in binary
                                                                 |
                          31  30....23  22....................0  |
      3F 80 00 00          0  01111111  00000000000000000000000  1.0   * 2^(127-127) 
      3F 00 00 00          0  01111110  00000000000000000000000  1.0   * 2^(126-127)
      3F 40 00 00          0  01111110  10000000000000000000000  1.1   * 2^(126-127)
      3F 7F FF 58          0  01111110  11111111111111101011000  1.1111* 2^(126-127)
      3D CC CC CD          0  01111011  10011001100110011001101  1.1001* 2^(123-127)
     
    
                                63  62...... 52  51 .....  0
      3F F0 00 00 00 00 00 00    0  01111111111  000 ... 000  1.0    * 2^(1023-1023)
      3F E0 00 00 00 00 00 00    0  01111111110  000 ... 000  1.0    * 2^(1022-1023)
      3F E8 00 00 00 00 00 00    0  01111111110  100 ... 000  1.1    * 2^(1022-1023)
      3F EF FF FF FA A1 9C 47    0  01111111110  111 ...      1.11111* 2^(1022-1023)
      3F B9 99 99 99 99 99 9A    0  01111111011  1001 ..1010  1.10011* 2^(1019-1023)
                                                                                 |
                              sign   exponent      fraction                      |
                                                                       subtract bias
    
      decimal                     binary fraction / decimal exponent  IEEE normalize
                                                                      binary
    
    
    Now, all the above is the memory (RAM) format.
    Upon a load operation of either float or double into one of the floating point
    registers, the value in the register is extended to greater precision
    than double. All floating point arithmetic is performed at this
    greater precision. Upon a store operation, the greater precision is
    reduced to the memory format, possibly with rounding.
    From a programming viewpoint, always use double.
    
    
      exponents must be the same for add and subtract!
    
      A = 3.5 * 10^6              a = 11.1 * 2^6                        1.11 * 2^7
      B = 2.5 * 10^5              b = 10.1 * 2^5                        1.01 * 2^6
    
      A+B       3.50 * 10^6       a+b        11.10 * 2^6               1.110 * 2^7
              + 0.25 * 10^6                +  1.01 * 2^6            +  0.101 * 2^7
              _____________               ______________              ------------
                3.75 * 10^6                 100.11 * 2^6              10.011 * 2^7
                                                           normalize  1.0011 * 2^8
                                                           IEEE
      A-B       3.50 * 10^6
                                                            normalize  0.10011 * 2^9
              - 0.25 * 10^6                                fraction
              -------------
                3.25 * 10^6
    
      A*B       3.50 * 10^6
              * 2.5  * 10^5
              -------------
                8.75 * 10^11
    
      A/B   3.5 *10^6 / 2.5 *10^5 = 1.4 * 10^1
    
    
      
    
      The mathematical basis for floating point is simple algebra
    
      The common uses are in computer arithmetic and scientific notation
    
      given: a number  x1  expressed as 10^e1 * f1
      then  10  is the base, e1 is the exponent and f1 is the fraction
      example  x1 = 10^3 * .1234  means  x1 = 123.4  or  .1234*10^3
      or in computer notation   0.1234E3
    
      In computers the base is chosen to be 2, i.e. binary notation
      for  x1 = 2^e1 * f1 where e1=3 and f1 = .1011
      then x1 = 101.1 base 2 or, converting to decimal x1 = 5.5 base 10
    
      Computers store the sign bit, 1=negative, the exponent and the
      fraction in a floating point word that may be 32 or 64 bits.
    
      The operations of add, subtract, multiply and divide are defined as:
    
      Given   x1 = 2^e1 * f1
              x2 = 2^e2 * f2  and e2 <= e1
    
      x1 + x2 = 2^e1 *(f1 + 2^-(e1-e2) * f2)  f2 is shifted then added to f1
    
      x1 - x2 = 2^e1 *(f1 - 2^-(e1-e2) * f2)  f2 is shifted then subtracted from f1
    
      x1 * x2 = 2^(e1+e2) * f1 * f2
    
      x1 / x2 = 2^(e1-e2) * (f1 / f2)
    
      an additional operation is usually needed, normalization.
      if the resulting "fraction" has digits to the left of the binary
      point, then the fraction is shifted right and one is added to
      the exponent for each bit shifted until the result is a fraction.
      
      We will use fraction normalization, not IEEE normalization:
    
      if the resulting "fraction" has zeros immediately to the right of
      the binary point, then the fraction is shifted left and one is
      subtracted from the exponent for each bit shifted until there
      is a non zero digit to the right of the binary point.
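    The shift-then-add rule and fraction normalization above can be sketched
    with C doubles standing in for the fractions; the fp type and the fp_add
    name are made up here, and negative fractions are not handled.

```c
#include <math.h>

/* Floating add sketch on (exponent, fraction) pairs with fraction
   normalization 0.5 <= f < 1.0. The smaller operand's fraction is
   shifted right by the exponent difference, then added. */
typedef struct { int e; double f; } fp;

fp fp_add(fp x1, fp x2)
{
    if (x2.e > x1.e) { fp t = x1; x1 = x2; x2 = t; }     /* ensure e2 <= e1 */
    fp r = { x1.e, x1.f + ldexp(x2.f, x2.e - x1.e) };    /* shift, then add */
    while (r.f >= 1.0)              { r.f /= 2; r.e++; } /* shift right, bump exponent */
    while (r.f != 0.0 && r.f < 0.5) { r.f *= 2; r.e--; } /* normalize left */
    return r;
}
```

    Adding 2^4 * 0.5 and 2^2 * 0.5 this way yields 2^4 * 0.625, matching
    the worked example.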
    
      Numeric examples using equations:
           (exponents are decimal integers, fractions are decimal)
           (normalized numbers have  1.0 > fraction >= 0.5)
           (note fraction strictly less than 1.0, greater than or equal 0.5)
     
      x1 = 2^4 * 0.5   or  x1 = 8.0
      x2 = 2^2 * 0.5   or  x2 = 2.0
    
      x1 + x2 = 2^4 * (.5 + 2^-(4-2) * .5) = 2^4 * (.5 + .125) = 2^4 * .625
    
      x1 - x2 = 2^4 * (.5 - 2^-(4-2) * .5) = 2^4 * (.5 - .125) = 2^4 * .375 
           not normalized, multiply fraction by 2, subtract 1 from exponent 
                                           = 2^3 * .75
    
      x1 * x2 = 2^(4+2) * (.5*.5) = 2^6 * .25   not normalized
                                  = 2^5 * .5    normalized
    
      x1 / x2 = 2^(4-2) * (.5/.5) = 2^2 * 1.0    not normalized
                                  = 2^3 * .5     normalized
    
    
      Numeric examples, people friendly:
            (exponents are decimal integers, fractions are decimal)
            (normalized numbers have  1.0 > fraction >= 0.5)
    
      x1 = 0.5 * 2^4 
      x2 = 0.5 * 2^2  
    
      x1 + x2 =   0.500 * 2^4
                + 0.125 * 2^4  unnormalize to make exponents equal
                  -----------
                  0.625 * 2^4  result is normalized, done.
    
      x1 - x2 =   0.500 * 2^4
                - 0.125 * 2^4  unnormalize to make exponents equal
                  -----------
                  0.375 * 2^4  result is not normalized
               0.750 * 2^3  double the fraction, subtract one from the exponent
    
      x1 * x2 = 0.5 * 0.5 * 2^2 * 2^4 = 0.25 * 2^6   not normalized
                                      = 0.5  * 2^5   normalized
    
      x1 / x2 = (.5/.5) * 2^4/2^2 = 1.0 * 2^2    not normalized
                                  = 0.5 * 2^3    normalized
                                                 halve the fraction, add one to the exponent
    
    
    IEEE 754 Floating Point Standard
    
    A few minor problems exist, e.g. the square roots of all complex numbers
    lie in the right half of the complex plane, and thus the real
    part of a square root should never be negative. As a concession
    to early hardware, the standard defines sqrt(-0) to be -0
    rather than +0. In several places the standard uses the word 'should';
    when a standard specifies something, the word 'shall' is typically used.
    
    Basic decisions and operations for floating point add and subtract:
    
    
    
    The decisions indicated above could be used to design the control
    component shown in the data path diagram below:
    
    
    
    
    A hint on normalization, using computer scientific notation:
    
    1.0E-8 == 10.0E-9 == 0.01E-6  == 0.00000001 == 10ns == 0.01 microseconds
    
    1.0E8  ==  0.1E9  == 100.0E6  == 100,000,000 == 100MHz == 0.1 GHz
    
    1.0/1.0GHz = 1ns clock period 
    
    
    Some graphics boards have large computing capacity and
    some are releasing the specs so programmers can use the
    computing capacity.
    
    nVidia example 2007
    
    512-core by 2011, more today
    
    Programming 512 cores or more with CUDA or OpenCL is quite a challenge.
    New languages are coming, not optimized yet.
    
    Fortunately, CMSC 411 does not require VHDL for floating point,
    just the ability to manually do floating point add, subtract,
    multiply and divide. (Examples above and in class on board.)
    
    

    Lecture 12, VHDL - circuits and debugging

    
    
      Debugging VHDL (or almost any computer input)
    
      1) Expect errors. Nobody's perfect.
    
    
      2) Automate to make it easy to re-run, e.g. Makefile_411 or Makefile_ghdl
      for HW4, you may use either or both.
      
            make -f Makefile_411 tadd32.out    # cadence
            diff -iw tadd32.out tadd32.chk
            make -f Makefile_ghdl tadd32.gout  # GHDL  diff in Makefile_ghdl
            diff -iw tadd32.gout tadd32.chkg
    
    The .out and .gout differ in extra lines; the VHDL output should be the same.
      
         Use Makefile or do a lot of typing:  for cadence
      
    	run_ncvhdl.bash -v93 -messages -linedebug -cdslib ~/cs411/vhdl2/cds.lib -hdlvar ~/cs411/vhdl2/hdl.var -smartorder add32.vhdl tadd32.vhdl
    	run_ncelab.bash -v93 -messages -access rwc -cdslib  ~/cs411/vhdl2/cds.lib -hdlvar ~/cs411/vhdl2/hdl.var tadd32
    	run_ncsim.bash -input tadd32.run -batch -logfile tadd32.out -messages -cdslib ~/cs411/vhdl2/cds.lib -hdlvar  ~/cs411/vhdl2/hdl.var tadd32
    
    
         Use Makefile or do a lot of typing: for GHDL
    
    	ghdl -a --ieee=synopsys add32.vhdl
    	ghdl -a --ieee=synopsys tadd32.vhdl
    	ghdl -e --ieee=synopsys tadd32
    	ghdl -r --ieee=synopsys tadd32 --stop-time=65ns > tadd32.gout
            diff -iw tadd32.gout tadd32.chkg
    
      3) For the rest:  HW6, part1, part2a, part2b, part3a, part3b
         HW6
            make -f Makefile_411 pmul16_test.out     # cadence
            diff -iw pmul16_test.out pmul16.chk
            make -f Makefile_ghdl pmul16_test.gout   # GHDL  
            diff -iw pmul16_test.gout pmul16.chkg
      
         part1
            make -f Makefile_411 part1.out    # cadence
            diff -iw part1.out part1.chk
            make -f Makefile_ghdl part1.gout  # GHDL
            diff -iw part1.gout part1.chkg
      
         part2a
            make -f Makefile_411 part2a.out    # cadence
            diff -iw part2a.out part2a.chk
            make -f Makefile_ghdl part2a.gout  # GHDL
            diff -iw part2a.gout part2a.chkg
      
         part2b
            make -f Makefile_411 part2b.out    # cadence
            diff -iw part2b.out part2b.chk
            make -f Makefile_ghdl part2b.gout  # GHDL
            diff -iw part2b.gout part2b.chkg
      
         part3a
            make -f Makefile_411 part3a.out    # cadence
            diff -iw part3a.out part3a.chk
            make -f Makefile_ghdl part3a.gout  # GHDL
            diff -iw part3a.gout part3a.chkg
      
         part3b
            make -f Makefile_411 part3b.out    # cadence
            diff -iw part3b.out part3b.chk
            make -f Makefile_ghdl part3b.gout  # GHDL
        diff -iw part3b.gout part3b.chkg
      
    
      
      4) FIX THE FIRST ERROR !!!!
         Yes, you can fix other errors also, but one error can cause
         a cascading effect and produce many errors.
    
     Don't panic if there was only one error, you fixed it, and the
     next run reports 37 errors. The compiler has stages and
     stops at a stage if there is an error. Fixing that error
     lets the compiler move to the next stage and check for other
     types of errors. Go to step 3)
    
    
      5) Don't give up. Don't make wild guesses. Do experiment with
         one change at a time. You may actually have to read some
         of the lectures  :)
    
    
  6) Your circuit compiles and simulates but the output is not
     correct. Solution: find the first difference, or add debug printout.
     It is OK to put in debug printout; remove or comment it out before you submit.
    
         Most circuits in this course have a print process. You can
         easily add printout of more signals. Look for the existing
         code that has 'write' and 'writeline' statements.
         To print out some signal, xxx, after a 'writeline' statement add
    
               write(my_line, string'("  xxx=")); -- label printout
               hwrite(my_line, xxx);              -- hex for long signals
               write(my_line, string'("  enb="));
               write(my_line, enb);               -- bit for single values
               writeline(output, my_line);        -- outputs line
    
    
  7) You have a signal, xxx, that seems to be wrong and you cannot
     find where it gets the wrong value. Create a new process to
     print every change and the time at which it occurs.
    
         prtxxx: process (xxx)
                   variable my_line : LINE; -- my_line needs to be defined
                 begin
                   write(my_line, string'("xxx="));
                   write(my_line, xxx);         -- or hwrite for long signals
                   write(my_line, string'(" at="));
                   write(my_line, now);         -- "now" is simulation time
                   writeline(output, my_line);  -- outputs line
                 end process prtxxx;
    
         When adding 'write' statements, you may need to add the
         context clause in front of the enclosing design unit. e.g.
            library STD;
            use STD.textio.all; -- defines LINE, writeline, etc.
            library IEEE;
            use IEEE.std_logic_1164.all;
            use IEEE.std_logic_textio.all; -- defines write on std_logic (_vector)
    
    
      8) Read your code.
         Every identifier must be declared before it is used.
         Every signal MUST be set exactly once, e.g.
             xxx <= a;
             xxx <= b; -- somewhere else, BAD !
                       -- all hardware runs all the time
                       -- the ordering of some statements does not matter
    
             a0: fadd port map(a(0), b(0), cin , sum(0), c(0));
             a1: fadd port map(a(1), b(1), c(0), sum(1), c(0));
                                                         ####    BAD !
    
    Signals must match in type and size. An error saying
    "shape mismatch" means incompatible size. You cannot put
    one bit into a 32 bit signal, nor 32 bits into a one bit signal.
    A "...type... error" asks: are you putting an integer into a std_logic?
    You cannot put an identifier of type std_logic into a
    std_logic_vector.  a(31 downto 28) is of type std_logic_vector;
    a(31) is of type std_logic.
    
    
    Everywhere a specific signal name is used, these points are
    wired together. For VHDL simulation purposes, all points on a
    wire always have exactly the same value. Zero propagation delay
    through a wire. Be careful what you wire together. Use the VHDL
    reserved word 'open' for open circuits rather than NC for
    no connection.
    
    
    
    ncsim: 05.40-s011: (c) Copyright 1995-2005 Cadence Design Systems, Inc.
    ncsim> run 7 ns
    A= 1  B= 1  C= U  D= U  CNC= U  DNC= U  NC= U at time 0 ns
    A= 1  B= 1  C= U  D= U  CNC= U  DNC= U  NC= U at time 1 ns
    A= 1  B= 1  C= 1  D= U  CNC= U  DNC= U  NC= U at time 2 ns
    A= 1  B= 1  C= 1  D= U  CNC= U  DNC= U  NC= U at time 3 ns
    A= 1  B= 1  C= 1  D= 1  CNC= U  DNC= U  NC= U at time 4 ns
    A= 1  B= 1  C= 1  D= 1  CNC= U  DNC= U  NC= U at time 5 ns
    A= 1  B= 1  C= 1  D= 1  CNC= U  DNC= U  NC= U at time 6 ns
    Ran until 7 NS + 0
    ncsim> exit
                                !!!     !!! never set due to connection
    
    
    -- use_open.vhdl
    library IEEE;
    use IEEE.std_logic_1164.all;
    
    entity AN is 
      port(IN1  : in  std_logic;
           IN2  : in  std_logic;
           OUTB : inout std_logic; -- because used internally, bad design
           OUTT : out std_logic);
    end entity AN;
    
    architecture circuits of AN is
    begin  -- circuits
      OUTB <= IN1 nand IN2 after 1 ns;
      OUTT <= not OUTB     after 1 ns;
    end architecture circuits;  -- of AN
    
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    use STD.textio.all;
    use IEEE.std_logic_textio.all;
    
    entity use_open is 
    end entity use_open;
    
    architecture circuits of use_open is
      signal A : std_logic := '1';
      signal B : std_logic := '1';
      signal C, CNC : std_logic;
      signal D, DNC : std_logic;
      signal NC : std_logic := '1'; -- for no connection or tied off
    begin
      my_print : process is
                   variable my_line : line;
                 begin
                   write(my_line, string'("A= "));
                   write(my_line, A);
                   write(my_line, string'("  B= "));
                   write(my_line, B);
                   write(my_line, string'("  C= "));
                   write(my_line, C);
                   write(my_line, string'("  D= "));
                   write(my_line, D);
                   write(my_line, string'("  CNC= "));
                   write(my_line, CNC);
                   write(my_line, string'("  DNC= "));
                   write(my_line, DNC);
                   write(my_line, string'("  NC= "));
                   write(my_line, NC);
                   write(my_line, string'(" at time "));
                   write(my_line, now);
                   writeline(output, my_line);
                   wait for 1 ns;
                 end process my_print;
     
      n01: entity WORK.AN port map(A, B, open, C);
      n02: entity WORK.AN port map('1', C, open, D);
      n03: entity WORK.AN port map(A, B, NC, CNC);
      n04: entity WORK.AN port map('1', CNC, NC, DNC);
    
    end architecture circuits; -- of use_open
    
    Truth tables using type std_logic
    
    t_table.vhdl
    
    Now, some Cadence VHDL error messages.
    
    -- error.vhdl   demonstrate VHDL compiler error messages
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    
    entity AN is 
      port(IN1  : in  std_logic;
           IN2  : in  std_logic;
           OUTB : inout std_logic; -- because used internally
           OUTT : out std_logic;);
    end entity AN;
    
    architecture circuits of AN is
      signal aaa : std_logic;
    begin  -- circuits
      OUTB <= aa and IN1 and IN2 after 1 ns;
      OUTT <= not OUTB     after 1 ns;
    end architecture circuits;  -- of AN
    
    old output:
    ncvhdl: 05.40-s011: (c) Copyright 1995-2005 Cadence Design Systems, Inc.
           OUTT : out std_logic;);
                                |
    ncvhdl_p: *E,PORNKW (error.vhdl,10|28): identifier expected.
           OUTT : out std_logic;);
                                |
    ncvhdl_p: *E,MISCOL (error.vhdl,10|28): expecting a colon (':') 87[4.3.3] 93[4.3.2].
           OUTT : out std_logic;);
                                   |
    ncvhdl_p: *E,PORNKW (error.vhdl,10|31): identifier expected.
           OUTT : out std_logic;);
                                   |
    ncvhdl_p: *E,MISCOL (error.vhdl,10|31): expecting a colon (':') 87[4.3.3] 93[4.3.2].
    end entity AN;
                 |
    ncvhdl_p: *E,EXPRIS (error.vhdl,11|13): expecting the reserved word 'IS' [1.1].
      OUTB <= aa and IN1 and IN2 after 1 ns;
               |
    ncvhdl_p: *E,IDENTU (error.vhdl,16|11): identifier (AA) is not declared [10.3].
    
    
    
    Now you are ready to tackle Homework 6
    
    To simplify
    
    
    
    sqrt examples of simplify
    
    

    Lecture 13, Microprogramming - review

    
    The review is a paper handout (not for online classes; notes are on the open web).
    Following the review: microcontrollers, microprogramming, and 64-bit machines.
    
    A microcontroller may be a very small and inexpensive device.
    Its basic parts are Combinational Logic (logic gates) and some
    type of storage (Sequential Logic).
    
    
    
    
    For students who have taken CMSC 451, this is the classic
    Deterministic Finite Automaton, a Finite State Machine.
    
    
    A microcontroller may have Read Only Memory, ROM, that contains
    a microprogram to run the microcontroller. Micro assemblers and
    micro compilers may be used to generate the microprogram. The
    microprogram is manufactured in the microcontroller.
    
    Microinstructions may be very long; 40 to 64 bits is common.
    Often there are bits to directly control multiplexors.
    Often there are groups of bits to directly control other
    units such as the ALU.
    There may be bits that go directly to outputs.
    Every microinstruction may have a jump address.
    The jump may be a conditional branch based on some state bits.
    
    
    
    
    
    From Wikipedia wiki/Microcode
    
    
    Terminology: Combinational Logic is just gates. No storage.
                 Sequential Logic has storage, flipflop(s) or register(s).
                 A flipflop or register holds the output until changed.
                 A flipflop or register will have a clock and data
                 will only be input to change state on a clock edge.
                 There may be a clear or set input that does not
                 need a clock signal, typically used to initialize
                 a logic circuit to a known state.
    
    
    This lecture also covers 64-bit machines (If not covered earlier)
    
    A 64-bit architecture, by definition, has 64-bit integer registers.
    Many computers have had 64-bit IEEE floating point for many years.
    64-bit machines such as the Alpha and PowerPC have been around for a
    while, yet they only became popular on the desktop with the Intel and
    AMD 64-bit machines.
    
    
    
    Software has been dragging well behind computer architecture.
    The chaos started in 1979 with the following "choices."
    
    
    
    The full whitepaper www.unix.org/whitepapers/64bit.html
    
    My desire is to have the compiler, linker and operating system be ILP64.
    All my code would work fine. I make no assumptions about word length.
    I use sizeof(int), sizeof(size_t), etc. when absolutely needed.
    On my 8GB computer I use a single array of over 4GB, thus the subscripts
    must be 64-bit. The only option I know of for gcc is  -m64  and that
    just gives LP64. Yuk! I have to change my source code and use "long"
    everywhere in place of "int". If you get the idea that I am angry with
    the compiler vendors, you are correct!
    
    Here are sample programs and output to test for 64-bit capability in gcc:
    
    Get sizeof on types and variables big.c
    
    output from  gcc -m64 big.c  big.out
    
    malloc more than 4GB  big_malloc.c
    
    output from  big_malloc_mac.out
    
    Newer Operating Systems and compilers
    Get sizeof on types and variables big12.c
    
    output from  gcc big12.c  big12.out
    
    
    
    The early 64-bit computers were:
    
    DEC Alpha
    
    DEC Alpha
    
    IBM PowerPC note 5 clocks, similar to project
    
    

    review for midterm, handout

    Lecture 14, mid-term exam

      open book, open note, download, edit, submit
      OK to scp to windows and use Microsoft Word, scp back, submit.
      OK to use  libreoffice  on gl.umbc.edu  and submit
      Edit by placing  X  after  a)  b)  c) ...
      Also OK to highlight answer.
      Only one answer per question!
      
      Students with email user name starting  a b c d e f g h i
      download and edit  midterm33a.doc
      download midterm33a.doc 
    
    
      Students with email user name starting  j k l m n o p q
      download and edit  midterm33b.doc
      download midterm33b.doc 
    
    
      Students with email user name starting  r s t u v w x y z
      download and edit  midterm33c.doc
      download midterm33c.doc 
    
      Follow instructions in exam, edit, then
      submit  cs411  midterm  midterm33?.doc 
    
      You can do the exam on linux.gl.umbc.edu in your directory
      using libreoffice midterm33?.doc
      
      cp /afs/umbc.edu/users/s/q/squire/pub/download/midterm33?.doc .
      libreoffice midterm33?.doc
      submit cs411 midterm midterm33?.doc
      rm midterm33?.doc  only if over quota
    
      
      Before Exam:
      Review HW2, HW3, HW4 (VHDL) and HW5
      Review WEB Lecture Notes 1 through 13.
    
      There are  10  types of people:
        Those who know binary.
        Those who do not know binary.
    
      Teach your children to count in the computer age:
        zero
        one
        two
        three
        four
    
      Computer bits are numbered from the bottom
    
        0  0  1  0  1  = 5
        4  3  2  1  0    bit numbers (actually powers of 2)
    
    
    Last update 9/9/2020

    Lecture 15, Control Unit

    We now start the second half of the semester, focusing on
    the five part project to simulate part of a real computer.
    Note that the hardware does not change. Only multiplexer
    control signals are needed to execute various instructions.
    
    The first complete computer architecture is a single cycle design.
    On each clock cycle this computer executes one instruction. CPI=1
    (The clock would be slow compared to pipeline computers in the
     next lecture.)
    
    Signals are inputs to components on the left and outputs of
    components on the right. Wide lines are 32-bits. Narrow signals
    are one-bit unless otherwise indicated.
    
    
    
    
    On every clock (we use the rising edge), the program counter register,
    PC, takes the 32 bit input from the left most signal on the diagram.
    The output of the PC is a memory address for an instruction.
    
    The 32 bit instruction is "decoded" by routing various parts of the
    instruction to various places.
    Bits 31 downto 26 of the instruction go to the control unit. 
    (The schematic of the control unit is shown below.)
    Bits 10 downto 0 of the instruction go to the ALU, the shift count and
    the ALU op code.
    Bits 25 downto 21 are a register address that is read and the 32 bit
    contents of that register are placed on read data 1.
    Bits 20 downto 16 are a register address that is read and the 32 bit
    contents of that register are placed on read data 2.
    Bits 15 downto 11 are a register address that may be written with the
    32 bit write data.
    Bits 25 downto 0 go to the  jump  address computation.
    
    
    The sequence of diagrams that follow will show the control signals
    and the data paths for various instructions.
    The bit patterns for our CMSC 411 machine are cs411_opcodes.txt
    inside the ALU entity
    
    The first instruction is the  nop  instruction.
    This instruction shows the basic updating of the PC, while changing
    no other registers or memory. All other instructions shown below,
    except  branch  and  jump , use this updating of the PC.
    
    

    nop

    The PC plus 4 is the next sequential instruction address.
    The 32 bit instruction has four bytes. The bottom two bits of all
    instruction addresses are zero. The instructions are "aligned."

    The critical control signals are:
      jump     0
      branch   0
      MemWrite 0
      RegWrite 0
    The other control signals are shown for completeness.

    The next instruction, jump, is just slightly more complex than nop.
    The bit pattern for jump in cs411_opcodes.txt

    jump

    Note the wiring where instruction bits 25 downto 0 are shifted left
    two places. This provides a larger jump range and aligns the address
    on a quad byte boundary. The top four bits come from the incremented
    PC and the resulting 32 bit address is routed through the multiplexer
    back to the PC, ready for the next clock.

    The critical control signals are:
      jump     1
      MemWrite 0
      RegWrite 0
    The other control signals are shown for completeness.

    The next instruction,  branch , uses the remainder of the upper
    schematic to compute a new instruction address relative to the
    incremented PC. Note that the assembler subtracts 4 from the branch
    address before generating the machine instruction.
    The bit pattern for beq in cs411_opcodes.txt

    branch

    Note the equal comparator immediately next to the registers. This is
    the design we will use in the project because it provides better
    performance in the pipeline architecture. If the branch condition is
    not satisfied, the instruction becomes a  nop . The branch condition
    for beq is that the contents of the registers are the same and a beq
    instruction is executing. Note the and gate driving the multiplexer.

    The critical control signals are:
      jump     0
      branch   1  and the equal comparison
      MemWrite 0
      RegWrite 0
    The other control signals are shown for completeness.

    The add instruction is shown with just the data paths and control
    paths for the instruction. The upper control to increment the PC is
    the same as shown for the nop instruction.
    The bit pattern for add in cs411_opcodes.txt

    add

    The contents of two registers are combined in the ALU. The ALU op
    code in the instruction bits 5 downto 0 would have 100000 for  add .
    Other instructions such as subtract, shift, and, etc. follow the same
    data paths and control, executing the instruction coded in the
    instruction bits 5 downto 0. The output of the ALU is routed back to
    the registers and written on the falling edge of the clock, clk.

    The critical control signals are:
      jump     0
      branch   0
      MemtoReg 0
      MemWrite 0
      Aluop    1
      ALUSrc   0
      RegWrite 1
      RegDst   1
    The other control signals are shown for completeness.

    The load word,  lw , instruction computes a memory address using the
    twos complement offset in the instruction bits 15 downto 0, sign
    extended to 32 bits and added to a register. The memory is read and
    the contents from memory are routed through the multiplexer into the
    destination register. The PC is incremented as shown in the nop
    instruction.
    The bit pattern for lw in cs411_opcodes.txt

    load word, lw

    The critical control signals are:
      jump     0
      branch   0
      MemtoReg 1
      MemRead  1
      MemWrite 0
      Aluop    0  the ALU performs an add when Aluop is zero
      ALUSrc   1
      RegWrite 1
      RegDst   0
    The other control signals are shown for completeness.

    The store word,  sw , instruction computes a memory address using the
    twos complement offset in the instruction bits 15 downto 0, sign
    extended to 32 bits and added to a register. The read data 2 is
    stored in memory. The PC is incremented as shown in the nop
    instruction.
    The bit pattern for sw in cs411_opcodes.txt

    store word, sw

    Note the data path around the ALU into the write data input to the
    memory.

    The critical control signals are:
      jump     0
      branch   0
      MemRead  0
      MemWrite 1
      Aluop    0  the ALU performs an add when Aluop is zero
      ALUSrc   1
      RegWrite 0
    The other control signals are shown for completeness.

    The add immediate,  addi , instruction adds the twos complement bits
    15 downto 0 of the instruction to a register and places the sum into
    the destination register. The PC is incremented as shown in the nop
    instruction.
    The bit pattern for addi in cs411_opcodes.txt

    add immediate, addi

    The critical control signals are:
      jump     0
      branch   0
      MemtoReg 0
      MemWrite 0
      Aluop    0  the ALU performs an add when Aluop is zero
      ALUSrc   1
      RegWrite 1
      RegDst   0
    The other control signals are shown for completeness.

    The control schematic for some specific instructions, possibly not
    this semester, for the one cycle architecture, is:

    The shift left 2 circuit is just bent wires. The VHDL is
      output <= input(29 downto 0) & "00";

    The sign extend circuit is just wiring. The input is a 16 bit twos
    complement word and the output is a 32 bit twos complement word.
    The VHDL is
      output(15 downto 0)  <= input;
      output(31 downto 16) <= (others => input(15));

    cs411_opcodes.txt  different from Computer Organization and Design  1/8/2020

    rd is register destination, the result, general register 1 through 31
    rs is the first register, A, source, general register 0 through 31
    rt is the second register, B, source, general register 0 through 31
    --val---- generally a 16 bit number that gets sign extended
    --adr---- a 16 bit address, gets sign extended and added to (rx)
    "i" is generally immediate, operand value is in the instruction

    Opcode Operands   Machine code format
      6  5  5  5  5  6   number of bits in field

    3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
    1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
              |         |         |         |         |
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  nop
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 1 0 0 0 0 0  add  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 1 0 0 0 1 0  sub  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 1 1 0 0 0  mul  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 1 1 0 1 1  div  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 0 1 1 0 1  and  r,a,b
    0 0 0 0 0 0 a a a a a b b b b b r r r r r -ignored- 0 0 1 1 1 1  or   r,a,b
    0 0 0 0 0 0 0 0 0 0 0 b b b b b r r r r r s s s s s 0 0 0 0 1 1  srl  r,b,s
    0 0 0 0 0 0 0 0 0 0 0 b b b b b r r r r r s s s s s 0 0 0 0 1 0  sll  r,b,s
    0 0 0 0 0 0 0 0 0 0 0 b b b b b r r r r r -ignored- 0 0 1 0 1 1  cmpl r,b
    0 0 0 0 1 0 -----address to bits (27:2) of PC------------------  j    adr
    0 0 1 1 1 1 x x x x x r r r r r ---2's complement value--------  lwim r,val(x)
    0 0 1 1 0 0 x x x x x r r r r r ---2's complement value--------  addi r,val(x)
    0 1 1 1 0 1 a a a a a b b b b b ---2's complement address------  beq  a,b,adr
    1 0 0 0 1 1 x x x x x r r r r r ---2's complement address------  lw   r,adr(x)
    1 0 1 0 1 1 x x x x x b b b b b ---2's complement address------  sw   b,adr(x)

    Definitions:

    nop            no operation, no programmer visible registers or
                   memory are changed, except PC <= PC+4
    j adr          bits 0 through 25 of the instruction are inserted
                   into PC(27:2), probably should zero bits PC(1:0)
                   but they should be zero already
    lw r,adr(x)    load word into register r from memory location
                   (register x plus sign extended adr field)
    sw b,adr(x)    store word from register b into memory location
                   (register x plus sign extended adr field)
    beq a,b,adr    branch on equal, if the contents of register a are
                   equal to the contents of register b, add the, shifted
                   by two, sign extended adr to the PC
                   (The PC will have 4 added by then)
    lwim r,val(x)  add immediate, the contents of register x is added to
                   the sign extended value and the result put into register r
    addi r,val(x)  add immediate, the contents of register x is added to
                   the sign extended value and the result is added to register r
    add r,a,b      add register a to register b and put result into register r
    sub r,a,b      subtract register b from register a and put result into register r
    mul r,a,b      multiply register a by register b and put result into register r
    div r,a,b      divide register a by register b and put result into register r
    and r,a,b      and register a to register b and put result into register r
    or  r,a,b      or register a to register b and put result into register r
    srl r,b,s      shift the contents of register b by s places right
                   and put result in register r
    sll r,b,s      shift the contents of register b by s places left
                   and put result in register r
    cmpl r,b       one's complement of register b goes into register r

    Also: no instructions are to have side effects or additional "features"

    last updated 1/8/2020 (slight difference in opcodes from previous semesters)

    Lecture 16, Pipelining 1

    First, a few definitions:
    
    Pipelining : Multiple instructions being executed, each in a different
                 stage of their execution. A form of parallelism.
    
    Super Pipelining : Advertising term, just longer pipelines.
    
    Super Scalar : Having multiple ALU's. There may be a mix of some
                   integer ALU's and some Floating Point ALU's.
    
    Multiple Issue : Starting a few instructions every clock.
                     The CPI can be a fraction, 4 issue gives a CPI of 1/4 .
    
    Dynamic Pipeline : This may include all of the above and also can
                       reorder instructions, use data forwarding and
                       hazard workarounds.
    
    Pipeline Stages : For our study of the MIPS architecture,
                      IF   Instruction Fetch stage
                      ID   Instruction Decode stage
                      EX   Execute stage
                      MEM  Memory access stage
                      WB   Write Back into register stage
    
    Hyper anything : Generally advertising terminology.
    
    Consider the single cycle machine in the previous lecture.
    The goal is to speed up the execution of programs, long sequences
    of instructions. Keeping the same manufacturing technology, we can
    look at speeding up the clock by inserting clocked registers at
    key points. Note the placement of blue registers that tries to
    minimize the gate delay time between any pair of registers.
    Thus, allowing a faster clock.
    
    
    
    
    This is called approximate because some additional design must
    be performed, mostly on "control", that must now be distributed.
    The next step in the design, for our project, is to pass the
    instruction along the pipeline and keep the design of each
    stage of the pipeline simple, just driven by the instruction
    presently in that stage.
    
    
    
    pipe1.vhdl implementation moves instruction
                note clock and reset generation
                look at register behavioral implementation
                instruction memory is preloaded
    
    pipe1.out just numbers used for demonstration
    
    
    

    Pipelined Architecture with distributed control

    pipe2.vhdl  note additional entities
                equal6 for easy decoding
                data memory behavioral implementation

    pipe2.out  instructions move through stages

    Timing analysis

    Consider four instructions being executed. First on the single cycle
    architecture, needing 8ns per instruction. The time for each part of
    the circuit is shown. The clock would be:

         +---------------+               +---------------+               +------
         |               |               |               |               |
        -+               +---------------+               +---------------+

        Single cycle execution  125MHZ clock

        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17ns
        |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
        +-------+---+-------+-------+---+
        |IF     |ID | EX    | MEM   |WB |
        +-------+---+-------+-------+---+
                                        +-------+---+-------+-------+---+
                                        |IF     |ID | EX    | MEM   |WB |
                                        +-------+---+-------+-------+---+
                                                                        +---
                                                                        |IF ...  24ns
                                                                        +---
                                                                                 ... 32ns

    The four instructions finished in 32ns.
    An instruction started every 8ns.
    An instruction finished every 8ns.

    Now, the pipelined architecture has the clock determined by the
    slowest part between clocked registers. Typically, the ALU. Thus,
    using the same ALU time as above, the clock would be:

         +---+   +---+   +---+   +---+   +---+   +---+   +---+   +---+   +---+
         |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
        -+   +---+   +---+   +---+   +---+   +---+   +---+   +---+   +---+   +-

        Pipelined Execution  500MHZ clock

            **
        +-------+-------+-------+-------+-------+
        |IF     |ID  reg| EX    | MEM   |reg WB |
        +-------+-------+-------+-------+-------+
                +-------+-------+-------+-------+-------+
                |IF     |ID  reg| EX    | MEM   |reg WB |
                +-------+-------+-------+-------+-------+
                        +-------+-------+-------+-------+-------+
                        |IF     |ID  reg| EX    | MEM   |reg WB |
                        +-------+-------+-------+-------+-------+
                                +-------+-------+-------+-------+-------+
                                |IF     |ID  reg| EX    | MEM   |reg WB |
                                +-------+-------+-------+-------+-------+
                                    **
        |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17ns

    The four instructions finished in 16ns. (But, the speedup is not 2)
    An instruction started every 2ns.
    An instruction finished every 2ns.
    Thus, the speedup is 8ns/2ns = 4 .

    Since an instruction finishes every 2ns for the pipelined architecture
    and every 8ns for the single cycle architecture, the speedup will be
    8ns/2ns = 4. The speedup would change with various numbers of
    instructions if the total time was used. Thus, the time between the
    start or end of adjacent instructions is used in computing speedup.

    Note the ** above in the pipeline. The first of the four instructions
    may load a value into a register. This load takes place on the falling
    edge of the clock. The fourth instruction is the earliest instruction
    that could use the register loaded by the first instruction. The use
    of the register comes after the rising edge of the clock. Thus the use
    of both halves of the clock cycle is important to this architecture
    and to many modern computer architectures.

    Remember, every stage of the pipeline must be the same time duration.
    The system clock is used by all pipeline registers. The slowest stage
    determines this time duration and thus determines the maximum clock
    frequency.

    The worst case delay, which does not happen often because of
    optimizing compilers, is a load word, lw, instruction followed by an
    instruction that needs the value just loaded. The sequence of
    instructions, for this unoptimized architecture, would be:

        lw   $1,val($0)  load the 32 bit value at location val into register 1
        nop
        nop
        addi $2,21($1)   register 1 is available, add 21 and put result into reg 2

    As can be seen in the pipelined timing below, lw would load register 1
    by 9ns and register 1 would be used by addi by 10ns (**). The actual
    add would be finished by 12 ns and register 2 updated with the sum by
    15 ns (***).

                        +-------+-------+-------+-------+-------+
        lw   $1,val($0) |IF     |ID  reg| EX    | MEM   |reg WB |
                        +-------+-------+-------+-------+-------+
                                +-------+-------+-------+-------+-------+
        nop                     |IF     |ID  reg| EX    | MEM   |reg WB |
                                +-------+-------+-------+-------+-------+
                                        +-------+-------+-------+-------+-------+
        nop                             |IF     |ID  reg| EX    | MEM   |reg WB |
                                        +-------+-------+-------+-------+-------+
                                                +-------+-------+-------+-------+-------+
        addi $2,21($1)                          |IF     |ID  reg| EX    | MEM   |reg WB |
                                                +-------+-------+-------+-------+-------+
                                                    **      ***
                        |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
                        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16 ns

    It is interesting to note some similarity to an IBM Power PC that
    came a few years after the MIPS R3000 architecture that is similar
    to the above design.

    IBM Power PC stages and clock usage

    new IBM Power PC  Shipped 2012 at 5.5Ghz

    Lecture 17, Pipelining 2

    
    The pipeline for this course with branch and jump optimized:
        project part2a  adds data forwarding
        project part2b  adds stall
        project part3a  adds cache for instructions
        project part3b  adds cache for data
    
    
    
  Note the three input mux replacing the two input mux in the previous lecture.
    
      Note the distributed control using the  equal6  entity:
      eq6j: entity WORK.equal6 port map(ID_IR(31 downto 26), "000010", jump);
            jumpaddr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";
     
      cs411_opcodes.txt look at jump
    
    
    In a later lecture, we will cover data forwarding to avoid nop's in
    arithmetic and automatic stall to avoid putting all nop's in source code.
    
    For the basic machine above, we have the timing shown here.
    
    The branch slot, programming to avoid delays (filling in nop's):
    Note: beq and jump always execute the next physical instruction.
          This is called the "delayed branch slot", important for HW7.
    
        if(a==b)  x=3; /* simple C code */
        else      x=4;
        y=5;
    
           lw   $1,a       # possible unoptimized assembly language
           lw   $2,b       # no ($0) shown on memory access
           nop             # wait for b to get into register 2
           nop             # wait for b to get into register 2
           beq  $1,$2,lab1
           nop             # branch slot, always executed *********
           addi $1,4       # else part
           nop             # wait for 4 to get into register 1
           nop             # wait for 4 to get into register 1
           sw   $1,x       # x=4;
           j    lab2
           nop             # branch slot, always executed *********
    lab1:  addi $1,3       # true part
           nop             # wait for 3 to get into register 1
           nop             # wait for 3 to get into register 1
           sw   $1,x       # x=3;
    lab2:  addi $1,5       # after if-else, always execute
           nop             # wait for 5 to get into register 1
           nop             # wait for 5 to get into register 1
           sw   $1,y       # y=5;
    
    Unoptimized, 20 instructions. This code needed for project part1
    
    Now, a smart compiler would produce the optimized code:
    
           lw   $1,a       # possible unoptimized assembly language
           lw   $2,b       # no ($0) shown on memory access
           addi $4,4       # for else part later
           addi $3,3       # for true part later
           beq  $1,$2,lab1
           addi $5,5       # branch slot, always executed, for after if-else
           j    lab2
           sw   $4,x       # x=4; in branch slot, always executed !! after jump
    lab1:  sw   $3,x       # x=3;
    lab2:  sw   $5,y       # y=5;
    
    Optimized, 10 instructions. This code needed for project part2b
    
    
    The pipeline stage diagram for a==b true is:
                        1  2  3  4  5  6  7  8  9 10 11 12  clock
       lw   $1,a       IF ID EX MM WB
       lw   $2,b          IF ID EX MM WB
       addi $4,4             IF ID EX MM WB
       addi $3,3                IF ID EX MM WB
       beq  $1,$2,L1               IF ID EX MM WB     assume equal, branch to L1
       addi $5,5                      IF ID EX MM WB  delayed branch slot
       j    L2
       sw   $4,x       
    L1:sw   $3,x                         IF ID EX MM WB
    L2:sw   $5,y                            IF ID EX MM WB
                        1  2  3  4  5  6  7  8  9 10 11 12
    
    The pipeline stage diagram for a==b false is:
                        1  2  3  4  5  6  7  8  9 10 11 12 13  clock
       lw   $1,a       IF ID EX MM WB
       lw   $2,b          IF ID EX MM WB
       addi $4,4             IF ID EX MM WB
       addi $3,3                IF ID EX MM WB
       beq  $1,$2,L1               IF ID EX MM WB     assume not equal
       addi $5,5                      IF ID EX MM WB 
       j    L2                           IF ID EX MM WB  jumps to L2
       sw   $4,x                            IF ID EX MM WB
    L1:sw   $3,x       
    L2:sw   $5,y                               IF ID EX MM WB
                        1  2  3  4  5  6  7  8  9 10 11 12 13
    
        if(a==b)  x=3; /* simple C code */
        else      x=4;
        y=5;
    
    
    Renaming uses extra registers that the programmer cannot access
    (diagram in the Alpha below). With multiple units there can be
    multiple issue (parallel execution of instructions).
    
    The architecture sees the binary instructions from the following:
    
       lw   $1,a
       lw   $2,b
       nop
       sll  $3,$1,8
       sll  $6,$2,8
       add  $9,$1,$2
       sw   $3,c
       sw   $6,d
       sw   $9,e
       lw   $1,aa
       lw   $2,bb
       nop
       sll  $3,$1,8
       sll  $6,$2,8
       add  $9,$1,$2
       sw   $3,cc
       sw   $6,dd
       sw   $9,ee
    
    Two ALU's, each with their own pipelines, multiple issue, register renaming:
    The architecture executes two instruction streams in parallel.
    (Assume only 32 user programmable registers, 80 registers in hardware.)
    
       lw   $1,a           lw   $41,aa
       lw   $2,b           lw   $42,bb
       nop                 nop
       sll  $3,$1,8        sll  $43,$41,8
       sll  $6,$2,8        sll  $46,$42,8
       add  $9,$1,$2       add  $49,$41,$42
       sw   $3,c           sw   $43,cc
       sw   $6,d           sw   $46,dd
       sw   $9,e           sw   $49,ee
    
    
    
    Out of order execution to avoid delays. As seen in the first example,
    changing the order of execution without changing the semantics of the
    program can achieve faster execution.
    
    There can be multiple issue when there are multiple arithmetic and
    other units. This will require significant hardware to detect the
    amount of out of order instructions that can be issued each clock.
    
    Now, hardware can also be pipelined, for example a parallel multiplier.
    Suppose we need to have at most 8 gate delays between pipeline
    registers.
    
    
    
    Note that any and-or-not logic can be converted to use only nand gates
    or only nor gates. Thus, two level logic can have two gate delays.
    
    We can build each multiplier stage with two gate delays. Thus we can
    have only four multiplier stages then a pipeline register. Using a
    carry save parallel 32-bit by 32-bit multiplier we need 32 stages, and
    thus eight pipeline stages plus one extra stage for the final adder.
    
    
    
    Note that a multiply can be started every clock. Thus a multiply
    can be finished every clock. The speedup including the last adder
    stage is 9 as shown in:
    pipemul_test.vhdl
    pipemul_test.out
    pipemul.vhdl
    
    
    
    A 64-bit PG adder may be built with eight or fewer gate delays.
    The signals a, b and sum are 64 bits. See add64.vhdl for details.
    
    
    
    add64.vhdl
    
    
    
    Any combinational logic can be performed in two levels with "and" gates
    feeding "or" gates, assuming complementation time can be ignored.
    Some designers may use diagrams, but I wrote a Quine-McCluskey
    minimization program that computes the two level and-or-not VHDL
    statement for combinational logic.
    
    quine_mcclusky.c logic minimization
    
    eqn4.dat input data
    
    eqn4.out both VHDL and Verilog output
    
    There are 2^(2^N) possible Boolean functions of N bits.
    
    Not as practical, I wrote a Myhill minimization of a finite state machine,
    a Deterministic Finite Automata, that inputs a state transition table
    and outputs the minimum state equivalent machine. "Not as practical" 
    because the design of sequential logic should be understandable. The
    minimized machine's function is typically unrecognizable.
    
    myhill.cpp state minimization
    initial.dfa input data
    myhill.dfa minimized output
    
    
    
    A reasonably complete architecture description for the Alpha
    showing the pipeline is:
    
    basic Alpha
    more complete Alpha
    
    The "Cell" chip has unique architecture:
    
    Cell architecture
    
    Some technical data on Intel Core Duo (With some advertising.)
    
    Core Duo all on WEB
    
    From Intel, with lots of advertising:
    power is proportional to capacitance * voltage^2 * frequency, page 7.
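That relation is easy to play with; here is a hedged Python sketch (illustrative numbers only, not Intel's data):

```python
def dynamic_power(capacitance, voltage, frequency):
    # dynamic power is proportional to C * V^2 * f
    return capacitance * voltage ** 2 * frequency

# halving the voltage at the same frequency cuts power by 4x
full = dynamic_power(2.0, 1.0, 3.0)
half = dynamic_power(2.0, 0.5, 3.0)
```

The squared voltage term is why lowering supply voltage has been the main lever for power reduction.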
    
    tech overview
    
    whitepaper
    
    
    Intel quad core demonstrated
    
    
    AMD quad core
    
    By 2010 AMD had a 12-core available and Intel had an 8-core available.
     and 24 core and 48 core AMD
    
    
    IBM Power6 at 4.7GHz clock speed
    
    Intel I7 920 Nehalem 2.66GHz not quad   $279.99
    Intel I7 940 Nehalem 2.93GHz quad core  $569.99
    Intel I7 965 Nehalem 3.20GHz quad core  $999.99
    Prices vary with time, NewEgg.com search Intel I7
    
    Motherboard Asus products-motherboards-intel i7
    Intel socket 1366
    
    Supermicro.com motherboards, 12-core
    
    
    Local copies, badly formatted, in case the web page goes away. Good history.
    Core Duo 1
    Core Duo 2
    Core Duo 3
    Core Duo 4
    Core Duo 5
    Core Duo 6
    Core Duo 7
    Core Duo 8
    
    HW7 is assigned
    
    

    Lecture 18, Project Outline and VHDL

    
    
    
    Project part1 starts with  part1_start.vhdl
    Search for "???" where you need to do some work.
    !!! Remove the ??? and ... markers; they are not legal VHDL.
    
    
    
    
    
    WB_write_enb <=  needs  WB_lwop or WB_lwimop or ...
    		 
    Above: RegDst WORK.equal6  ID_IR(31 downto 26) , "000000"
    Similar for ALUSrc  compare to "000000" get complement,
    ALUSrc <= not complement  
    
    Below: need  "not inB"  signal, into  WORK.mux_32 and new
    output name that also goes into B side of ALU.
    
    with ALU schematic for all, also see more on schematic below.
    
    
    
    All versions include divide; divcas16 was covered in Lecture 8 and is provided.
    Use your add32.vhdl from HW4.
    Use your pmul16.vhdl from HW6.
     
    
    Various versions have different signal names for the same signal:
    orop_and may be just orop, the result of anding orop with RRop.
    
    S_sel may be shortened name for sllop_or_srlop
    S_sel <= sllop_and or srlop_and;
    
    	 
    Remember from cs411_opcodes.txt, the sll instruction has bottom
    six bits "000010" and typical code would call that signal sllop.
    But many instructions could have those bottom six bits, thus
    to be sure the instruction is  sll  check that the top six bits, RRop,
    equal zero and call that signal  sllop_and.
    Similar for all instructions. Some schematics use a shorthand,
    just  sllop , meaning the instruction is an  sll , yet the VHDL
    code needs  sllop_and .
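The sllop_and decode just described can be modeled in a few lines of Python (a sketch, not course code; the bit patterns are the ones quoted above):

```python
def sllop_and(instr):
    # instr is the 32-bit instruction as an integer
    top6 = (instr >> 26) & 0x3F       # RRop field, bits 31 downto 26
    bottom6 = instr & 0x3F            # function field, bits 5 downto 0
    return top6 == 0 and bottom6 == 0b000010   # sll only when both match
```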
    
    Extracted code to indicate where you need to do some work "...":
    -- part1_start.vhdl   VHDL '93 version using entities from WORK library
    part1_start.vhdl  to modify 
    
    library IEEE;
    use IEEE.std_logic_1164.all;
    
    entity alu_32 is -- given. Do not change this interface
      port(inA    : in  std_logic_vector (31 downto 0);
           inB    : in  std_logic_vector (31 downto 0);
           inst   : in  std_logic_vector (31 downto 0);
           result : out std_logic_vector (31 downto 0));
    end entity alu_32;
    
    architecture schematic of alu_32 is 
      signal cin       : std_logic := '0';
      signal cout      : std_logic;
    
      signal RRop      : std_logic;
      signal orop      : std_logic;
      signal orop_and  : std_logic;
      signal andop     : std_logic;
      signal andop_and : std_logic;
      signal S_sel     : std_logic;
    -- ??? insert other needed signals
    
      signal mulop      : std_logic;
      signal mulop_and  : std_logic;
      signal divop      : std_logic;
      signal divop_and  : std_logic;
    
      signal aresult : std_logic_vector (31 downto 0);
      signal bresult : std_logic_vector (31 downto 0);
      signal orresult : std_logic_vector (31 downto 0);
      signal andresult : std_logic_vector (31 downto 0);
      signal mulresult : std_logic_vector (31 downto 0);
      signal divresult : std_logic_vector (31 downto 0);
      signal divrem : std_logic_vector (31 downto 0);
      
    begin  -- schematic
      --
      --   REPLACE THIS SECTION FOR PROJECT PART 1
      --   (add the signals you need above "begin"
      --
    
      ORR : entity WORK.equal6 port map(inst(31 downto 26), "000000", RRop);
      Oor:  entity WORK.equal6 port map(inst(5 downto 0), "001101", orop);
      Omul: entity WORK.equal6 port map(inst(5 downto 0), "011011", mulop);
      Odiv: entity WORK.equal6 port map(inst(5 downto 0), "011000", divop);
    -- ??? insert other  xxxop  statements
    
      orop_and  <= orop and RRop;
      mulop_and <= mulop and RRop;
      divop_and <= divop and RRop;
    -- ???  insert other   xxx_and  statements
      
      
      adder: entity WORK.add32 port map(a    => inA,
                                        b    => inB,
                                        cin  => cin,
                                        sum  => aresult,
                                        cout => cout);
    
    
    
      Mul:  entity WORK.pmul16 port map(inA(15 downto 0),
                                        inB(15 downto 0),
                                        mulresult(31 downto 0));
    
      Div:  entity WORK.divcas16 port map(inA(31 downto 0),
                                          inB(15 downto 0),
                                          divresult(15 downto 0),
                                          divrem(15 downto 0));
    
      Omux: entity WORK.mux32_6 port map(in0=>aresult,
                                         in1=>bresult,
                                         in2=>andresult,
                                         in3=>orresult,
                                         in4=>mulresult,
                                         in5=>divquo32, -- declare divquo32: zero-extend divresult(15 downto 0)
                                         ct1=>S_sel,
                                         ct2=>andop_and,
                                         ct3=>orop_and,
                                         ct4=>mulop_and,
                                         ct5=>divop_and,
                                         result=>result);
    end architecture schematic;  -- of alu_32
    
    ... big cut
    
    -- put additional debug print here, if needed, delete before submit
    
    end architecture schematic; -- of part1_start
    
    Do a final search for  ???
      Oh! You need to compute WB_RRop.
      You know RRop is register to register operations  add, sub, ...
      that have 6 zeros in instruction bits  31 downto 26.
      WB  write back stage instruction is WB_IR.
      WBrrop: entity WORK.equal6 port map( WB_IR(31 downto 26),"000000", WB_RRop);
      similar statement for  WB_addiop  look up "------"
      Of course, you need to define the signals WB_RRop and WB_addiop and
      put the  or ...  inside the  )
        
        
    The additional files needed are:
    part1.abs the program to be executed
    
    part1.run to stop execution, no halt instruction
    
    part1.chk the expected output
    
    cs411_opcodes.txt opcode bit patterns
    You will need to enter opcode bit patterns not in part1_start.vhdl.
    
    Use Makefile_411   to compile and run your .vhdl with Cadence 
    Use Makefile_ghdl  to compile and run your .vhdl with GHDL
    
    
    
    
    Now, work on the ALU
    
    
    The full project writeup:
    cs411_proj.shtml
    
    
    

    Lecture 19, Pipelining Data Forwarding

    
      Data forwarding example   CMSC 411 architecture
    
      Consider the five stage pipeline architecture:
    
      IF instruction fetch, PC is address into memory fetching instruction
      ID instruction decode and register read out of two values
      EX execute instruction or compute data memory address
      M  data memory access to store or fetch a data word
      WB write back value into general register
    
    
             IF       ID          EX        M       WB
        +--+     +--+        +--+     +--+     +--+
        |  |     |  |        | A|-|\  |  |     |  |
        |  |     |  |    /---|  | \ \_|  |     |  |
        |PC|-(I)-|IR|-(R)  = |  | / / |  |-(D)-|  |--+
        |  |     |  |  ^ \---| B|-|/  |  |     |  |  |
        +--+     +--+  |     +--+     +--+     +--+  |
         ^        ^    |      ^   ALU  ^        ^    |
         |        |    |      |        |        |    |
     clk-+--------+-----------+--------+--------+    |
                       |                             |
                       +-----------------------------+
    
      Now consider the instruction sequence:
    
      400  lw  $1,100($0)  load general register 1 from memory location 100
      404  lw  $2,104($0)  load general register 2 from memory location 104
      408  nop
      40C  nop             wait for register $2 to get data
      410  add $3,$1,$2    add contents of registers 1 and 2, sum into register 3
      414  nop
      418  nop             wait for register $3 to get data
      41C  add $4,$3,$1    add contents of registers 3 and 1, sum into register 4
      420  nop
      424  nop             wait for register $4 to get data
      428  beq $3,$4,-100  branch to 314 if contents of registers 3 and 4 are equal
      42C  add $4,$4,$4    add ..., this is the "delayed branch slot" always exec.
    
    
      The pipeline stage table with NO data forwarding is:
    
      lw   IF ID EX M  WB
      lw      IF ID EX M  WB
      nop        IF ID EX M  WB
      nop           IF ID EX M  WB
      add              IF ID EX M  WB
      nop                 IF ID EX M  WB
      nop                    IF ID EX M  WB
      add                       IF ID EX M  WB
      nop                          IF ID EX M   WB
      nop                             IF ID EX M  WB
      beq                                IF ID EX M  WB
      add                                   IF ID EX M  WB
    
      time 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
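The final write-back times in these tables follow the usual pipeline timing formula, easy to check in Python (a sketch; 5 stages as above):

```python
def total_clocks(n_instructions, stages=5):
    # the first instruction takes `stages` clocks to reach WB;
    # with no stalls, each later instruction adds one clock
    return n_instructions + stages - 1
```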
    
    
    This can be significantly improved with the addition of four
    multiplexors and wiring.
    
    
    
             IF       ID                  EX          M       WB
        +--+     +--+           +--+          +--+       +--+
        |  |     |  |           | A|-(X)--|\  |  |       |  |
        |  |     |  |    /-(X)--|  | | |  \ \_|  |       |  |
        |PC|-(I)-|IR|-(R)   | = |  | | |  / / |  |-+-(D)-|  |--+
        |  |     |  |  ^ \-(X)--| B|-(X)--|/  |  | |     |  |  |
        +--+     +--+  |    |   +--+ | |      +--+ |     +--+  |
         ^        ^    |    |    ^   | |  ALU  ^   |      ^    |
         |        |    |    |    |   | |       |   |      |    |
     clk-+--------+--------------+-------------+----------+    |
                       |    |        | |           |           |
                       |    +----------+-----------+           |
                       |             |                         |
                       +-------------+-------------------------+
    
      The pipeline stage table with data forwarding is:
    
      lw   IF ID EX M  WB
      lw      IF ID EX M  WB
      nop        IF ID EX M  WB                 saved one nop
      add           IF ID EX M  WB              $2 in WB and used in EX
      add              IF ID EX M  WB           saved two nop's $3 used
      nop                 IF ID EX M WB         saved one nop        
      beq                    IF ID EX M  WB     $4 in MEM and used in ID
      add                       IF ID EX M  WB 
    
      time 1  2  3  4  5  6  7  8  9  10 11 12
    
    
      Note the required nop from using data immediately after a load.
      Note the required nop for the beq in the ID stage using an ALU result.
    
    
    The data forwarding paths are shown in green with the additional
    multiplexors. The control is explained below.
    
    
    
    Green must be added to part2a.vhdl.
    Blue already exists, used for discussion, do not change.
    
    To understand the logic better, note that MEM_RD contains the register
    destination of the output of the ALU and MEM_addr contains the value
    of the output of the ALU for the instruction now in the MEM stage.
    
    If the instruction in the EX stage has the MEM_RD destination in
    bits 25 downto 21, then MEM_addr must be routed to the A side of the ALU.
    (This is the A forward MEM_addr control signal.)
    
                       EX stage          MEM stage
                     add $4,$3,$1       add $3,$1,$2
                             |               |
                             +---------------+
    
    
    If the instruction in the EX stage has the MEM_RD destination in
    bits 20 downto 16, then MEM_addr must be routed to the B side of the ALU.
    (This is the B forward MEM_addr control signal.)
    
                       EX stage          MEM stage
                     add $4,$1,$3       add $3,$1,$2
                                |            |
                                +------------+
    
    
    To understand the logic better, note that WB_RD contains the register
    destination of the output of the ALU or Memory and WB_result contains
    the value of the output of the ALU or Memory for the instruction now
    in the WB stage.
    
    If the instruction in the EX stage has the WB_RD destination in
    bits 25 downto 21, then WB_result must be routed to the A side of the ALU.
    (This is the A forward WB_result control signal.)
    
    If the instruction in the EX stage has the WB_RD destination in
    bits 20 downto 16, then WB_result must be routed to the B side of the ALU.
    (This is the B forward WB_result control signal.)
    
    Note that a beq instruction in the ID stage that needs a value from
    the instruction in the WB stage does not need data forwarding.
    
    If a beq instruction in the ID stage has the MEM_RD destination in
    bits 25 downto 21, then MEM_addr must be routed to the top side of
    the equal comparator.
    (This is the 1 forward control signal.)
    
    If a beq instruction in the ID stage has the MEM_RD destination in
    bits 20 downto 16, then MEM_addr must be routed to the bottom side of
    the equal comparator.
    (This is the 2 forward control signal.)
    
               ID stage        EX stage        MEM stage
             beq $3,$4,-100      nop         add $4,$3,$1
                     |                            |
                     +----------------------------+
    
    
    
    If a beq instruction in the ID stage has the WB_RD destination in
    bits 20 downto 16, then WB_result must be used by the bottom side of
    the equal comparator.
    (This happens by magic. Not really; the two rules above apply.)
    
               ID stage        EX stage    MEM stage    WB stage
             beq $3,$4,-100      nop         nop       lw $4,8($3)
                     |                                     |
                     +-------------------------------------+
    
    
    
    
      The data forwarding rules can be summarized based on the
      cs411 schematic, shown above.
    
      ID stage beq data forwarding: 

          default with no data forwarding is ID_read_data_1
          1 forward MEM_addr is  ID_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw

          default with no data forwarding is ID_read_data_2
          2 forward MEM_addr is  ID_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
    
      EX stage data forwarding:
    
          default with no data forwarding is EX_A
          A forward MEM_addr is  EX_reg1=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
          A forward WB_result is  EX_reg1=WB_RD and WB_RD/=0
    
          default with no data forwarding is EX_B
          B forward MEM_addr is  EX_reg2=MEM_RD and MEM_RD/=0 and MEM_OP/=lw
          B forward WB_result is  EX_reg2=WB_RD and WB_RD/=0
    
          Note: the entity mux32_3 is designed to handle the above.
    
      ID_RD is 0 for ID_OP= beq, j, sw (nop, all zeros, automatic zero in RD)
               thus EX_RD, MEM_RD,  WB_RD = 0 for these instructions
               Because register zero is always zero, we can use 0 for
               a destination for every instruction that does not
               produce a result in a register. Thus no data forwarding
               will occur for instructions that do not produce a value
               in a register.
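The EX stage rules above can be modeled in Python to see the priority ordering (a sketch; signal names mirror the schematic, and the lw opcode string is only this example's value, check cs411_opcodes.txt):

```python
LW = "100011"   # example lw opcode; check this semester's cs411_opcodes.txt

def a_forward(ex_reg1, mem_rd, mem_op, wb_rd):
    # choose the value routed to the A side of the ALU
    if ex_reg1 == mem_rd and mem_rd != 0 and mem_op != LW:
        return "MEM_addr"    # closest prior result wins (priority input)
    if ex_reg1 == wb_rd and wb_rd != 0:
        return "WB_result"
    return "EX_A"            # default: register value read in the ID stage
```

The B side is identical with EX_reg2 in place of EX_reg1.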
    
    
      note: ID_reg1 is ID_IR(25 downto 21)
            ID_reg2 is ID_IR(20 downto 16)
            EX_reg1 is EX_IR(25 downto 21)
            EX_reg2 is EX_IR(20 downto 16)
            MEM_OP  is MEM_IR(31 downto 26)
            EX_OP   is EX_IR(31 downto 26)
          ID_OP   is ID_IR(31 downto 26)
    
            These shorter names can be used with  VHDL alias statements
    
            alias  ID_reg1 : word_5 is ID_IR(25 downto 21);
            alias  ID_reg2 : word_5 is ID_IR(20 downto 16);
            alias  EX_reg1 : word_5 is EX_IR(25 downto 21);
            alias  EX_reg2 : word_5 is EX_IR(20 downto 16);
            alias  MEM_OP  : word_6 is MEM_IR(31 downto 26);
            alias  EX_OP   : word_6 is EX_IR(31 downto 26);
            alias  ID_OP   : word_6 is ID_IR(31 downto 26);
    
    
    Why is the priority mux, mux32_3 needed?
    mux32_3.vhdl gives priority to ct1 over ct2
    
    Answer: Consider MEM_RD with a destination value 3 and
    WB_RD with a destination value 3.
    
    What should   add $4,$3,$3 use? MEM_addr or WB_result ?
    
    For this to happen, some program or some person would have
    written code such as:
    
         sub  $3,$12,$11
         add  $3,$1,$2
         add  $4,$3,$3   double the value of $3
    
    Well, rather obviously, the result of the  sub  is never used and
    thus the answer to our question is that MEM_addr must be used. This
    is the closest prior instruction with the required result. The
    correct design is implemented using the priority mux32_3 with the
    MEM_addr in the  in1  priority input.
    
    
    The control signal  A forward MEM_addr  may be implemented in VHDL as:
    
    
    
    btw: 100011 in any_IR(31 downto 26) is the  lw  opcode in this example;
         be sure to check this semester's cs411_opcodes.txt
    
    
    Here is where you may want to add a debug process. Replace AFMA
    with any signal name of interest:
    
       -- needs: use STD.textio.all;  (hwrite also needs IEEE.std_logic_textio)
       prtAFMA: process (AFMA)
                 variable my_line : LINE; -- my_line needs to be defined
               begin
                 write(my_line, string'("AFMA="));
                 write(my_line, AFMA);         -- or hwrite for long signals
                 write(my_line, string'(" at="));
                 write(my_line, now);         -- "now" is simulation time
                 writeline(output, my_line);  -- outputs line
               end process prtAFMA;
    
    
    part2a.chk has the _RD signals and values
    
    
    cs411_opcodes.txt for op code values
    
    Now, to finish part2a.vhdl, the jump and branch instructions must be
    implemented. This is shown in green on the upper part of the schematic.
    
    
    
    The signal out of the jump address box would be coded in VHDL as:
    
    jump_addr <= PCP(31 downto 28) & ID_IR(25 downto 0) & "00";
    
    The adder symbol is just another instance of your Homework 4, add32.
    
    The "shift left 2" is a simple VHDL statement:
    
    shifted2 <= ID_sign_ext(29 downto 0) & "00";
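Both concatenations are plain bit arithmetic; a quick Python check (function names are mine):

```python
def jump_addr(pcp, id_ir):
    # PCP(31 downto 28) & ID_IR(25 downto 0) & "00"
    return (pcp & 0xF0000000) | ((id_ir & 0x03FFFFFF) << 2)

def shifted2(sign_ext):
    # ID_sign_ext(29 downto 0) & "00"
    return (sign_ext & 0x3FFFFFFF) << 2
```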
    
    The project writeup:  part2a
    
    For more debugging, uncomment the print process and diff against:
    part2a_print.chk
    part2a_print.chkg
    
    

    Lecture 20, Hazards and Stalls

    
    Our design goal is to eliminate the need for  nop  instructions.
    The design method is to detect the need for a  nop  and stall
    the IF and ID stages of the pipeline, inserting a  nop  into
    the execution stage instruction register, EX_IR.
    
    
      The initial instruction sequence was:
    
      400  lw  $1,100($0)  load general register 1 from memory location 100
      404  lw  $2,104($0)  load general register 2 from memory location 104
      408  nop
      40C  nop             wait for register $2 to get data
      410  add $3,$1,$2    add contents of registers 1 and 2, sum into register 3
      414  nop
      418  nop             wait for register $3 to get data
      41C  add $4,$3,$1    add contents of registers 3 and 1, sum into register 4
      420  nop
      424  nop             wait for register $4 to get data
      428  beq $3,$4,-100  branch to 314 if contents of registers 3 and 4 are equal
      42C  add $4,$4,$4    add ..., this is the "delayed branch slot" always exec.
    
      The pipeline stage table with data forwarding and automatic hazard
      elimination reduces to:
    
      400 lw  $1,100($0)  IF  ID  EX  M   WB
      404 lw  $2,104($0)      IF  ID  EX  M   WB
      408 add $3,$1,$2            IF  ID  ID  EX  M   WB
                                          --
      40C add $4,$3,$1                IF  IF  ID  EX  M   WB
      410 beq $3,$4,-100                      IF  ID  ID  EX  M   WB
      414 add $4,$4,$4                            IF  IF  ID  EX  M   WB 
    
                     time 1   2   3   4   5   6   7   8   9   10  11  12
        (actually clock count)
        On any clock there can be only one instruction in each pipeline stage.
        Empty stages do not need to be shown, they have an inserted  nop .
        (useful for Homework 8)
    
      Note that the -- indicates that IF stage and ID stage have stalled.
      The -- also indicates a  nop  instruction has  automatically been
      inserted into the EX stage.
    
      A new instruction can not move into the ID stage when an instruction
      is stalled there. A new instruction can not move into the IF stage
      when an instruction is stalled there. No column may have more than
      one instruction in each stage. Any unlisted stage has a nop.
    
      The compiler may now generate compressed code for the computer
      architecture, saving on memory bandwidth because  nop  instructions
      are not needed in the executable memory image. (Except a rare  nop
      instruction after a branch or jump instruction.)
    
    
    The primary task will be the implementation of a "stall" signal
    for the project part2b.vhdl. The "stall" signal will then be used
    to prevent clocking of the instruction fetch, IF stage and
    instruction decode, ID stage by using a new clock signal "sclk".
    The explanation for generating "sclk" is presented below.
    Note that when the  nop  instruction is muxed into EX_IR then
    the EX_RD must be set to zero along with the existing beq, sw and jump.
    
    The changes in part2b.vhdl are in the IF and ID stages.
    Green must be added. The signal "stall" is computed from the
    information presented below.
    
    
    
    A "hazard" is a condition in the pipeline when a stage of the pipeline
    would not perform the correct processing with the available data.
    To be a hazard, the action of data forwarding, covered in the previous
    lecture, must be taken into account.
    
    Some cases where hazards would occur are:
    
         lw  $1,100($0)
         add $2,$1,$1
    
                     EX stage       MEM stage 
                   add $2,$1,$1    lw  $1,100($0)   hazard!
                                                    value for $1 not available
                
        Thus hold  add $2,$1,$1 in ID stage, insert nop in EX, this is a stall.
    
        ID stage     EX stage     MEM stage
      add $2,$1,$1     nop      lw  $1,100($0)      no hazard
       
        ID stage     EX stage     MEM stage    WB stage
                   add $2,$1,$1     nop      lw  $1,100($0)   no hazard
                           |  |                   |
                           +--+-------------------+  data forwarding
                 
    
        add $4,$3,$1
        beq $3,$4,-100
    
           ID stage           EX stage
         beq $3,$4,-100     add $4,$3,$1            hazard!
                                                    value for $4 not available
    
           ID stage           EX stage         MEM stage
         beq $3,$4,-100         nop           add $4,$3,$1         no hazard
                 |                                 |
                 +---------------------------------+   data forwarding
    
    
        lw  $5,40($1)
        beq $5,$4,L2
    
           ID stage          EX stage
         beq $5,$4,L2     lw  $5,40($1)            hazard!
                                                   value for $5 not available
    
    
           ID stage         EX stage     MEM stage
         beq $5,$4,L2        nop       lw  $5,40($1)  hazard!
                                                      value for $5 not available
    
           ID stage        EX stage     MEM stage     WB stage
         beq $5,$4,L2        nop          nop       lw  $5,40($1)    no hazard
              |                                          |
              +------------------------------------------+   normal lw
    
    
    
      Cases for stall hazards (taking into account data forwarding)
      based on cs411 schematic. This is NOT VHDL, just definitions.
    
      Note: ( OP stands for opcode, bits (31 downto 26)
              lw stands for load word opcode "100011"
              addi stands for add immediate opcode "001100" etc.
              rr_op stands for OP = "000000" )
    
      lw  $a, ...
      op  $b, $a, $a  where op is rr_op, beq, sw
    
          stall_lw is EX_OP=lw and EX_RD/=0 and
                      (ID_reg1=EX_RD or ID_reg2=EX_RD)
                      and ID_OP/=lw and ID_OP /=addi and ID_OP/=j
    
          (note: the above handles the special cases where
           sw needs both registers. sll, srl, cmpl have a zero in unused register.
           no stall can occur based on EX_RD, MEM_RD or WB_RD = 0)
    
    
      lw  $a, ...
      lw  $b,addr($a)  or addi $b,addr($a)
    
          stall_lwlw is EX_OP=lw and EX_RD/=0 and
                        (ID_OP=lw or ID_OP=addi) and
                        ID_reg1=EX_RD
    
    
      lw  $a ...
      beq $a,$a, ...
    
          stall_mem is ID_OP=beq and MEM_RD/=0 and MEM_OP=lw and
                       (ID_reg1=MEM_RD or ID_reg2=MEM_RD)
    
    
      op  $a, ...   where op is rr_op and addi
      beq $a,$a, ...  
    
          stall_beq is ID_OP=beq and EX_RD/=0 and
                       (ID_reg1=EX_RD or ID_reg2=EX_RD)
    
    
      ID_RD is 0 for ID_OP= beq, j, sw, stall (nop automatic zero)
               thus EX_RD, MEM_RD, WB_RD = 0 for these instructions
    
      rr_op is "000000" for add, sub, cmpl, sll, srl, and, mul, ...
    
      stall is  stall_lw or stall_lwlw or stall_mem or stall_beq
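The four stall cases combine into one boolean, sketched here in Python (lw, addi and rr_op strings come from the note above; the BEQ and J patterns are my placeholders, use this semester's cs411_opcodes.txt values):

```python
# lw, addi, rr_op opcode strings are from the notes above;
# BEQ and J are placeholders: take real values from cs411_opcodes.txt
LW, ADDI, RR = "100011", "001100", "000000"
BEQ, J = "000100", "000010"

def stall(id_op, id_reg1, id_reg2, ex_op, ex_rd, mem_op, mem_rd):
    stall_lw = (ex_op == LW and ex_rd != 0 and
                (id_reg1 == ex_rd or id_reg2 == ex_rd) and
                id_op not in (LW, ADDI, J))
    stall_lwlw = (ex_op == LW and ex_rd != 0 and
                  id_op in (LW, ADDI) and id_reg1 == ex_rd)
    stall_mem = (id_op == BEQ and mem_rd != 0 and mem_op == LW and
                 (id_reg1 == mem_rd or id_reg2 == mem_rd))
    stall_beq = (id_op == BEQ and ex_rd != 0 and
                 (id_reg1 == ex_rd or id_reg2 == ex_rd))
    return stall_lw or stall_lwlw or stall_mem or stall_beq
```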
    
    
    Be sure to use this semester's cs411_opcodes.txt; it changes every semester.
    cs411_opcodes.txt for op codes
    
    
    A partial implementation of  stall_lw  is:
    
    
    to get slw5 use "001100" for  addiop  per  cs411_opcodes.txt
    
    To check on the "stall" signal, you may need to add:
    
         -- needs: use STD.textio.all;
         prtstall: process (stall)
                   variable my_line : LINE; -- my_line needs to be defined
                 begin
                   write(my_line, string'("stall="));
                   write(my_line, stall);         -- or hwrite for long signals
                   write(my_line, string'(" at="));
                   write(my_line, now);         -- "now" is simulation time
                   writeline(output, my_line);  -- outputs line
                 end process prtstall;
    
    
    
    The stall clock, sclk, is:
    
         for rising edge registers:    sclk <= clk or stall   (our circuit)
    
    
    
    For checking your results:
    part2b.chk look for inserted nop's
    
    part2b.jpg  complete schematic as jpeg image
    part2b.ps  complete schematic as postscript image
    
    
    Project writeup part2b
    
    
    
    Why is eliminating  nop  from the load image important?
    Answer: memory bandwidth. RAM memory has always been slower than
    the CPU, often by a factor of 10. Thus the path from RAM memory
    into the CPU has been made wide. A 64-bit wide memory bus is
    considered small today; 128-bit and 256-bit memory input to the
    CPU is common.
    
    Many articles have been written that say "adding more RAM to your
    computer will give more performance improvement than adding a
    faster CPU." This is often true because of the complex interaction
    of the operating system, application software, computer architecture
    and peripheral equipment. Adding RAM to most computers is easy and
    can be done by non-experts. The important step in adding more RAM
    is to get the correct Dual Inline Memory Modules, DIMM's. There are
    speed considerations, voltage considerations, number of pins and
    possible pairing considerations. The problem is that there are
    many choices. The following table indicates some of the choices,
    yet does not include RAM size.
    
    Type  Memory   Symbol     Module      DIMM   Nominal   Memory
          Bus                 Bandwidth   Pins   Voltage   clock
    
    DDR4  1700MHz  PC4-2133   25.6GB/sec  288    1.2 volt
    
    DDR3  1600MHz  PC3-12800  12.8GT/sec  240    1.6 volt  200MHz
                              38.4GB/sec                           may
    DDR3  1333MHz  PC3-10600  10.7GT/sec  240    1.6 volt  166MHz  triple
    DDR3  1066MHz  PC3-8500    8.5GT/sec  240    1.6 volt  133MHz  channel
    DDR3   800MHz  PC3-6400    6.4GT/sec  240    1.6 volt  100MHz  (10ns)
    
    DDR2  1066MHz  PC2-8500   17.0GB/sec  240    2.2 volt  two channel
    DDR2  1000MHz  PC2-8000   16.0GB/sec  240    2.2 volt
    DDR2   900MHz  PC2-7200   14.4GB/sec  240    2.2 volt
    DDR2   800MHz  PC2-6400   12.8GB/sec  240    2.2 volt
    DDR2   667MHz  PC2-5300   10.6GB/sec  240    2.2 volt
    DDR2   533MHz  PC2-4200    8.5GB/sec  240    2.2 volt
    DDR2   400MHz  PC2-3200    6.4GB/sec  240    2.2 volt
    
    DDR    556MHz  PC-4500     9.0GB/sec  184    2.6 volt
    DDR    533MHz  PC-4200     8.4GB/sec  184    2.6 volt
    DDR    500MHz  PC-4000     8.0GB/sec  184    2.6 volt
    DDR    466MHz  PC-3700     7.4GB/sec  184    2.6 volt
    DDR    433MHz  PC-3500     7.0GB/sec  184    2.6 volt
    DDR    400MHz  PC-3200     6.4GB/sec  184    2.6 volt
    DDR    366MHz  PC-3000     5.8GB/sec  184    2.6 volt
    DDR    333MHz  PC-2700     5.3GB/sec  184    2.6 volt
    DDR    266MHz  PC-2100     4.2GB/sec  184    2.6 volt
    DDR    200MHz  PC-1600     3.2GB/sec  184    2.6 volt
    
    Pre DDR had 168 pin 3.3 volt DIMM's.
    Older machines had 72 pin RAM
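The module bandwidth column follows from the 64-bit (8-byte) module data bus; a quick Python check against two rows above:

```python
def bandwidth_gb_per_sec(effective_mhz, bus_bytes=8):
    # DIMM modules have a 64-bit (8-byte) data bus;
    # bandwidth = effective transfer rate * bus width
    return effective_mhz * bus_bytes / 1000.0
```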
    
    Then, there is the size of the DIMM in bytes.
    (may need 2 DDR2 or 3 DDR3 in parallel, minimum 6GB DDR3)
    
     128MB
     256MB
     512MB
    1024MB  1GB
    2048MB  2GB
    4096MB  4GB
    
    Then, there is a choice of NON-ECC or ECC, Error Correcting Code
    that may be desired in commercial systems.
    
    Then, possibly a choice of buffered or unbuffered.
    
    Then, a choice of response CL3, CL4, CL5 clock waits.
    (detailed listings may use  7-7-7-20  notation)
    
    Then, shop by price or manufacturers history of reliability.
    
    Some systems require that DIMM's of the same size and speed be installed
    in pairs. Read your computer's manual or check for information on
    WEB sites. I have used the following sites to get information and
    purchase more RAM.
    
    www.crucial.com
    
    You may search by your computer's make and model, or by
    DDR2, and see specifications to find what is available.
    
    
    www.kingston.com
    
    www.kingston.com KHX8500
    
    www.valueram.com/datsheets/KHX8500D2_1G.pdf
    
    Now, how can an architecture best make use of the combination of
    pipelines and memory? The IBM Cell Processor uses an architecture
    with a general purpose CPU on chip plus eight additional pipeline
    processors.
    
    
    
    
    
    
    
    
    
    
    
    Cell-tutorial.pdf
    
    HW8 is assigned 
    
    part2b is assigned
    
    For more debugging, uncomment the print process and diff against:
    part2b_print.chk
    
    
    

    Lecture 21, Cache

    
    The "cache" is very high speed memory on the CPU chip.
    Typical CPU's can get words out of the cache every clock.
    In order to be as fast as the logic on the CPU, the cache
    cannot be as large as the main memory. Typical cache sizes
    are hundreds of kilobytes to a few megabytes.
    
    There is typically a level 1 instruction cache and a level 1
    data cache. These would be in the blocks on our project
    schematic labeled instruction memory and data memory.
    
    Then, there is typically a level 2 unified cache that is
    larger and may be slower than the level 1 caches. Unified
    means it is used for both instructions and data.
    
    Some computers have a level 3 cache that is larger and
    slower than the level 2 cache. Multi core computers
    have at least a L1 instruction cache and a L1 data cache
    for every core. Some have a L3 unified cache that is
    available to all cores. Thus data can go from one core
    to another without going through RAM.
    
    
         +-----------+   +-----------+
         | L1 Icache |   | L1 Dcache |
         +-----------+   +-----------+
               |               |
         +---------------------------+
         | L2 unified cache          |
         +---------------------------+
                  |
               +------+
               | RAM  |
               +------+
                  |
               +------+
               | Disc |  or Solid State Drive, SSD
               +------+
    
    The goal of the architecture is to use the cache for instructions
    and data in order to execute instructions as fast as possible.
    Typical RAM requires 5 to 10 clocks to get an instruction or
    data word. A typical CPU does prefetching and branch prediction
    to bring instructions into the cache in order to minimize
    stalls waiting for instructions. You will simulate a cache and
    the associated stalls in part 3 of your project.
    
    Intel IA-64 cache structure, page 3
    IA-64 Itanium
    
    
    An approximate hierarchy is:
    
                    size    response
         CPU                  0.5 ns  2 GHz clock
         L1 cache  .032MB     0.5 ns  one for instructions, another for data
         L2 cache     4MB     1.0 ns
         RAM       4000MB     4.0 ns
         disk    500000MB     4.0 ms = 4,000,000 ns
    
    A program is loaded from disk, into RAM, then as needed
    into L2 cache, then as needed into L1 cache, then as needed
    into the CPU pipelines.
    1)  The CPU initiates the request by sending the L1 cache an address.
        If the L1 cache has the value at that address, the value is quickly
        sent to the CPU.
    2)  If the L1 cache does not have the value, the address is passed to
        the L2 cache. If the L2 cache has the value, the value is quickly
        passed to the L1 cache. The L1 cache passes the value to the CPU.
    3)  If the L2 cache does not have the value at the address, the
        address is passed to a memory controller that must access RAM
        in order to get the value. The value passes from RAM, through
        the memory controller to the L2 cache then to the L1 cache then
        to the CPU.
    
    This may seem tedious yet each level is optimized to provide good
    performance for the total system. One reason the system is fast is
    because of wide data paths. The RAM data path may be 128-bits or
    256-bits wide. This wide data path may continue through the
    L2 cache and L1 cache. The cache is organized in blocks
    (lines or entries may be used in place of the word blocks)
    that provide for many bytes of data to be accessed in parallel.
    Reading from a cache is like combinational logic: it
    is not clocked. Writing into a cache must occur on
    a clock edge.
    
    A cache receives an address, a computer address, a binary number.
    The parts of the cache are all powers of two. The basic unit of
    an address is a byte. For our study, four bytes, one word, will
    always be fetched from the cache. When working the homework
    problems be sure to read the problem carefully to determine if
    the addresses given are byte addresses or word addresses.
    It will be easiest and less error prone if all addresses are
    converted to binary for working the homework.
    
    The basic elements of a cache are:
      A valid bit: This is a 1 if values are in the cache block
      A tag field: This is the upper part of the address for
                   the values in the cache block.
      Cache block: The values that may be instructions or data
    
    In order to understand a simple cache, follow the sequence of word
    addresses presented to the following cache.
    
    
    
    
      Sequence of addresses and cache actions
    
      decimal  binary    hit/miss   action
              tag index
         1    000 001    miss       set valid, load data
         2    000 010    miss       set valid, load data
         3    000 011    miss       set valid, load data
         4    000 100    miss       set valid, load data
        10    001 010    miss       wrong tag, load data
        11    001 011    miss       wrong tag, load data
         1    000 001    hit        no action
         2    000 010    miss       wrong tag, load data
         3    000 011    miss       wrong tag, load data
        17    010 001    miss       wrong tag, load data
        18    010 010    miss       wrong tag, load data
         2    000 010    miss       wrong tag, load data
         3    000 011    hit        no action
         4    000 100    hit        no action
    
    
    
    
    
      Sequence of addresses and cache actions
    
      decimal    binary     hit/miss   action
             tag index word
         1    00   00  01    miss      set valid, load data (0)(1)(2)(3)
         2    00   00  10    hit       no action
         3    00   00  11    hit       no action
         4    00   01  00    miss      set valid, load data (4)(5)(6)(7)
        10    00   10  10    miss      set valid, load data (8)(9)(10)(11)
        11    00   10  11    hit       no action
         1    00   00  01    hit       no action
         2    00   00  10    hit       no action
         3    00   00  11    hit       no action
        17    01   00  01    miss      wrong tag, load data (16)(17)(18)(19)
        18    01   00  10    hit       no action
         2    00   00  10    miss      wrong tag, load data (0)(1)(2)(3)
         3    00   00  11    hit       no action
         4    00   01  00    hit       no action
    
    
    There are many cache organizations. The ones you should know are:
    
    A direct mapped cache: the important feature is one tag comparator.
    
    An associative cache:  the important feature is more than one tag
                           comparator. "Two way associative" means two
                           tag comparators. "Four way associative" means
                           four tag comparators.
    
    A fully associative cache: Every tag slot has its own comparator.
                               This is expensive, typically used for TLB's.
    
    For each organization the words per block may be some power of 2.
    
    For each organization the number of blocks may be some power of 2.
    
    The size of the address that the cache must accept is determined by
    the CPU. Note that the address is partitioned starting with the
    low order bits. Given a byte address, the bottom two bits do
    not go to the cache. The next bits determine the word. If there
    are 4 words per block, 2-bits are needed, if there are 8 words per
    block, 3-bits are needed, if there are 16 words per block 4-bits
    are needed. 2^4=16 or number of bits is log base 2 of number of words.
    The next bits are called the index and basically address a block.
    For 2^n blocks, n bits are needed. The top bits, whatever is not
    in the byte, word or index are the tag bits.
    
    Given a 32-bit byte address, 8 words per block, 4096 blocks you would
    have:  byte   2-bits
           word   3-bits
           index 12-bits
           tag   15-bits
                ----        +-----+-------+------+------+
          total  32-bits    | tag | index | word | byte |  address
                            +-----+-------+------+------+
                               15    12      3      2
    
    To compute the number of bits in this cache:
        4096 x 8 words at 32 bits per word = 1,048,576
        4096 x 15 bits tags                =    61,440
        4096 x 1  bits valid bits          =     4,096
                                            ----------
                                total bits = 1,114,112 (may not be a power of 2)
    
    
    Each cache block or line or entry, for this example has:
    
           valid  tag     8 words data or instructions
            +-+  +----+  +----------------------------+
            |1|  | 15 |  | 8*32=256 bits              |  total 272 bits
            +-+  +----+  +----------------------------+
    
    then 12 bit index means 2^12=4096 blocks.  4096 * 272 = 1,114,112  bits.
    
    
    
    Cache misses may be categorized by the reason for the miss:
    
    Compulsory miss: The first time a word is used and the block that
                     contains that word has never been used.
    
    Capacity miss: A miss that would have been a hit if the cache was big enough.
    
    Conflict miss: A miss that would have been a hit in a fully associative cache.
    
    
    The "miss penalty" is the time or number of clocks that are required to
    get the data value.
    
    
    Data caches have two possible architectures in addition to all
    other variations. Consider the case where the CPU is writing
    data to RAM, our store word instruction. The data actually is
    written into the L1 data cache by the CPU. There are now
    two possibilities:
    
      Write back cache: the word is written to the cache. No memory access
                        is made until the block where the word is written
                        is needed, at which time the entire block is 
                        written to RAM. It is possible the word could be
                        written, and read, many times before any memory access.
    
      Write through cache: the word is written to the cache and the single
                           word is sent to the RAM memory. This causes the
                           RAM memory to be accessed on every store word but
                           there is no block write when the block is needed
                           for other data. Most of the memory bandwidth
                           is wasted on a wide 128 or 256 bit memory bus.
    
      Tradeoff: Some motherboards have a jumper that you can change to
                have a write back or write through cache. My choice is
                a write back cache because I find it gives my job mix
                better performance.
    
    
    16 words per block. Note partition of address bits.
    
    
    
    
    A four way associative cache. Note four comparators.
    Each of the four caches could be any of the above architectures
    and sizes.
    
    
    
    
    Homework 9 on cache
    
    
    The motherboard is essential to support the CPU, RAM and
    other devices.
    
    Battle of the MotherBoards
    
    An Asus motherboard example
    
    Asus motherboards
    
    2007 Mother Boards, note RAM and hard drive capability
    
    Graphics Cards for mother boards without enough power
    
    Latest high speed IBM Power6, 448 cores at 4.7GHz
    Water cooled
    
    

    Lecture 22, Cache Performance

    
    Cache "miss rate" is used as a measure of cache performance.
    
    Given 10 accesses to a cache, 9 hits and 1 miss,
    the miss rate = 1/10 = 10%
    
    Because there must always be compulsory misses, the miss rate
    can never be zero. On some plots below, the miss rate is 1%
    meaning a 99% hit rate.
    
    The importance of the plots is not the numbers, rather the trends.
    Note that this was based on SPEC92, over 20 years ago. Programs
    were much smaller back then, yet the trend for performance is the
    same today. Caches are scaled up today, 1MB and 2MB caches are
    common and 8MB caches are available.
    
    
    Cache performance is based on two factors:
    1) Cache size            (bigger is better)
    2) Cache associativity   (more is better)
    
    
    
    
    A 4 way associative cache. Count tag equal comparators.
    
    
    
    Cache performance is based on two factors:
    1) cache size   (bigger is better)
    2) block size   (more is usually better, but not for small caches!)
    
    
    
    Caches hold a small part of memory in the CPU for fast access.
    The following two sets of memory usage are from my computers and
    show the size of some programs on Windows and Linux.
    
    Memory usage on Windows XP:
      37 processes
         Windows Explorer   18,104 KB   18 MB too big for cache
         Firefox            21,216 KB
         Photoshop          29,496 KB
         etc.
                 total     163,000 KB   163MB of 512MB used.
    
    For good performance you want to keep most of a program
    in cache. Thus, the need for caches in the megabytes.
    
    
    
    
    Memory usage on RedHat Linux:
      83 processes, 3 running
         X                 38,119 KB  way too big for cache
         Firefox           20,083 KB
         Gimp               5,402 KB  with extras running
         etc
    
    running   top    reports:
                             306 MB memory used
                             195 MB memory free
                              14 MB memory buff
    
    From:  ps -Al                         ## memory size in KB
    F S   UID   PID  PPID  C PRI  NI ADR  SZ WCHAN  TTY          TIME CMD
    4 S     0     1     0  1  75   0 -   345 schedu ?        00:00:04 init
    1 S     0     2     1  0  75   0 -     0 contex ?        00:00:00 keventd
    1 S     0     3     1  0  75   0 -     0 schedu ?        00:00:00 kapmd
    1 S     0     4     1  0  94  19 -     0 ksofti ?        00:00:00 ksoftirqd_C
    1 S     0     9     1  0  85   0 -     0 bdflus ?        00:00:00 bdflush
    1 S     0     5     1  0  75   0 -     0 schedu ?        00:00:00 kswapd
    1 S     0     6     1  0  75   0 -     0 schedu ?        00:00:00 kscand/DMA
    1 S     0     7     1  0  75   0 -     0 schedu ?        00:00:00 kscand/Norm
    1 S     0     8     1  0  75   0 -     0 schedu ?        00:00:00 kscand/High
    1 S     0    10     1  0  75   0 -     0 schedu ?        00:00:00 kupdated
    1 S     0    11     1  0  85   0 -     0 md_thr ?        00:00:00 mdrecoveryd
    1 S     0    15     1  0  75   0 -     0 end    ?        00:00:00 kjournald
    1 S     0    73     1  0  85   0 -     0 end    ?        00:00:00 khubd
    1 S     0  1012     1  0  75   0 -     0 end    ?        00:00:00 kjournald
    1 S     0  1137     1  0  85   0 -     0 end    ?        00:00:00 kjournald
    1 S     0  3676     1  0  84   0 -   524 schedu ?        00:00:00 dhclient
    5 S     0  3727     1  0  75   0 -   369 schedu ?        00:00:00 syslogd
    5 S     0  3731     1  0  75   0 -   344 do_sys ?        00:00:00 klogd
    5 S    32  3749     1  0  75   0 -   388 schedu ?        00:00:00 portmap
    5 S    29  3768     1  0  75   0 -   391 schedu ?        00:00:00 rpc.statd
    1 S     0  3812     1  0  75   0 -     0 end    ?        00:00:00 rpciod
    1 S     0  3813     1  0  85   0 -     0 schedu ?        00:00:00 lockd
    5 S     0  3825     1  0  84   0 -   343 schedu ?        00:00:00 apmd
    5 S     0  3841     1  0  85   0 -  5014 schedu ?        00:00:00 ypbind
    1 S     0  3945     1  0  75   0 -   372 pipe_w ?        00:00:00 automount
    1 S     0  3947     1  0  75   0 -   372 pipe_w ?        00:00:00 automount
    1 S     0  3949     1  0  75   0 -   372 pipe_w ?        00:00:00 automount
    5 S     0  3968     1  0  85   0 -   879 schedu ?        00:00:00 sshd
    5 S    38  3989     1  0  75   0 -   601 schedu ?        00:00:00 ntpd
    1 S     0  4013     1  0  75   0 -     0 schedu ?        00:00:00 afs_rxliste
    1 S     0  4015     1  0  75   0 -     0 end    ?        00:00:00 afs_callbac
    1 S     0  4017     1  0  75   0 -     0 schedu ?        00:00:00 afs_rxevent
    1 S     0  4019     1  0  75   0 -     0 schedu ?        00:00:00 afsd
    1 S     0  4021     1  0  75   0 -     0 schedu ?        00:00:00 afs_checkse
    1 S     0  4023     1  0  75   0 -     0 end    ?        00:00:00 afs_backgro
    1 S     0  4025     1  0  75   0 -     0 end    ?        00:00:00 afs_backgro
    1 S     0  4027     1  0  75   0 -     0 end    ?        00:00:00 afs_backgro
    1 S     0  4029     1  0  75   0 -     0 end    ?        00:00:00 afs_cachetr
    5 S     0  4037     1  0  75   0 -   354 schedu ?        00:00:00 gpm
    1 S     0  4046     1  0  75   0 -   358 schedu ?        00:00:00 crond
    5 S    43  4078     1  0  76   0 -  1226 schedu ?        00:00:00 xfs
    1 S     2  4087     1  0  85   0 -   355 schedu ?        00:00:00 atd
    4 S     0  4306     1  0  82   0 -   340 schedu tty1     00:00:00 mingetty
    4 S     0  4307     1  0  82   0 -   340 schedu tty2     00:00:00 mingetty
    4 S     0  4308     1  0  82   0 -   340 schedu tty3     00:00:00 mingetty
    4 S     0  4309     1  0  82   0 -   340 schedu tty4     00:00:00 mingetty
    4 S     0  4310     1  0  82   0 -   340 schedu tty5     00:00:00 mingetty
    4 S     0  4311     1  0  82   0 -   340 schedu tty6     00:00:00 mingetty
    4 S     0  4312     1  0  75   0 -   616 schedu ?        00:00:00 kdm
    4 S     0  4325  4312  1  75   0 - 38119 schedu ?        00:00:02 X
    5 S     0  4326  4312  0  77   0 -   877 wait4  ?        00:00:00 kdm
    4 S 12339  4352  4326  0  85   0 -  1143 rt_sig ?        00:00:00 csh
    0 S 12339  4393  4352  0  79   0 -  1034 wait4  ?        00:00:00 startkde
    1 S 12339  4394  4393  0  75   0 -   785 schedu ?        00:00:00 ssh-agent
    1 S 12339  4436     1  0  75   0 -  5012 schedu ?        00:00:00 kdeinit
    1 S 12339  4439     1  0  75   0 -  5440 schedu ?        00:00:00 kdeinit
    1 S 12339  4442     1  0  75   0 -  5742 schedu ?        00:00:00 kdeinit
    1 S 12339  4444     1  0  75   0 -  9615 schedu ?        00:00:00 kdeinit
    0 S 12339  4454  4436  0  75   0 -  2149 schedu ?        00:00:00 artsd
    1 S 12339  4474     1  0  75   0 - 10689 schedu ?        00:00:00 kdeinit
    0 S 12339  4481  4393  0  75   0 -   341 schedu ?        00:00:00 kwrapper
    1 S 12339  4483     1  0  75   0 -  9466 schedu ?        00:00:00 kdeinit
    1 S 12339  4484  4436  0  75   0 -  9772 schedu ?        00:00:00 kdeinit
    1 S 12339  4486     1  0  75   0 -  9908 schedu ?        00:00:00 kdeinit
    1 S 12339  4488     1  0  75   0 - 10299 schedu ?        00:00:00 kdeinit
    1 S 12339  4489  4436  0  75   0 -  5085 schedu ?        00:00:00 kdeinit
    1 S 12339  4493     1  0  75   0 -  9698 schedu ?        00:00:00 kdeinit
    0 S 12339  4494  4436  0  75   0 -  2942 schedu ?        00:00:00 pam-panel-i
    4 S     0  4495  4494  0  75   0 -   389 schedu ?        00:00:00 pam_timesta
    1 S 12339  4496  4436  0  75   0 -  9994 schedu ?        00:00:00 kdeinit
    1 S 12339  4497  4436  0  75   0 - 10010 schedu ?        00:00:00 kdeinit
    1 S 12339  4500     1  0  75   0 -  9503 schedu ?        00:00:00 kalarmd
    0 S 12339  4501  4496  0  75   0 -  1165 rt_sig pts/2    00:00:00 csh
    0 S 12339  4502  4497  0  75   0 -  1159 rt_sig pts/1    00:00:00 csh
    0 S 12339  4546  4501  0  85   0 -  1039 wait4  pts/2    00:00:00 firefox
    0 S 12339  4563  4546  0  85   0 -  1048 wait4  pts/2    00:00:00 run-mozilla
    0 S 12339  4568  4563  1  75   0 - 20083 schedu pts/2    00:00:01 firefox-bin
    0 S 12339  4573     1  0  75   0 -  1682 schedu pts/2    00:00:00 gconfd-2
    0 S 12339  4583  4502  0  75   0 -  5402 schedu pts/1    00:00:00 gimp
    0 S 12339  4776  4583  0  85   0 -  2140 schedu pts/1    00:00:00 script-fu
    1 S 12339  4779  4436  1  75   0 -  9971 schedu ?        00:00:00 kdeinit
    0 S 12339  4780  4779  0  75   0 -  1155 rt_sig pts/3    00:00:00 csh
    0 R 12339  4803  4780  0  80   0 -   856 -      pts/3    00:00:00 ps
    
    
    A benchmark was designed to note the discontinuity in run time
    as the data size increases past the L1 cache, then the L2 cache.
    It would take hours if the program exceeded RAM and went to
    virtual memory on disk!
    
    The basic code, a simple matrix times matrix multiply:
    
     /* matmul.c  100*100 matrix multiply */
     #include <stdio.h>
     #define N 100
     int main()
     {
       double a[N][N]; /* input matrix */
       double b[N][N]; /* input matrix */
       double c[N][N]; /* result matrix */
       int i,j,k;
    
       /* initialize */
       for(i=0; i<N; i++){    /* FYI in debugger, this is line 13 */
         for(j=0; j<N; j++){
           a[i][j] = (double)(i+j);
           b[i][j] = (double)(i-j);
         }
       }
       printf("starting multiply \n");
    
       for(i=0; i<N; i++){
         for(j=0; j<N; j++){
           c[i][j] = 0.0;
           for(k=0; k<N; k++){  /* how many instructions are in this loop? */
             c[i][j] = c[i][j] + a[i][k]*b[k][j]; /* most time spent here! */
    	                  /* this statement is executed one million times */
           }
         }
       }
       printf("a result %g \n", c[7][8]); /* prevent dead code elimination */
       return 0;
     }
    
    The actual code:
    time_matmul.c
    and results:
    time_matmul_1ghz.out
    time_matmul_p4_25.out
    time_matmul_2100.out
    
    Test results on two computers using same executable:
    
    
    
    
    A fact you should know about memory usage:
    If your program gets more memory while running, e.g. using malloc,
    then tries to release that memory when not needed, e.g. free,
    the memory still belongs to your process. The memory is not
    given back to the operating system for use by another program.
    Thus, some programs keep growing in size as they run, hopefully
    reusing, internally, any memory they previously freed.
    
    
    On Linux you can use  cat  /proc/cpuinfo  to see a brief cache size
    CS machine cpuinfo
    source code time_mp8.c
    measured time_mp8.out
    
    
    We have seen the Intel P4 architecture, and here is a view of
    the AMD Athlon architecture circa 2001.
    
    9 pipelines, possibly 9 instructions issued per clock; 3 is typical.
    
    
    
    
    You can find out your computer's cache sizes and speeds:
    
    www.memtest86.com
    Get the  .bin  file to make a bootable floppy
    Get the  .iso  file to make a bootable CD
    
    As part of the output, you do not have to run the memory test,
    you will see cache sizes and bandwidth values. (Shown on plot above.)
    
    part3a is assigned
    
    

    Lecture 23, Virtual Memory 1

    
    Most modern computers treat the programmer's addresses as virtual
    addresses. The virtual addresses must be converted to physical
    addresses in order to access data and instructions in RAM.
    
    The RAM is divided into many pages. A page is some number of
    bytes that is a power of 2. A page could be as small as 2^12=4096
    bytes up to 2^16=65536 bytes or larger. The page offset is the
    address within a specific page. The offset is 12-bits for a
    4096 byte page and 16-bits for a 65536 byte page.
    
    The virtual address and physical address do not necessarily
    have to be the same number of bits. The operation of virtual
    memory is to convert a virtual address to a physical address:
    
              Programmers Virtual Address
      +----------------------------+-------------+
      |    Virtual Page Number VPN | page offset |
      +----------------------------+-------------+
                   |                    |
                   v                    |
                  TLB                   |
                   |                    |
                   v                    v
        +--------------------------+-------------+
        | Physical Page Number PPN | page offset |
        +--------------------------+-------------+
                    RAM Physical Address
    
    TLB is the acronym for Translation Lookaside Buffer. The TLB
    is the hardware on the CPU that converts the virtual page number
    to a physical page number. The Operating System is the resource
    manager and thus assigns each process the physical page numbers
    that the process may use. The virtual page numbers come from the
    programmer's source code through compiler, assembler and loader
    onto disk. The addresses you saw in HW3 were virtual addresses,
    not the addresses where your program actually ran in RAM.
    
    
    
    
    Two programs, p1 and p2, with code segments  p1c and p2c,
    and data segments p1d and p2d. The operating system runs
    a simple program as a process. Now, each segment is
    divided into pages. p1c0, p1c1, p1c2 are the first three
    pages of program 1 code segment. These are virtual pages.
    These pages may be loaded into any physical pages in RAM.
    Each segment is consecutive as stored on disk as an
    executable program.
       disk pages, each line is a page
            ...
            p1c0   executable program 1
            p1c1
            p1c2
            p1d0
            p1d1
            ...
            p2c0   executable program 2
            p2c1
            p2d0
            p2d1
            p2d2
    
    There are also other types of segments.
    You may recall from Homework 3, the address of
    "main" was  0x08048390,  28 bits of virtual address.
    
    The page size may be chosen by the operating system author or
    in some computer architectures the page size is determined by
    the hardware, as shown below.
    
    As time goes on, the operating system allocates and frees physical pages.
    Physical memory could look like this at some time:
    (Each line is a page, e.g. 8192 bytes)
    
          os0  operating system pages
          os1
          ...
          osn
          p2d3  somewhat randomly scattered pages
          p1c2
          empty
          p2c0
          p2c1   
          p1d5
          p1c4
          etc
    
    Pages for a program may not be contiguous.
    Pages for a segment of a program may not be contiguous.
    Basically, any virtual page can be in any physical page.
    Code and data segments may not all be in physical memory.
    
    
    A TLB attached to a cache. Any cache could be used;
    a simple one word per block cache is shown.
    Note that the TLB is fully associative.
    
    
    
    
    A flow diagram showing the logical steps to get from
    an executable program's virtual address to a physical address
    that can access RAM.
    
    
    
    
    
    Note that a TLB is a cache yet it typically has some extra
    complexity. In addition to the valid bit there may be a
    "read only" bit that can easily prevent a store operation
    into a page. Another bit may be an "execute only" bit for
    instruction pages that prevents both load and store operations.
    
    A required bit is a "dirty" bit. Consider a page that is
    referenced: The page must be loaded from disk or may be
    a created page of zeros in RAM. Then eventually that physical
    page in RAM is needed for some other page. If any store operation
    changed the page in RAM, the page must be written out
    to disk. The page is "dirty" meaning changed. If the page
    in RAM is not dirty, the new page information simply
    overwrites the physical page in RAM.
    
    A significant performance requirement for the operating
    system is to efficiently handle paging. If there are
    no physical pages on the OS free page list, a Least
    Recently Used, LRU, strategy is typically used to
    choose a page to overwrite.
    
    The specific architecture of the TLB must be known in order
    to compute the number of bits of storage needed.
    
    Given: a 36-bit virtual address,
           a 32-bit physical address,
           an 8192 byte page:
    Compute:
     log2 8192 = 13-bit page offset. (2^13=8192)
     Thus the VPN is 36-13 = 23-bits
          the PPN is 32-13 = 19-bits
    
     or, drawn
              Programmers Virtual Address 36-bits
      +----------------------------+-------------+
      |    Virtual Page Number VPN | page offset |
      |      23-bits               |  13-bits    |
      +----------------------------+-------------+
                   |                    |
                   v                    |
                  TLB                   |
                   |                    |
                   v                    v
        +--------------------------+-------------+
        | Physical Page Number PPN | page offset |
        |  19-bits                 |  13-bits    |
        +--------------------------+-------------+
                    RAM Physical Address  32-bits
    
    Given 128 blocks in the TLB,
          3 bits for valid, dirty and ref
    Compute:
       log2 128 = 7-bits in TLB index
       VPN = 23 - 7-bit index gives 16 bits in TLB tag
    or drawn
        V D R tag 16-bits       PPN 19-bits       3+16+19=38-bits
       +-+-+-+------------+---------------------+
       | | | |            |                     |
       +-+-+-+------------+---------------------+
                    ...                            128 of these
       +-+-+-+------------+---------------------+
       | | | |            |                     |
       +-+-+-+------------+---------------------+
    
    thus 38 * 128 = 4864 bits in TLB.
    
    Now, given a simple page table is used, indexed by VPN,
    the page table has 2^VPN = 2^23 = 8,388,608 entries.
    
    Given a page table with three control bits V,D and R
    and a Physical Page Number then the page table needs
     1 + 1 + 1 + 19 = 22 bits.
    Total bits 22 * 8,388,608 = 184,549,376 bits.
    Using powers of 10, 184*10^6, about 184 million bits. Each
    process requires a page table. Fortunately, the OS uses
    intelligence and only builds a page table big enough
    for the size of the program or possibly for only the
    pages that are actually used. The page table itself is
    in a page and may, of course, be paged out. :)
    
    A reminder on bits in address vs size of storage:
      bits    size              approximate
       10     kilobyte  2^10    10^3
       20     megabyte  2^20    10^6
       30     gigabyte  2^30    10^9
       40     terabyte  2^40    10^12
       50     petabyte  2^50    10^15
       60     exabyte   2^60    10^18
    
    Actually, modern computers use a hierarchy of page tables.
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    See Homework 10
    
    
    

    Lecture 24, Virtual Memory 2

    
    This lecture covers the software interface to the computer
    architecture. Note that Unix existed many years before
    MS DOS and MS Windows, which later provided similar capability.
    
    
    Just a little history from the current man page for  gcc.
    Note: the terms "text" and "text segment" refer to instructions,
    executable code.
    
    From  man gcc    then  /segment
    
    -fwritable-strings
        Store string constants in the writable data segment and don't
        uniquize them.  This is for compatibility with old programs which
        assume they can write into string constants.
    
        Writing into string constants is a very bad idea; ''constants''
        should be constant.
    
        This option is deprecated.
    
    -fconserve-space
        Put uninitialized or runtime-initialized global variables into the
        common segment, as C does.  This saves space in the executable at
        the cost of not diagnosing duplicate definitions.  If you compile
        with this flag and your program mysteriously crashes after "main()"
        has completed, you may have an object that is being destroyed twice
        because two definitions were merged.
    
        This option is no longer useful on most targets, now that support
        has been added for putting variables into BSS without making them
        common.
    
    -msep-data
        Generate code that allows the data segment to be located in a
        different area of memory from the text segment.  This allows
        for execute in place in an environment without virtual memory
        management.  This option implies -fPIC.
    
    -mno-sep-data
        Generate code that assumes that the data segment follows the text
        segment.  This is the default.
    
       in same page, better       more likely bad               
    
       ===============            ===============  page boundary
       +-------------+            +-------------+
       |             | buffer     |             |    buffer
       |    code     | over run   |   data      |    over run
       +-------------+ backward   +-------------+    forward
       |             | into       |             |    into
       |    data     | code       |   code      |    code
       +-------------+            +-------------+
       ===============            ===============   page boundary
    
       Best if code and data not in same page   
       The page can then be "read-only" or "execute-only"
    
    -mid-shared-library
        Generate code that supports shared libraries via the library ID
        method.  This allows for execute in place and shared libraries in
        an environment without virtual memory management.  This option
        implies -fPIC.
    
    We will see -fPIC is used directly, below.
    
    
    Now, consider an operating system that allocates physical pages,
    via the TLB:
    1)  that contained only code - set to execute only or read only
    2)  that contained constant data - set to read only
    3)  that contained variables, including stack and heap - writable
    
    Any virus or Trojan that tried to overwrite code would be trapped.
    No possible "buffer overrun" or other malicious action could occur.
    
    But, today's operating systems may put both code and variables into
    the same physical page. This is most common with .so and .dll files.
    Thus, a hacker can cause data to be written over your program's
    instructions. What is written are harmful instructions that
    erase your hard drive or do other damage. This is a legacy OS code
    problem that dates back to small core memory systems. There does not
    seem to be a willingness to fix this currently dangerous situation.
    
    e.g. How could displaying a .jpg image allow a virus?
    Oh! Because some idiot believed the size in the header
    and kept reading data that overwrote instructions.
    Double Yuk! 1) Not checking size  2) code and data in same segment
    
    Thus, they helped create cybercrime and thus cyberdefense.
    
    A part of MS Windows is DOS, now often called a command window
    or command prompt. Just typing "help" lists most available commands.
    Different names for similar file types and commands are:
    
    Unix, Linux, MacOSX          MS Windows      description
    
    .o                           .obj            relocatable object file
    <no extension>               .exe            executable file
    .so                          .dll            shared object, dynamic link load
    .a                           .lib            library of relocatable object files
                                                 statically linked inside executable
    .c                           .c              "C" source file
    gcc -c xxx.c                 cl /c xxx.c       just make relocatable object file
    ar -crv libxxxx.a            lib /out:xxxx.lib build a library file of many
                                                 relocatable object files
            -lxxxx                    xxxx.lib   use library file
    
    
    
    An example of building a self contained executable from a  .a  library
    and an executable that needs a shared object  .so  available:
    
    A self contained executable can be distributed as a single file for
    a specific operating system.
    
    An executable file that links to .so or .dll files will be much
    smaller and only one copy of the .so or .dll file needs to be
    in RAM, even when many executable programs need them.
    The .so or .dll files must be distributed with the executable file.
    
    
    First, the main programs and the four little C library functions that
    print their name in execution:
    
 /* ax.c  for  libax.a  test */
 #include <stdio.h>
 void abc(void);  /* prototypes for the library functions */
 void xyz(void);
 int main()
 {
   printf("In ax main \n");
   abc();
   xyz();
   return 0;
 }
    
     /* abc.c for libax.a test */
     #include <stdio.h>
     void abc()
     { printf("In abc \n"); }
    
     /* xyz.c  for libax.a test */
     #include <stdio.h>
     void xyz()
     { printf("In xyz \n"); }
    
 /* ab.c  for  libab.so  test */
 #include <stdio.h>
 void aaa(void);  /* prototypes for the shared library functions */
 void bbb(void);
 int main()
 {
   printf("In ab main \n");
   aaa();
   bbb();
   return 0;
 }
    
     /* aaa.c for libab.so test */
     #include <stdio.h>
     void aaa()
     { printf("In aaa \n"); }
    
     /* bbb.c for libab.so test */
     #include <stdio.h>
     void bbb()
     { printf("In bbb \n"); }
    
     Then, the Makefile_so
     # Makefile_so  demo  ar  and  ld  and  shared library .so
    
     all: ax ab
    
     ax : ax.c  abc.c  xyz.c
    	gcc -c abc.c               # compile for library
    	gcc -c xyz.c
    	ar crv libax.a abc.o xyz.o # build library
    	ranlib libax.a
    	rm -f *.o
    	gcc -o ax ax.c -L. -lax    # use library  libax.a
    	./ax                       # execute
    
     ab : ab.c aaa.c bbb.c
    	gcc -c -fpic -shared aaa.c  # compile for library
    	gcc -c -fpic -shared bbb.c
    	ld  -o libab.so -shared aaa.o bbb.o -lm -lc
    	rm -f *.o
    	gcc -o ab ab.c -L. -lab    # use links to library
    	./ab  # need LD_LIBRARY_PATH to include this directory
                  # many users have "." meaning "here" "this directory" in path
    
     abg : ab.c aaa.c bbb.c  # uses /usr/local/lib needs root priv
    	gcc -c -fpic -shared aaa.c
    	gcc -c -fpic -shared bbb.c
    	ld  -o libab.so -shared aaa.o bbb.o -lm -lc
    	rm -f *.o
    	cp libab.so /usr/local/lib   # install for all users
    	rm -f libab.so
    	ldconfig
    	gcc -o abg ab.c -lab         # any user can get libab.so
    	./abg   # any user has access to  libab.so
    
     clean:
    	rm -f ax
    	rm -f ab
	rm -f *.a
	rm -f *.so
    
    To see what is inside, run  gcc -S -g3 ax.c  to get the assembly file:
    ax.s
    
    
    Here are some examples of addressing as seen in assembly code
    and .o or .obj files. Then in executable a.out or .exe files
    as seen through the debugger. The "relocatable" addresses are
    converted to "virtual" addresses then during execution converted
    to "physical" or RAM addresses. Coming soon to a WEB page near you.
    
    To get the memory map, yuk, output, add  -Wl,-M  to the  gcc -o ... command
    
    ax.map
    
    Remember, those huge addresses are virtual addresses.
    Your program may run with much smaller physical memory.
    
    
    
    
    

    Information that might help with Project part3

    Some are ready to implement part3 of the project.
    Part3 description. You may use a complete behavioral solution:
    just code the hit/miss process you did by hand in Homework 9, 2a.
    This may be based on the code below.

    Put the caches inside the instruction memory, part3a, and data
    memory, part3b, components (entity and architecture). (You will
    need to pass a few extra signals in and out.) Use the existing
    shared memory data as the main memory. Make a miss on the
    instruction cache cause a three cycle stall. Make a miss on the
    data cache cause a three cycle stall. Previous stalls from part2b
    must still work.

    Both instruction cache and data cache hold 16 words organized as
    four blocks of four words. Remember, VHDL memory is addressed by
    word address, the MIPS/SGI memory is addressed by byte address,
    and a cache is addressed by block number. The cache schematic for
    the instruction cache was handed out in class and shown in
    icache.jpg

    The cache may be implemented using behavioral VHDL, basically
    writing sequential code in VHDL, or by connecting hardware.
    Possible behavioral, not required, VHDL to set up the start of
    a cache: (no partial credit for just putting this in your cache.)

     -- add in or out signals to entity instruction_memory as needed
     -- for example, 'clk' 'clear' 'miss'
     architecture behavior of instruction_memory is
       subtype block_type is std_logic_vector(154 downto 0);
       type cache_type is array (0 to 3) of block_type;
       signal cache : cache_type := (others=>(others=>'0'));
       -- now we have a cache memory initialized to zero
     begin -- behavior
       inst_mem: process ... -- whatever, does not have to be just 'addr'
         variable quad_word_address : natural; -- for memory fetch
         variable cblock : block_type; -- the shaded block in the cache
         variable index : natural; -- index into cache to get a block
         variable word : natural;  -- select a word
         variable my_line : line;  -- for debug printout
         variable W0 : std_logic_vector(31 downto 0);
         ...
       begin
         ...
         index := to_integer(addr(5 downto 4));
         word  := to_integer(addr(3 downto 2));
         cblock := cache(index); -- has valid (154), tag (153 downto 128)
                                 -- W0 (127 downto 96), W1 (95 downto 64)
                                 -- W2 (63 downto 32),  W3 (31 downto 0)
                                 -- cblock is the shaded block in handout
         ...
         quad_word_address := to_integer(addr(13 downto 4));
         W0 := memory(quad_word_address*4+0);
         W1 := memory(quad_word_address*4+1);
         -- ...
         -- fill in cblock with new words, then
         cache(index) <= cblock after 30 ns; -- 3 clock delay
         miss <= '1', '0' after 30 ns;       -- miss is '1' for 30 ns
         ...
         -- the part3a.chk file has 'inst' set to zero while 'miss' is 1
         -- not required but cleans up the "diff"

     end architecture behavior; -- of instruction_memory

    For debugging your cache, you might find it convenient to add
    this 'debug' print process inside the instruction_memory
    architecture, then  diff -iw part3a.out part3a_print.chk

       debug: process -- used to print contents of I cache
         variable my_line : LINE; -- not part of working circuit
       begin
         wait for 9.5 ns; -- just before rising clock
         for I in 0 to 3 loop
           write(my_line, string'("line="));
           write(my_line, I);
           write(my_line, string'(" V="));
           write(my_line, cache(I)(154));
           write(my_line, string'(" tag="));
           hwrite(my_line, cache(I)(151 downto 128)); -- ignore top bits
           write(my_line, string'(" w0="));
           hwrite(my_line, cache(I)(127 downto 96));
           write(my_line, string'(" w1="));
           hwrite(my_line, cache(I)(95 downto 64));
           write(my_line, string'(" w2="));
           hwrite(my_line, cache(I)(63 downto 32));
           write(my_line, string'(" w3="));
           hwrite(my_line, cache(I)(31 downto 0));
           writeline(output, my_line);
         end loop;
         writeline(output, my_line); -- blank line
         wait for 0.5 ns; -- rest of clock
       end process debug;

    See  part3a_print.chk  with debug.

    You may print out signals such as 'miss' using  prtmiss  from
    debug.txt

    Change  MEMread : std_logic := '1';  to
            MEMread : std_logic := '0';  for part3b.

    You submit on GL using:  submit cs411 part3 part3a.vhdl

    Do a write through cache for the data memory. (It must work to
    the point that results in main memory are correct at the end of
    the run and the timing is correct; partial credit for partial
    functionality.) You submit this as part3b.vhdl

    Cache hierarchy on a multiple core CPU:
    AMD quad core to six core to shared memory,
    17.6 GB/s front side bus, DDR-800 RAM
    part3b

    Lecture 25, I/O types and performance

    Take a look inside the hard drive being passed around.
    
    
    
    
    Mine is bigger than yours.
    
    

    How fast can you read a block of data?

    There are four time components that must be known to answer this
    question.

    1) The time for the read head to get to the required track.
       This is seek time.
    2) The time for the disk to rotate to start reading the first
       byte. This is the rotational delay time.
    3) The time to transfer the data from the disk to your RAM.
       This is the transfer time.
    4) Overhead that can be from software, application, OS or
       drivers. This is overhead time.

    Seek time

    The head may be on any track, thus there is seek time before any
    data can be read. The manufacturer's published average seek time
    is standardized as the time to go from track 0 to the middle
    track, measured in milliseconds.

    In the 1990's the size of disks had become large enough that the
    measured average seek time was 1/4 the published average seek
    time. We use 1/4 the published average seek time for our homework
    and exams.

    For your computer, having a hard drive with capacity over 120GB,
    I suggest using 1/8 the published average seek time for your
    estimates. The reason is that the files you are working with tend
    to cluster, thus you rarely will have a seek traveling 1/4 the
    tracks on the disk.

    For my example below, the published average seek time was 5.4 ms
    and thus 5.4/4 = 1.4 ms (rounded) is used.

    Rotational delay time

    The disk is spinning at a known Revolutions Per Minute, RPM.
    We deal in seconds, thus divide the RPM by 60 to get Revolutions
    Per Second, RPS. How long, on average, does it take for the read
    head to reach data? This is the rotational delay time and it only
    depends on the RPS. On average the time will be the time for 1/2
    of a revolution, thus 1/2 * 1/RPS. Typically expressed in
    milliseconds, ms. Some values are:

       RPM     RPS   1/2 * 1/RPS
                     seconds   milliseconds
       3600     60   0.00833   8.33
       5400     90   0.00556   5.56
       7200    120   0.00417   4.17
     10,025    167   0.00299   2.99
     15,000    250   0.00200   2.00

    Transfer time

    The time to transfer data depends on the bandwidth, typically
    given in Megabytes per second. The disk drive has internal RAM
    and usually can deliver a continuous stream of bytes at near the
    maximum transfer rate. The transfer may be slowed by your
    computer's system bus, your RAM, or other contention for the
    system bus to RAM path. The example below uses an 80MB/s transfer
    rate. Thus 80MB can be transferred in one second.

    Overhead time

    The overhead time is estimated, here as 0.6 ms.

    Example

    How long does it take to read a file from disk?
    (example calculation)

    time = average seek time + average rotational delay
           + transfer time + overhead

      published average seek = 5.4 ms
      "average" seek         = 5.4/4 = 1.4 ms
      10,025 RPM or 167 RPS, 1/2 * 1/167 = .00299 sec = 3.0 ms
      Overhead assumed       = 0.6 ms
      Size independent delay, sum = 5.0 ms

    At 80 MB/sec transfer rate:

      10KB    100KB    1MB     10MB
      0.125   1.25     12.5    125.    transfer time in ms
      5.0     5.0       5.0      5.0   size independent delay
      _____   ____     ____    _____
      5.125   6.25     17.5    130.0   ms

    This is a one block "first read". The next read could be buffered.

    Notice that on small files, the latency (times 1, 2 and 4)
    dominates. On large files the transfer time dominates. Many years
    ago most files were around 10 kilobytes. Today, files in the tens
    of megabytes are common; 1 to 10 megabytes is typical.

    A benchmark I ran on reading 1KB, 10KB, 100KB, and 1MB of data
    from a 10MB file:

     /* time_io.c  check how much is cached in ram */
     /* assumed pre-existing data file time_io.dat */
     /* created by running time_io_init */
     #include <stdio.h>
     #include <time.h>
     int main()
     {
       FILE * handle;
       int i;
       int j;
       double cpu;
       char buf[1000000]; /* 1MB */
       int check;
       int n = 10000; /* number of reads on 10MB file for buf1 */
       int k = 1000;  /* number of bytes read per read */

       printf("time_io.c 10MB file, read 1KB, 10KB, 100KB, 1MB \n");
       handle = fopen("time_io.dat","rb");
       printf("On rebooted machine, first read \n");
       cpu = (double)clock()/(double)CLOCKS_PER_SEC;
       for(i=0; i<n; i++)
       {
         check = fread(buf, k, 1, handle);
         if(check != buf[1]) printf("check failed \n");
       }
       cpu = (double)clock()/(double)CLOCKS_PER_SEC - cpu;
       fclose(handle);
       printf("first read time %g seconds \n", cpu);
       for(n=10000; n>=10; n=n/10)
       {
         printf("more reads, cached? consistent? \n");
         for(j=2; j<10; j++)
         {
           handle = fopen("time_io.dat","rb");
           cpu = (double)clock()/(double)CLOCKS_PER_SEC;
           for(i=0; i<n; i++)
           {
             check = fread(buf, k, 1, handle);
             if(check != buf[1]) printf("check failed \n");
           }
           cpu = (double)clock()/(double)CLOCKS_PER_SEC - cpu;
           fclose(handle);
           printf("%d read time %g seconds for %dKB block \n",
                  j, cpu, k/1000);
         }
         k = k*10;
       }
       return 0;
     } /* end time_io.c */

    One computer's output:

     time_io.c 10MB file, read 1KB, 10KB, 100Kb, 1MB
     On rebooted machine, first read
     first read time 0.12 seconds
     more reads, cached? consistent?
     2 read time 0.06 seconds for 1KB block
     3 read time 0.06 seconds for 1KB block
     4 read time 0.06 seconds for 1KB block
     5 read time 0.06 seconds for 1KB block
     6 read time 0.06 seconds for 1KB block
     7 read time 0.06 seconds for 1KB block
     8 read time 0.06 seconds for 1KB block
     9 read time 0.05 seconds for 1KB block
     more reads, cached? consistent?
     2 read time 0.05 seconds for 10KB block
     3 read time 0.05 seconds for 10KB block
     4 read time 0.04 seconds for 10KB block
     5 read time 0.05 seconds for 10KB block
     6 read time 0.05 seconds for 10KB block
     7 read time 0.05 seconds for 10KB block
     8 read time 0.05 seconds for 10KB block
     9 read time 0.05 seconds for 10KB block
     more reads, cached? consistent?
     2 read time 0.08 seconds for 100KB block
     3 read time 0.07 seconds for 100KB block
     4 read time 0.09 seconds for 100KB block
     5 read time 0.07 seconds for 100KB block
     6 read time 0.07 seconds for 100KB block
     7 read time 0.06 seconds for 100KB block
     8 read time 0.08 seconds for 100KB block
     9 read time 0.08 seconds for 100KB block
     more reads, cached? consistent?
     2 read time 0.09 seconds for 1000KB block
     3 read time 0.09 seconds for 1000KB block
     4 read time 0.09 seconds for 1000KB block
     5 read time 0.09 seconds for 1000KB block
     6 read time 0.11 seconds for 1000KB block
     7 read time 0.10 seconds for 1000KB block
     8 read time 0.10 seconds for 1000KB block
     9 read time 0.10 seconds for 1000KB block

    Why did I reboot to run a file read test? On a computer that is
    not shut down, a file could remain in RAM, and even partially in
    cache, for days to weeks if you were not using the computer.

    By now you should know that I do a lot of benchmarking. I ran the
    above program on two computers, each with two operating systems,
    with three disk types.

     Block   2.5GHz       2.5GHz       1GHz         1GHz
     Size    P4 ATA 100   P4 ATA 100   ATA 66       SCSI 160
             Windows XP   Linux        Windows 98   Linux
     1KB     0.0000015    0.000001     0.000016     0.000004
     10KB    0.000015     0.000010     0.000060     0.000035
     100KB   0.000150     0.000100     0.000500     0.000300
     1MB     0.003100     0.002000     0.005000     0.004000

    Fine print: CPU time in seconds, most frequent value of eight
    measurements after the first read. Using fopen, fread, binary
    block read. Each measurement read 10MB, e.g. 10 blocks read at
    1MB, 100 blocks read at 100KB, 10,000 blocks read at 1KB. Other
    than the first number, which is 1.5 microseconds, the numbers can
    be read as integer microseconds.

    As expected the SCSI disk was faster than the ATA disk. Note that
    a faster system clock can allow the actual transfer rate to be
    near the maximum, while a slower clock speed can limit the
    transfer rate. The operating system, drivers and libraries have
    some impact on total time. This is lumped into "overhead."

    Where do you find the disk specifications? Both the manufacturer
    and some retailers publish the disk specifications, and some
    prices. e.g.
    evolution specs
    2007 hard drives, note cache, RPM, transfer rate

    Then SATA replaced ATA

    Serial ATA changed the wiring and protocol. ATA had wide flat
    cables. Driven by PC manufacturers Dell, Gateway, HP, etc., who
    needed thinner cables. Thus higher speed transfer over fewer
    wires. A typical SATA bus maximum transfer rate is 3 Gb/s,
    3 gigabits per second, about 300 MB/s. Similar latency, similar
    seek, faster transfer rate. A single drive with 500GB of storage
    became available at reasonable cost. A terabyte of disk storage
    became practical for a desktop PC. Now multiple terabyte 6 Gb/s
    disks are available.

    Still too slow!

    Now, SSD, Solid State Disks

    Replace the rotating disk drive with NAND Flash digital logic
    storage.

      Technology explanation
      Performance comparisons
      One technical specification
        Transcend 128GB $229.99 TS128GSSD25S-M
        enclosure was needed for desktops, initially
      Check for latest size, speed, cost
        computer-SSD-search
        SSD

    Reworking the example above for the time to read a file:

    Transfer time

    The time to transfer data depends on the bandwidth, typically given in Megabytes per second. The example below uses an 80MB/s transfer rate. Thus 80MB can be transferred in one second.

    Overhead time

    The overhead time is estimated, here as 0.6 ms.

    No seek time, no rotational delay time, for SSD

    Example

    How long does it take to read a file from an SSD?
    (example calculation)

    time = transfer time + overhead

    At 80 MB/sec transfer rate:

      10KB    100KB    1MB     10MB
      0.125   1.25     12.5    125.    transfer time in ms
      0.6     0.6       0.6      0.6   overhead
      _____   ____     ____    _____
      0.725   1.85     13.1    125.6   ms

    This is a one block "first read". The next read could be buffered.

    Notice that on very small files, the overhead time dominates.
    On large files the transfer time dominates. Many years ago most
    files were around 10 kilobytes. Today, files in the tens of
    megabytes are common.

    The SSD has a speedup of 7.07 for a 10KB file.
    The SSD has a speedup of 1.03 for a 10MB file.
    Your mileage may vary.

    A typical desktop is executing 4,000,000 instructions per
    millisecond.

    Homework 11

    Lecture 26, DVR, DVD-RW, CDR, CD-RW

    
    This lecture covers device characteristics and formats
    of CD's and DVD's
    
    It also covers aspects that bring together technology, business,
    teaming and public buying patterns.
    
    
    There are many "ports" that allow CD and DVD connection to a common PC.
    
        Parallel Port, IEEE 1284, about 2.5MB/sec
    
        USB2, Universal Serial Bus, 60MB/sec
        USB3, Universal Serial Bus, 600MB/sec some available in 2011
    
        PCI, Peripheral Component Interconnect (bus) 528MB/sec
    
        Firewire, IEEE 1394,   400Mbit/sec =  50MB/sec
        Firewire, IEEE 1394b,  800Mbit/sec = 100MB/sec
        Firewire, IEEE 1394c,  800Mbit/sec = 100MB/sec over Cat 5e
    
        SCSI, Small Computer System Interconnect, 320MB/sec
        SCSI, up to                               640MB/sec
    
        ATA, Advanced Technology Attachment (commands) 160MB/sec
        SATA 150MB/sec to 300MB/sec
        SATA 3 to 750MB/sec = 6Gbit/sec
    
        Unfortunately, the fastest DVD's are much slower.
    
        CD and DVD drives can be found for many of these ports.
    
    The "media" is the physical disk and typical names are:
    
      CD   a pre recorded disk
      CDR  a blank disk that can be recorded once
      CDRW a blank disk that can be recorded many times
    
      DVD     a pre recorded disk
      DVD-R   a blank disk, dash media, that can be recorded once
      DVD+R   a blank disk, plus media, that can be recorded once
      DVD-RW  a blank disk, dash media, that can be recorded many times
      DVD+RW  a blank disk, plus media, that can be recorded many times
      DVD-RAM a blank disk, RAM media, that can be recorded many times
      
      Blu Ray DVD pre recorded or recordable
      HD DVD      pre recorded or recordable
    
    There are many formats that can be used for CD's
      Most of the varieties are audio formats.
      There is a VCD, Video CD format.
      The digital format is UDF, ISO 9660 compatible
    
    DVD's chose to have only the UDF format
      The information on a DVD or CD using UDF is directories
      and files similar to any computer file system.
      Movies use a set of files in MPEG format within the UDF file system.
    
      In Windows, Windows Explorer or prompt command  dir
      or in Linux or any Unix, the command  ls  can be
      used to look at the directory structure of the UDF file system.
    
    Here is one such listing. Note required directory name video_ts
    and required file name video_ts for a DVD to play a movie.
    
     Volume in drive E is ITALIAN_JOB
     Volume Serial Number is 4E8F-DF0F
    
     Directory of E:\
    
    08/12/2003  03:13 AM              VIDEO_TS
                   0 File(s)              0 bytes
    
     Directory of E:\VIDEO_TS
    
    08/12/2003  03:13 AM              .
    08/12/2003  03:13 AM              ..
    08/12/2003  03:13 AM            20,480 VIDEO_TS.BUP
    08/12/2003  03:13 AM            20,480 VIDEO_TS.IFO
    08/12/2003  03:13 AM           909,312 VIDEO_TS.VOB
    08/12/2003  03:13 AM            18,432 VTS_01_0.BUP
    08/12/2003  03:13 AM            18,432 VTS_01_0.IFO
    08/12/2003  03:13 AM           268,288 VTS_01_0.VOB
    08/12/2003  03:13 AM            10,240 VTS_01_1.VOB
    08/12/2003  03:13 AM            22,528 VTS_02_0.BUP
    08/12/2003  03:13 AM            22,528 VTS_02_0.IFO
    08/12/2003  03:13 AM        16,521,216 VTS_02_0.VOB
    08/12/2003  03:13 AM       387,725,312 VTS_02_1.VOB
    08/12/2003  03:13 AM            28,672 VTS_03_0.BUP
    08/12/2003  03:13 AM            28,672 VTS_03_0.IFO
    08/12/2003  03:13 AM       760,942,592 VTS_03_1.VOB
    08/12/2003  03:13 AM            79,872 VTS_04_0.BUP
    08/12/2003  03:13 AM            79,872 VTS_04_0.IFO
    08/12/2003  03:13 AM       103,512,064 VTS_04_0.VOB
    08/12/2003  03:13 AM     1,073,709,056 VTS_04_1.VOB
    08/12/2003  03:13 AM     1,073,709,056 VTS_04_2.VOB
    08/12/2003  03:13 AM     1,073,709,056 VTS_04_3.VOB
    08/12/2003  03:13 AM     1,073,709,056 VTS_04_4.VOB
    08/12/2003  03:13 AM     1,073,709,056 VTS_04_5.VOB
    08/12/2003  03:13 AM        18,653,184 VTS_04_6.VOB
    08/12/2003  03:13 AM            38,912 VTS_05_0.BUP
    08/12/2003  03:13 AM            38,912 VTS_05_0.IFO
    08/12/2003  03:13 AM     1,073,709,056 VTS_05_1.VOB
    08/12/2003  03:13 AM       343,238,656 VTS_05_2.VOB
    08/12/2003  03:13 AM            14,336 VTS_06_0.BUP
    08/12/2003  03:13 AM            14,336 VTS_06_0.IFO
    08/12/2003  03:13 AM       136,196,096 VTS_06_1.VOB
                  30 File(s)  8,210,677,760 bytes
    
         Total Files Listed:
                  30 File(s)  8,210,677,760 bytes
                   3 Dir(s)               0 bytes free
    
    
    The speeds of CD's and DVD's have a large range. Generally they
    became faster as time passed and more were sold.
    
        CD                 DVD
        1X = 150KB/sec     1X =  1.38MB/sec
        2X = 300KB/sec     2X =  2.76MB/sec
       10X = 1.5MB/sec     4X =  5.52MB/sec
       20X = 3.0MB/sec     8X = 11 MB/sec
       40X = 6.0MB/sec    16X = 22 MB/sec
    
       Much slower than hard drives.
    
       Most drives can read at a higher speed than they can write.
    
    Capacity:
      The disk capacity for CD's is from 74 to 80 minutes of music or
      650MB to 700MB of digital storage in UDF file system.
    
      DVD's have a wider range of storage from 2 to 4 hours of movies or
       4.7GB  single sided single layer
       8.5GB  single sided double layer
       9.4GB  double sided single layer
      17.1GB  double sided double layer
    
      Blu Ray and HD DVD are aiming for 20 to 40 hours of conventional
      movies or 4 to 8 hours of HDTV, high definition TV, 1080i  or
      hundreds of Gigabytes. The market was not stable for a time,
      and some technology, business, teaming and buying patterns
      are covered to show where we are now.
    
    Technical information on CD's and DVD's
    
    DVD and CD Writing Technology 
    
    cont. 
    
    Reviews 
    
    Burn DVD using Linux
    
    DVD-RW
    
    Protecting, who?
    
    Sony and friends vs. Toshiba and friends
    
    Blu-ray vs. HD DVD
    
    
    
    A prototype TDK 200GB blue laser disc would be able to hold a full
    18 hours of high-definition video, the company said.
    
    
    
    
    1/5/2007
    First Combo High-Def DVD Player Announced
    Oh man, is CES going to be good! Lots of disruptive products out there,
    and I'm particularly excited about a new one from LG. The company
    promises to show off the first combo/hybrid drive for Blu-ray and HD DVD,
    possibly putting an end to the whole war for good. That's good news for
    consumers, who have mostly ignored the new discs. Our story wraps up what
    we know about LG's CES announcements, and also provides analysis as to
    what it means, and why this is so cool. Check it out, and stay tuned to
    our CES coverage all next week at www.pcmag.com/ces for all the
    breakthroughs.
    
    Get software to make your own DVD's
    
    More HD DVD vs Blu-ray, gamers' view
    
    
    Blu_ray vs HD_DVD origin
    
    
    
    

    And a final straw: Walmart said it would only carry Blu-ray. HD DVD faded into oblivion.

    Lecture 27, Busses, I/O-processor connection

    A "bus" is just a number of wires in parallel used to transfer
    information from one device to another device. The wires may
    be built into a printed wiring board, PWB, or may be in
    a flexible cable.
    
    The most important specification for a bus is its protocol.
    The protocol defines the method for accessing the bus, read
    requests, write requests, address and data sequencing, etc.
    
    There may be many devices on a bus. In order for all the
    devices to work together, all must follow the protocol.
    
    A possible bus may have the following sets of lines.
    
    
    
    The Control lines are used to implement the protocol.
    
    There may be a bus master, hardware, that arbitrates when
    two devices want to get on the bus at the same time.
    
    When a bus has a clock, the bus is called synchronous.
    All signals change on rising edge, falling edge or both.
     
    An asynchronous bus is driven at the speed of the device
    currently driving the bus.
    
    Diagram showing how busses might be connected in a computer:
    
    
    
    The bandwidth, speed, of a bus may be measured in
      bits per second, bps     Mbps is 10^6 bps, not 2^20 bps
      bytes per second, Bps    communication is typically powers of 10
      megahertz, MHz
      words per second
      transactions per second
    
    A transaction is a complete protocol sequence.
    An example with time progressing down:
    
          Device 1                                 Device 2
     wait for bus available
     put address on bus
     set request to 1
     wait for Ack = 1, acknowledge
                                        wake up because request = 1
                                        save address from bus
                                        set Ack to 1
                                        wait for request = 0
     wake up because Ack = 1
     release address lines
     set request to 0
     wait for ready = 1
                                        wake up because request = 0
                                        set Ack to 0
                                        put data on the bus
                                        set ready to 1
                                        wait for Ack = 1
     wake up because ready = 1
     save data from bus
     set Ack to 1
     wait for ready = 0
                                        wake up because Ack = 1
                                        release data lines
                                        set ready to 0
                                        finished this transaction
     wake up because ready = 0
     set Ack to 0
     finished this transaction
     bus is available
    
    Often, the bus protocol is implemented as a Deterministic Finite
    Automaton, DFA. The state diagram for the above protocol could be
    shown as:
    
    
    
    
    
    
      Examples of Busses   circa 2012 including older  (changes with time)
    
      Bus name    Max       Max      Max   width  comment
                  Mbits     MBytes   MHz
                  per sec   per sec
    
      front side  17,024    2,128    133   128    many possible
                  34,048    4,256    133   256
                  19,200    2,400    150   128
                  85,248   10,656    333   256
                 136,448   17,056    533   256
                 204,800   25,600    800   256
                 225,280   26,160    880   256
                 256,000   32,000  1,000   256
                 320,000   40,000  1,250   256    (PPC Mac G5)
                 307,200   38,400  1,600   192    (I7 extreme, 1 channel)
    
      AGP          2,112      264     66    32
      AGP8X       17,056     2,132   533    32
    
      PCI          1,056      132     33    32
      PCI          2,112      264     33    64
      PCI          2,112      264     66    32
      PCI          4,224      528     66    64
      PCI          4,224      528    133    32
      PCI          8,448    1,056    133    64
      PCIX        17,056    2,132    533    32    extended, compatible
      PCIe        64,000    8,000   2000    32    express, one way, full duplex
                                                  1,2,4,8,12,16 or 32 lanes
      ATA 100        800      100     25    32
      ATA 133       1064      133     33    32
      ATA 160       1280      160     40    32
      SATA 150      1200      150    600     2    one way, full duplex
      SATA std      1500      187   1500     1    one way, full duplex
                                                  limited by motherboard
      SATA II 300   2400      300   1200     2
      SATA II std   3000      375   3000     1    no forcing to build standard
      SATA 3.0      6000      750   6000     1
    
      SCSI 1          40        5      5     8
      SCSI 2         160       20     10    16
      SCSI 3        1280      160     80    16
      SCSI UW3      2560      320    160    16
      SCSI 320      5120      640    320    16    has cable terminators
    
      Firewire1394   400       50    400     1
      Firewire1394b  800      100    800     1    many video cameras
      Firewire S16  1600      200   1600     1
      Firewire S32  3200      400   3200     1
      Firewire S80  6400      800   6400     1
    
      USB 1.1         12        1.5   12     1    slow
      USB 2          480       60    480     1    new cable
      USB 3         3200      400   1600     2    new cable, dual differential
                    5000      625   2500     2    new connectors, optional speed
                    6400      800   3200     2    micro, mini, connectors etc.
    
      Fiberchannel  1000      125   1000     1    1062.5
      Fiberchannel  2000      250   2000     1    >mile
      Fibre 16GFC            3200  14000          full duplex 10Km
      Fibre 20GFC            5100  21000          full duplex
    
      Ethernet 10     10        1.25  10     1        
      Ethernet 100   100       12.5  100     1
      Ethernet 1Gig 1000      125   1000     1
      Ethernet 10G 10000    1,250  10000     1
    
      ISA            400       50     25    16    really old
      IEEE 1284 ECP    2.5      0.31   0.31  8    half duplex
      printer port
    
      V.90 56          0.056    0.005  0.056 1    modem, one way, full duplex
    
      OC-48          2,500                       optical cross country
      OC-192 STM64  10,000                       Optical Carrier
      OC-768 STM256 40,000  5,000    light
                  Mbps      MBps     MHz 
    
    The speed of light limits the amount of information that can be
    sent over a given distance. Many busses have length restrictions.
      Light can travel about
       300,000,000    meters per second
           300,000    meters per millisecond
               300    meters per microsecond
                 0.3  meters per nanosecond  (about 1 foot)
    
    The speed of light is unchanged in the last few decades. (Signals
    travel slower inside an integrated circuit.)
    
    Pentium 4 busses and PCI-X vs PCIe
    
    
    Note one example of AGP being replaced by PCI-e and the mention
    of many "busses" in the advertisement:
    
    
    
    
    
    SCSI and printer port, wave forms
    
    
    For HW12, read the directions carefully. Every bus is different.
    Example of HW12 solution method
    Now you can do HW 12
    
    

    Lecture 28, Multiprocessors

    Classic problems that require multiprocessors:
    
    
    
    
    
    Maxwell's Equations
    
    
    The numerical solution of Maxwell's Equations for electro-magnetic
    fields may use a large four dimensional array with dimensions
    X, Y, Z, T. Three spatial dimensions and time.
    Relaxation algorithms map well to a four dimensional array of
    parallel processors.
    
    A 4D 12,288 node supercomputer
    
    A multiprocessor may have distributed memory, shared memory or a
    combination of both.
    
    
    
    For the distributed memory and the shared memory multiprocessors,
    one possible connection, shown as a line above, is to use an
    omega network. The basic building block of an omega network is
    a switch with two inputs and two outputs. When a message arrives
    at this switch, the first bit is stripped off and the switch is
    set straight through if the bit is '0' on the top input or
    '1' on the bottom input, else cross connected. Note that if two
    messages arrive and the exclusive or of their first bits is not
    '1', both want the same output, so only one message can pass; the
    other is blocked.
    
    
    
    Omega networks for connecting two devices, four devices or
    eight devices, built from this switch, are shown below. The
    messages are sent with the most significant bit of the destination
    first.
    
    
    
    For 16 devices connected to the same or different 16 devices,
    the omega network is built from the primitive switch as:
    
    
    
    Note that connecting N devices requires (N/2) log_2(N) switches.
    Given a set of random connections of N devices to N devices
    with an omega network (mathematically, a permutation), then
    statistically about 1/2 N connections may be made simultaneously.
    
    
    Then, we can call a CPU-memory pair a node, reduce the drawing
    of a node to a dot, and show a few connection topologies
    for multiprocessors.
    
    
    
    "Ports" is the number of I/O ports the node must have.
    "Max path" is the maximum number of hops a message must take
    in order to get from one node to the farthest node. A message
    may be as small as a Boolean signal or as large as a big
    matrix.
    
    The actual interconnect technology for those lines between
    the nodes has great variety. The lowest cost is Gigabit Ethernet
    while the best performance is with Myrinet and Infiniband.
    
    
    
    
    Now, the change 6 years later, November 2012:

    Interconnect Top 500        Count  Share (%)

    Gigabit Ethernet             159    31.8
    Infiniband QDR               106    21.2
    Infiniband                    59    11.8
    Custom Interconnect           46     9.2
    Infiniband FDR                45     9.0
    10G Ethernet                  30     6.0
    Cray Gemini interconnect      15     3.0
    Proprietary                   11     2.2
    Infiniband DDR                 9     1.8
    Aries interconnect             4     0.8
    Infiniband DDR 4x              4     0.8
    XT4 Internal Interconnect      4     0.8
    Tofu interconnect              3     0.6
    Myrinet 10G                    3     0.6
    Infiniband QDR Sun M9          1     0.2
    (new: 100Gb/sec Ethernet, Mellanox 100G)
    
    One measure of a multiprocessor's communication capability is
    "bisection bandwidth". Optimally choose to split the processors
    into two equal groups and measure the maximum bandwidth that
    may be obtained between the groups.
    
    Many modern multiprocessors are "clusters." Each node has a CPU,
    RAM, hard drive and communication hardware. The CPU may be dual
    or quad core and each CPU is considered a processor that may be
    assigned tasks. There is no display, keyboard, sound or graphics.
    The physical form factor is often a "blade" about 2 inches thick,
    8 inches high and 12 inches deep with slide in connectors on the back.
    A blade may have multiple CPU chips each with multiple cores.
    40 or more blades may be on one rack. Upon power up, each blade
    loads its operating system and applications from its local disk.
    
    There is still a deficiency in some multiprocessor and multi core
    operating systems. The OS will move a running program from one
    CPU to another rather than leave a long running program and its
    cache contents on one processor. Communication between processes
    may actually go out through a communication port and back in through
    a communication processor, even when the processors are physically
    connected to the same RAM, rather than using memory to memory
    communication.
    
    Another classification of multiprocessors is:
    SISD Single Instruction Single Data (e.g. old computer)
    SIMD Single Instruction Multiple Data (e.g. MasPar, CELL, GPU)
    MIMD Multiple Instruction Multiple Data (e.g. cluster)
    
    GPU stands for graphics processing unit, e.g. your graphics
    card that may have as many as 500 cores. Some of these cards
    have full IEEE double precision floating point in every core.
    There may be groups of cores that are SIMD and thus a group
    may be MIMD. 
    
    There are three main problems with massively parallel multiprocessors:
    software, software and software.
    
    The operating systems are only marginally useful when a single
    program is to be run on a single data set using all the nodes
    and all the memory. Today, the OS is almost no help, and the
    programmer must plan and program each node and every data transfer
    between nodes.
    
    The programming languages are of little help. Java threads and
    Ada tasks are not guaranteed to run on individual processors.
    Posix threads are difficult to use and control.
    MPI and PVM libraries allow the programmer to specifically allocate
    tasks to nodes and control communication at the expense of significant
    programming effort.
    
    Then there are programming classifications:
    SPSD Single program Single Data (Conventional program)
    SPMD Single Program Multiple Data (One program with "if" on all processors) 
    MPMD Multiple Program Multiple Data (Each processor has a unique program)
    
    MPI Message Passing Interface is one of the SPMD toolkits that make
    programming distributed memory multiprocessors practical,
    yet still not easy.
    
    There is a single program that runs on all processors with the allowance
    for if-then-else code dependent on processor number. The processor
    number may also be used for index and other calculations.
    My CMSC 455 lecture on MPI
    
    For shared memory parallel programming, threads are used, with
    one thread typically assigned to each cpu.
    
    Only a small percentage of applications are in the class of
    "embarrassingly parallel". Most applications require significant
    design effort to obtain significant "speedup".
    
    
    Yes, Amdahl's law applies to multiprocessors.
    Given a multiprocessor with N nodes, the maximum speedup
    to be expected compared to a single processor of the same type
    as the node, is N. That would imply that 100% of the program
    could be made parallel.
    
    Given 32 processors and 50% of the program can be made fully parallel,
    25% of the program can use half the processors and the rest of the program
    must run sequentially, what is the speedup over one sequential processor?
    
    Time sequentially is 100%                                     100%
                             50%   25%   25%            speedup = ------ = 3.55
    Time multiprocessing is  --- + --- + --- = 28.125%            28.125%
                             32    16    1
    
    far from the theoretical maximum of 32!
    
    Note: "fully parallel" means the speedup factor is the number of processors.
          "half the processors" in this case is 32/2 = 16.
          the remaining 25% is sequential, thus factor = 1
    
    
    
    Given 32 processors and 99% of the program can be fully parallel,
    
    Time sequentially is 100%                              100%
                             99%   1%            speedup = ------ = 24.4
    Time multiprocessing is  --- + -- = 4.1%               4.1%
                             32    1
    
    about 3/4 the theoretical maximum of 32!
    
    
    These easy calculations are only considering processing time.
    In many programs there is significant communication time to
    get the data to the required node and get the results to
    the required node. A few programs may require more communication
    time than computation time.
    
    
    Consider a 1024 = 2^10 node multiprocessor.
    Add 1,048,576 = 2^20 numbers as fast as possible on this multiprocessor.
    Assume no communication cost (very unreasonable)
       step  action
          1  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
          2  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
        ...
    2^9=512  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    
             (so far fully parallel, now have only 2^19 numbers to add)
    
    2^9+1    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    2^9+2    add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
      ...
    2^9+2^8  add 2^10 numbers to 2^10 numbers getting 2^10 partial sums
    
             (so far fully parallel, now have only 2^18 numbers to add)
    
    see the progression:
    2^9 + 2^8 + 2^7 + ... 2^2 + 2^1 + 2^0 = 1023 time steps
             and we now have 2^10 partial sums, thus only 2^9 or 512
             processors can be used on the next step
    
    1024    add 2^9 numbers to 2^9 numbers getting 2^9 partial sums
            (using 1/2 the processors)
    1025    add 2^8 numbers to 2^8 numbers getting 2^8 partial sums
            (using 1/4 the processors)
     ...    
    1033    add 2^0=1 number to 2^0=1 number to get the final sum
            (using 1 processor)
    
                         sequential time   1,048,575
    Thus our speedup is  --------------- = ----------- = 1015
                         parallel time       1033
    
    The percent utilization is 1015/1024 * 100% = 99.12%
    
    Remember: Every program has a last, single, instruction to execute.
    Jack Dongarra, an expert in the field of multiprocessor programming
    says "It just gets worse as you add more processors."
    
    
    Top 500 multiprocessors:
    These have been, and are, evaluated by the Linpack Benchmark,
    heavy duty numerical computation. This benchmark is close to
    "embarrassingly parallel" and thus there is the start of a move
    to the Graph 500 Benchmark, which more fully measures the
    interconnection capacity of a highly parallel machine.
    Graph500
    
    Some history of the top500:
    www.top500.org/lists/2006/06
    www.top500.org/list/2007/11/100
    www.top500.org/lists/2008/11
    www.top500.org/list/2015/06
    Over 1 million cores, over 12 megawatts of power.
    exascale
    
    Gemini interconnect trying to solve the biggest problem
    
    Latest VA Tech Machine
    
    Test your dual core, quad core, 8 or 12 core machine to be sure
    your operating system is assigning threads to different cores.
    time_mp2.c
    time_mp4.c
    time_mp8.c
    time_mp12.c
    time_mp12_c.out
    
    Here is a graph of Amdahl speedup for increasing number of processors,
    for 50%, 75%, 90% and 95% parallel execution.
    As the curves flatten out, adding more processors or cores becomes useless.
    
    
    
    Tabular data
    
    

    Project part3a hints diff1.png diff2.png

    Lecture 29, Review

    
    Covered on web: Previous Final Exam and Answers
    
    Read over course WEB pages. (some have been updated)
    
    Work all homeworks. (some similar problems on exam)
    
    Do project at least through part2b. (some questions on exam)
    
    

    Lecture 30, Final Exam

    
      Open book, open note, download, edit, submit
      Do not guess, you can look up the answer.
      You may think you know the answer because you saw the
      question before. "no" or "not" may have been added or
      deleted. "some" and "all" are different.
      Numbers and names can change.
      My goal is to make you read carefully, so you do
      well in your first job.
      
      Edit by placing an  x  after  a)  b)  c)  that is your answer.
      OK to highlight answer.
      Only one answer per question!
      Edit with Microsoft Word on Windows, libreoffice on linux.gl
      
      Finish homework and projects.
      
      Students with email user name starting  a b c d e f g h i
      download and edit  final33a.doc
      download final33a.doc 
    
    
      Students with email user name starting  j k l m n o p q
      download and edit  final33b.doc
      download final33b.doc 
    
    
      Students with email user name starting  r s t u v w x y z
      download and edit  final33c.doc
      download final33c.doc 
    
      Follow instructions in exam, edit, then
      submit  cs411  final  final33?.doc 
    
      You can do the exam on linux.gl.umbc.edu in your directory
    
      cp /afs/umbc.edu/users/s/q/squire/pub/download/final33?.doc .
      libreoffice final33?.doc
      submit cs411 final final33?.doc
    
      rm final33?.doc only if over quota
      Everything due by Dec 12, 2020.
      Due date changed to Dec 9, 2020 for exams a, b, c; same exam.
      
      Before Exam:
      Review HW2, HW3, HW4 (VHDL) and HW5
      Review WEB Lecture's 14 through 29.
    
      There are  10  types of people:
        Those who know binary.
        Those who do not know binary.

      Bit numbers start with zero for the least significant bit.
      In most languages, the first index is zero.
      
      Teach your children to count in the computer age:
        zero
        one
        two
        three
        four
    
      Computer bits are numbered from the bottom
    
        0  0  1  0  1  = 5
        4  3  2  1  0    bit numbers (actually powers of 2)
      
    
    last updated Dec 9, 2020
