CMPE 310 Selected Lecture Notes

This is one big WEB page, used for printing

 These are not intended to be complete lecture notes.
 Complicated figures or tables or formulas are included here
 in case they were not clear or not copied correctly in class.
 Source code may be included in line or by a link.

 Lecture numbers correspond to the syllabus numbering.

Lecture 1 Introduction, Number Systems

Lecture 2 Getting and using NASM

Lecture 3 Registers, Syntax and sections

Lecture 4 Arithmetic and shifting

Lecture 5 Using Debugger

Lecture 6 Branching and loops

Lecture 7 Subroutines

Lecture 8 Boot programs and 16-bit

Lecture 9 Syscall and BIOS calls

Lecture 10 Hardware interface

Lecture 11 Privileged instructions

Lecture 12 Linux kernel calls

Lecture 13 Review

Lecture 14 Mid term exam

Lecture 15 Memory hardware organization

Lecture 16 Memory decoding and wiring

Lecture 17 Memory RAM, DRAM

Lecture 18 Memory DRAM, DDR, Flash

Lecture 19 Input Output wiring

Lecture 20 Input Output devices

Lecture 21 Input Output 3 more devices

Lecture 22 Hardware interrupts

Lecture 23 Disc Drum CD

Lecture 24 Busses

Lecture 25 Protected mode addressing

Lecture 26 Virtual memory paging hardware

Lecture 27 Arithmetic Logic Unit

Lecture 28 Architecture

Lecture 29 Review

Lecture 30 Final Exam

Lecture 1 Introduction


First look at a computer architecture



Intel block diagram

You should be familiar with programming.
You edit your source code and have it on the disc.
A compiler reads your source code and typically converts
high level language to assembly language as another file on the disc.
The assembler reads the assembly language and produces a
binary object file with machine instructions.
The linker reads object files and creates an executable image,
which the loader then copies into memory.

This course is to provide a basic understanding of how computers
operate internally, e.g. computer architecture and assembly language.
Technically: The computer does not run a "program", the computer
has an operating system that runs a "process". A process starts
with loading the executable image of a program in memory.
A process sets up "segments" of memory with:
A ".text"   segment with computer instructions
A ".data"   segment with initialized data
A ".rodata" segment with initialized data, read only
A ".bss"    segment for uninitialized variables and arrays
A "stack"   for pushing and popping values
A "heap"    for dynamically getting more memory
And then the process is executed by having the program
counter (the instruction pointer register) set to the first
executable instruction in the process. You will be directly using segments in
your assembly language programs.

Computers store bits, binary digits, in memory and we usually
read the bits, four at a time as hexadecimal. The basic unit
of storage in the computer is two hex digits, eight bits, a byte.
The data may be integers, floating point or characters.
We start this course with a thorough understanding of numbers.

Numbers are represented as the coefficients of powers of a base.
(in plain text, we use "^" to mean, raise to power or exponentiation)

With no extra base indication, expect decimal numbers:

         12.34   is a representation of

  1*10^1 + 2*10^0 + 3*10^-1 + 4*10^-2  or

     10
      2
       .3
    +  .04
    ------
     12.34


Binary numbers, in NASM assembly language, have a trailing B or b.

     101.11B  is a representation of

  1*2^2 + 0*2^1 + 1*2^0 + 1*2^-1 + 1*2^-2   or

     4
     0
     1
      .5        (you may compute 2^-n or look up in table below)
   +  .25
   ------
     5.75

Converting a decimal number to binary may be accomplished:

   Convert  12.34  from decimal to binary

   Integer part                      Fraction part
        quotient remainder                integer fraction
   12/2 =   6       0              .34*2 =      0.68
    6/2 =   3       0              .68*2 =      1.36
    3/2 =   1       1              .36*2 =      0.72
    1/2 =   0       1              .72*2 =      1.44
    done                           .44*2 =      0.88
    read up  1100                  .88*2 =      1.76
                                   .76*2 =      1.52
                                   .52*2 =      1.04
                                   quit
                                   read down   .01010111
    answer is  1100.01010111   (fraction truncated after 8 bits)
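
The two columns of the worked example can be sketched in Python.
This is only an illustrative sketch, not course code: the name
to_binary and the parameter frac_bits are assumptions made here.
Fraction is used so that .34 is held exactly while it is doubled.

```python
from fractions import Fraction

def to_binary(text, frac_bits=8):
    """Convert a non-negative decimal string such as "12.34" to binary."""
    value = Fraction(text)
    integer = int(value)            # integer part, e.g. 12
    fraction = value - integer      # fraction part, e.g. 34/100

    # Integer part: divide by 2, keep each remainder, read them upward.
    int_digits = []
    while integer > 0:
        integer, remainder = divmod(integer, 2)
        int_digits.append(str(remainder))
    int_part = "".join(reversed(int_digits)) or "0"

    # Fraction part: multiply by 2, keep each integer digit, read downward.
    frac_digits = []
    for _ in range(frac_bits):
        fraction *= 2
        bit = int(fraction)
        frac_digits.append(str(bit))
        fraction -= bit
    return int_part + "." + "".join(frac_digits)

print(to_binary("12.34"))   # 1100.01010111
```

Raising frac_bits produces more fraction digits; the fraction .34
never terminates in binary, so some cutoff is always needed.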


  Powers of 2
                   Decimal
                 n         -n
                2    n    2
                 1   0   1.0 
                 2   1   0.5 
                 4   2   0.25 
                 8   3   0.125 
                16   4   0.0625 
                32   5   0.03125 
                64   6   0.015625 
               128   7   0.0078125 
               256   8   0.00390625
               512   9   0.001953125
              1024  10   0.0009765625 
              2048  11   0.00048828125 
              4096  12   0.000244140625 
              8192  13   0.0001220703125 
             16384  14   0.00006103515625 
             32768  15   0.000030517578125 
             65536  16   0.0000152587890625 

For binary to decimal:

   2^3  2^2  2^1  2^0  2^-1  2^-2  2^-3
    1    1    1    1 .  1     1     1

    8 +  4 +  2 +  1 + .5 +  .25 + .125 = 15.875
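
The same sum of weighted digits can be sketched in Python. The
function name from_binary is an assumption made for this example.

```python
def from_binary(text):
    """Evaluate a binary string such as "101.11" as a decimal value."""
    int_part, _, frac_part = text.partition(".")
    total = 0.0
    # Digits left of the point have weights 2^0, 2^1, ... moving left.
    for power, digit in enumerate(reversed(int_part)):
        total += int(digit) * 2 ** power
    # Digits right of the point have weights 2^-1, 2^-2, ... moving right.
    for power, digit in enumerate(frac_part, start=1):
        total += int(digit) * 2 ** -power
    return total

print(from_binary("1111.111"))  # 15.875
print(from_binary("101.11"))    # 5.75
```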

 
                   Binary
                 n         -n
                2    n    2
                 1   0   1.0 
                10   1   0.1
               100   2   0.01 
              1000   3   0.001 
             10000   4   0.0001 
            100000   5   0.00001 
           1000000   6   0.000001 
          10000000   7   0.0000001 
         100000000   8   0.00000001
        1000000000   9   0.000000001
       10000000000  10   0.0000000001 
      100000000000  11   0.00000000001 
     1000000000000  12   0.000000000001 
    10000000000000  13   0.0000000000001 
   100000000000000  14   0.00000000000001 
  1000000000000000  15   0.000000000000001 
 10000000000000000  16   0.0000000000000001 


                  Hexadecimal
                 n         -n
                2    n    2
                 1   0   1.0 
                 2   1   0.8
                 4   2   0.4 
                 8   3   0.2 
                10   4   0.1 
                20   5   0.08 
                40   6   0.04 
                80   7   0.02 
               100   8   0.01
               200   9   0.008
               400  10   0.004 
               800  11   0.002 
              1000  12   0.001 
              2000  13   0.0008 
              4000  14   0.0004 
              8000  15   0.0002 
             10000  16   0.0001 

Decimal to Hexadecimal to Binary, 4 bits per hex digit
   0         0            0000
   1         1            0001
   2         2            0010
   3         3            0011
   4         4            0100
   5         5            0101
   6         6            0110
   7         7            0111
   8         8            1000
   9         9            1001
  10         A            1010
  11         B            1011
  12         C            1100
  13         D            1101
  14         E            1110
  15         F            1111
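
Because each hex digit stands for exactly four bits, converting
between hexadecimal and binary is a digit by digit lookup. A small
Python sketch (the function names are assumptions for illustration):

```python
def hex_to_binary(hex_text):
    """Expand each hexadecimal digit into its 4-bit binary group."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_text)

def binary_to_hex(bit_text):
    """Collapse a bit string (length a multiple of 4) into hex digits."""
    bits = bit_text.replace(" ", "")
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(group, 2), "X") for group in groups)

print(hex_to_binary("3F80"))                 # 0011 1111 1000 0000
print(binary_to_hex("1010 1011 1100 1101"))  # ABCD
```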
             
        n                       n
    n  2  hexadecimal          2  decimal  approx  notation
   10             400               1,024   10^3   K kilo
   20          100000           1,048,576   10^6   M mega
   30        40000000       1,073,741,824   10^9   G giga
   40     10000000000   1,099,511,627,776   10^12  T tera

The three representations of negative numbers that have been
used in computers are  twos complement,  ones complement  and
sign magnitude. In order to represent negative numbers, it must
be known where the "sign" bit is placed. All modern binary
computers use the leftmost bit of the computer word as a sign bit.

The examples below use a 4-bit register to show all possible
values for the three representations.

 decimal   twos complement  ones complement  sign magnitude
       0      0000            0000             0000
       1      0001            0001             0001
       2      0010            0010             0010
       3      0011            0011             0011
       4      0100            0100             0100
       5      0101            0101             0101
       6      0110            0110             0110
       7      0111            0111             0111
      -7      1001            1000             1111
      -6      1010            1001             1110
      -5      1011            1010             1101
      -4      1100            1011             1100
      -3      1101            1100             1011
      -2      1110            1101             1010
      -1      1111            1110             1001
          -8  1000        -0  1111         -0  1000
                  ^           /                ^||| 
                   \_ add 1 _/          sign__/ --- magnitude

To get the sign magnitude, convert the magnitude of the decimal
number to binary and place a zero in the sign bit for positive,
place a one in the sign bit for negative.

To get the ones complement of a negative number, convert its
magnitude to binary, including leading zeros, then invert every
bit. 1->0, 0->1.

To get the twos complement of a negative number, get the ones
complement and add 1.
(Throw away any bits that are outside of the register.)

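It may seem silly to have a negative zero, but twos complement
pays a price for avoiding it: the 4-bit register holds -8 but
not +8, so negating -8 gives -8 again, and it is
mathematically incorrect to have -(-8) = -8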
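
The three rules can be sketched in Python for the same 4-bit
register. This is an illustrative sketch; BITS, MASK and the
function names are assumptions made for this example.

```python
BITS = 4
MASK = (1 << BITS) - 1          # 0b1111, keeps only the register's bits

def sign_magnitude(n):
    """Magnitude in the low bits, a one in the leftmost bit if negative."""
    sign = 1 << (BITS - 1) if n < 0 else 0
    return sign | abs(n)

def ones_complement(n):
    """For a negative number, invert every bit of the magnitude."""
    return abs(n) ^ MASK if n < 0 else n

def twos_complement(n):
    """For a negative number, take the ones complement and add 1."""
    return (ones_complement(n) + 1) & MASK if n < 0 else n

def negate_twos(pattern):
    """Negate a twos complement bit pattern: invert, add 1, drop the carry."""
    return ((pattern ^ MASK) + 1) & MASK

for n in range(-7, 8):
    print(f"{n:3d}  {twos_complement(n):04b}  "
          f"{ones_complement(n):04b}  {sign_magnitude(n):04b}")

print(f"{negate_twos(0b1000):04b}")   # 1000, negating -8 gives -8 again
```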





IEEE Floating point formats
Almost all Numerical Computation arithmetic is performed using
IEEE 754-1985 Standard for Binary Floating-Point Arithmetic.
The two formats that we deal with in practice are the 32 bit and
64 bit formats.

IEEE Floating-Point numbers are stored as follows:
The single format 32 bit has
    1 bit for sign,  8 bits for exponent, 23 bits for fraction
The double format 64 bit has
    1 bit for sign, 11 bits for exponent, 52 bits for fraction

There is actually a '1' in the 24th and 53rd bit to the left
of the fraction that is not stored. The fraction including
the non stored bit is called a significand.

The exponent is stored as a biased value, not a signed value.
The 8-bit has 127 added, the 11-bit has 1023 added.
A few values of the exponent are "stolen" for
special values, +/- infinity, not a number, etc.

Floating point numbers are sign magnitude. Invert the sign bit to negate.

Some example numbers and their bit patterns:

   decimal
stored hexadecimal sign exponent  fraction                 significand 
                   bit                                     in binary
                                 The "1" is not stored 
                                 |                                   biased    
                    31  30....23  22....................0            exponent
   1.0
3F 80 00 00          0  01111111  00000000000000000000000  1.0   * 2^(127-127) 

   0.5
3F 00 00 00          0  01111110  00000000000000000000000  1.0   * 2^(126-127)

   0.75
3F 40 00 00          0  01111110  10000000000000000000000  1.1   * 2^(126-127)

   0.99999994
3F 7F FF FF          0  01111110  11111111111111111111111  1.1111* 2^(126-127)

   0.1
3D CC CC CD          0  01111011  10011001100110011001101  1.1001* 2^(123-127)
 

                          63  62...... 52  51 .....  0
   1.0
3F F0 00 00 00 00 00 00    0  01111111111  000 ... 000  1.0    * 2^(1023-1023)

   0.5
3F E0 00 00 00 00 00 00    0  01111111110  000 ... 000  1.0    * 2^(1022-1023)

   0.75
3F E8 00 00 00 00 00 00    0  01111111110  100 ... 000  1.1    * 2^(1022-1023)

   0.9999999999999999
3F EF FF FF FF FF FF FF    0  01111111110  111 ...      1.11111* 2^(1022-1023)

   0.1
3F B9 99 99 99 99 99 9A    0  01111111011  10011..1010  1.10011* 2^(1019-1023)
                                                                           |
                        sign   exponent      fraction                      |
                                  stored exponent minus bias = true exponent

Note that in the 32-bit format an integer in the range 0 to 2^24 may be
represented exactly, because the significand has 24 bits. Scaling such an
integer by a power of two also gives an exact value, as long as the result
stays within the exponent range. Numbers such as 0.1, 0.3, 1.0/5.0, 1.0/9.0 are
represented approximately. 0.75 is 3/4 which is exact.
Some languages are careful to represent approximated numbers
accurate to plus or minus the least significant bit.
Other languages may be less accurate.
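
The 32-bit rows in the table of bit patterns can be checked with a
short Python sketch that uses the standard struct module to get the
stored bits. The function name decode_single is an assumption made
for this example.

```python
import struct

def decode_single(x):
    """Round x to IEEE 32-bit format and split the bits into fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                    # bit 31
    exponent = (bits >> 23) & 0xFF       # bits 30..23, stored with bias 127
    fraction = bits & 0x7FFFFF           # bits 22..0, leading 1 not stored
    return bits, sign, exponent, fraction

for value in (1.0, 0.5, 0.75, 0.1):
    bits, sign, exponent, fraction = decode_single(value)
    print(f"{value:<5} {bits:08X}  {sign}  {exponent:08b}  {fraction:023b}"
          f"  * 2^({exponent}-127)")
```

Changing the pack and unpack codes to ">d" and ">Q", the shifts to
63 and 52, and the masks to 0x7FF and 52 one bits gives the 64-bit
fields in the same way.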

The operations of add, subtract, multiply and divide are defined as:

  Given   x1 = 2^e1 * f1
          x2 = 2^e2 * f2  and e2 <= e1

  x1 + x2 = 2^e1 *(f1 + 2^-(e1-e2) * f2)  f2 is shifted then added to f1

  x1 - x2 = 2^e1 *(f1 - 2^-(e1-e2) * f2)  f2 is shifted then subtracted from f1

  x1 * x2 = 2^(e1+e2) * f1 * f2

  x1 / x2 = 2^(e1-e2) * (f1 / f2)

  an additional operation is usually needed, normalization.
  if the resulting "fraction" has digits to the left of the binary
  point, then the fraction is shifted right and one is added to
  the exponent for each bit shifted until the result is a fraction.
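
A minimal Python sketch of the add and multiply rules, assuming
math.frexp is used to split a number into f and e. Note the
assumption: frexp returns a fraction f with 0.5 <= |f| < 1, which
differs from the stored IEEE significand 1.xxx by one bit position.
The function names are made up for this example.

```python
import math

def add_by_parts(x1, x2):
    """Return (f, e) with x1 + x2 = f * 2^e, following the rule above."""
    f1, e1 = math.frexp(x1)
    f2, e2 = math.frexp(x2)
    if e2 > e1:                          # arrange for e2 <= e1
        f1, e1, f2, e2 = f2, e2, f1, e1
    f = f1 + f2 * 2.0 ** -(e1 - e2)      # f2 is shifted, then added to f1
    e = e1
    while abs(f) >= 1.0:                 # normalization: shift right,
        f /= 2.0                         # add one to the exponent
        e += 1
    return f, e

def multiply_by_parts(x1, x2):
    """Return (f, e) with x1 * x2 = f * 2^e: multiply f, add exponents."""
    f1, e1 = math.frexp(x1)
    f2, e2 = math.frexp(x2)
    return f1 * f2, e1 + e2

f, e = add_by_parts(0.75, 0.5)
print(f, e, math.ldexp(f, e))            # 0.625 1 1.25
```

math.ldexp(f, e) puts the parts back together as f * 2^e. A result
of a subtraction or multiply may also need shifting left, which this
sketch leaves out.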

IEEE 754 Floating Point Standard

Strings of characters
We will use one of many character representations for
character strings, ASCII, one byte per character in a string.

symbol or name                            symbol or key stroke
    key stroke
       hexadecimal                            hexadecimal
          decimal                                 decimal
  
NUL ^@ 00   0   Spc 20  32   @   40  64   `   60  96
SOH ^A 01   1   !   21  33   A   41  65   a   61  97
STX ^B 02   2   "   22  34   B   42  66   b   62  98
ETX ^C 03   3   #   23  35   C   43  67   c   63  99
EOT ^D 04   4   $   24  36   D   44  68   d   64  100
ENQ ^E 05   5   %   25  37   E   45  69   e   65  101
ACK ^F 06   6   &   26  38   F   46  70   f   66  102
BEL ^G 07   7   '   27  39   G   47  71   g   67  103
BS  ^H 08   8   (   28  40   H   48  72   h   68  104
TAB ^I 09   9   )   29  41   I   49  73   i   69  105
LF  ^J 0A  10   *   2A  42   J   4A  74   j   6A  106
VT  ^K 0B  11   +   2B  43   K   4B  75   k   6B  107
FF  ^L 0C  12   ,   2C  44   L   4C  76   l   6C  108
CR  ^M 0D  13   -   2D  45   M   4D  77   m   6D  109
SO  ^N 0E  14   .   2E  46   N   4E  78   n   6E  110
SI  ^O 0F  15   /   2F  47   O   4F  79   o   6F  111
DLE ^P 10  16   0   30  48   P   50  80   p   70  112
DC1 ^Q 11  17   1   31  49   Q   51  81   q   71  113
DC2 ^R 12  18   2   32  50   R   52  82   r   72  114
DC3 ^S 13  19   3   33  51   S   53  83   s   73  115
DC4 ^T 14  20   4   34  52   T   54  84   t   74  116
NAK ^U 15  21   5   35  53   U   55  85   u   75  117
SYN ^V 16  22   6   36  54   V   56  86   v   76  118
ETB ^W 17  23   7   37  55   W   57  87   w   77  119
CAN ^X 18  24   8   38  56   X   58  88   x   78  120
EM  ^Y 19  25   9   39  57   Y   59  89   y   79  121
SUB ^Z 1A  26   :   3A  58   Z   5A  90   z   7A  122
ESC ^[ 1B  27   ;   3B  59   [   5B  91   {   7B  123
FS  ^\ 1C  28   <   3C  60   \   5C  92   |   7C  124
GS  ^] 1D  29   =   3D  61   ]   5D  93   }   7D  125
RS  ^^ 1E  30   >   3E  62   ^   5E  94   ~   7E  126
US  ^_ 1F  31   ?   3F  63   _   5F  95   DEL 7F  127
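
A few patterns in the table are worth noticing: upper and lower case
letters differ only in bit 20 hex, and the digit characters are
their value plus 30 hex. A small Python sketch (function names are
assumptions for illustration):

```python
def hex_bytes(text):
    """Show an ASCII string as the hexadecimal bytes stored in memory."""
    return " ".join(f"{code:02X}" for code in text.encode("ascii"))

def to_upper(text):
    """Upper-case ASCII letters by clearing bit 0x20 (a=61 hex, A=41 hex)."""
    out = bytearray()
    for code in text.encode("ascii"):
        if ord("a") <= code <= ord("z"):
            code &= 0xDF                 # clear bit 0x20
        out.append(code)
    return out.decode("ascii")

def digit_value(ch):
    """Convert one digit character '0'..'9' to its numeric value."""
    return ord(ch) - 0x30

print(hex_bytes("hello"))   # 68 65 6C 6C 6F
print(to_upper("hello"))    # HELLO
print(digit_value("7"))     # 7
```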



Optional future installation on your personal computer
Throughout this course, we will be writing some assembly language.
This will be for an Intel or Intel compatible computer, e.g. AMD.
The assembler program is "nasm" and can be run on
linux.gl.umbc.edu or on your computer.
If you are running linux on your computer, the command

sudo apt-get install nasm

will install nasm on your computer.

Throughout this course we will work with digital logic and
cover the basics of the VHDL and Verilog languages for describing
digital logic. There are free simulators for both languages
that will simulate the operation of your digital logic, and
a graphical input simulator, logisim.
The commands for installing these on linux are:

sudo apt-get install freehdl
or use Makefile_vhdl from my download directory on linux.gl.umbc.edu

sudo apt-get install iverilog
or use Makefile_verilog from my download directory on linux.gl.umbc.edu

from  www.cburch.com/logisim/index.html   get logisim 
or use Makefile_logisim from my download directory on linux.gl.umbc.edu

These or similar programs may be available for installing
on some versions of Microsoft Windows or Mac OSX.

We will use 64-bit in this course, to expand your options.
In "C", int typically remains a 32-bit number, even on 64-bit computers
running 64-bit operating systems, and some 64-bit computers are still
programmed as 32-bit computers.
test_factorial.c uses int, outputs:
test_factorial.c using int, note overflow
 0!=1 
 1!=1 
 2!=2 
 3!=6 
 4!=24 
 5!=120 
 6!=720 
 7!=5040 
 8!=40320 
 9!=362880 
10!=3628800 
11!=39916800 
12!=479001600 
13!=1932053504 
14!=1278945280 
15!=2004310016 
16!=2004189184 
17!=-288522240 
18!=-898433024 
test_factorial_long.c uses long int, outputs:
test_factorial_long.c using long int, note overflow
 0!=1 
 1!=1 
 2!=2 
 3!=6 
 4!=24 
 5!=120 
 6!=720 
 7!=5040 
 8!=40320 
 9!=362880 
10!=3628800 
11!=39916800 
12!=479001600 
13!=6227020800 
14!=87178291200 
15!=1307674368000 
16!=20922789888000 
17!=355687428096000 
18!=6402373705728000 
19!=121645100408832000 
20!=2432902008176640000 
21!=-4249290049419214848 
22!=-1250660718674968576 

Well, 13! wrong vs 21! wrong may not be a big deal.

factorial.py by default, outputs:
factorial(0)= 1
factorial(1)= 1
factorial(2)= 2
factorial(3)= 6
factorial(4)= 24
factorial(52)= 80658175170943878571660636856403766975289505440883277824000000000000
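
The wrong values above can be reproduced in Python by keeping only
the low 32 or 64 bits after every multiply and reading them as a
twos complement number. This is a sketch under that wraparound
assumption (the C standard itself leaves signed overflow undefined);
the function names are made up for this example.

```python
def wrap_signed(value, bits):
    """Keep the low `bits` bits and read them as a twos complement number."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):         # sign bit set, value is negative
        value -= 1 << bits
    return value

def factorial_wrapped(n, bits):
    """n! computed in a register of the given width, wrapping on overflow."""
    result = 1
    for k in range(2, n + 1):
        result = wrap_signed(result * k, bits)
    return result

for n in (12, 13, 17, 20, 21):
    print(f"{n:2d}!  32-bit {factorial_wrapped(n, 32)}"
          f"  64-bit {factorial_wrapped(n, 64)}")
```

Python integers grow as needed, which is why factorial.py can print
52! exactly.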

Yet, 32-bit signed numbers can only index 2GB of RAM, 64-bit addresses are
needed for computers with 4GB, 8GB, 16GB, 32GB etc of RAM, available today.
95% of all supercomputers, called HPC, are 64-bit running Linux.

First homework assigned
on web, www.cs.umbc.edu/~squire/cmpe310_hw.shtml
Due in one week. Best to do right after lecture.

Lecture 2 Getting and using NASM

A 64-bit architecture, by definition, has 64-bit integer registers.
Here are sample programs and output to test for 64-bit capability in gcc:
Get sizeof on types and variables big.c
output from  gcc -m64 big.c  big.out
malloc more than 4GB  big_malloc.c
output from  big_malloc_mac.out
Newer Operating Systems and compilers
Get sizeof on types and variables big12.c
output from  gcc big12.c  big12.out
To bring everyone into the 64-bit world, we will use all 64-bit programs.

A note about the Intel computer architecture and the
tradeoff between "upward compatibility" and "clinging to the past".
Intel built a  4-bit computer 4004.
Intel built an 8-bit, byte, computer 8008
Intel built a 16-bit, word, computer 8086
Intel built a 32-bit, double word, computer 80386
Intel builds  64-bit, quad word, computers X86-64
The terms byte, word, double word, and quad word remain today
in the software we will write for modern 64-bit computers.

Learning a new programming language is an orderly progression
of steps.
1) Find sample code that you can compile and run to get output
   This is typically  hello_world  or just hello
   Then more output, we use  "C" printf  printf1.asm  printf2.asm
2) Find sample code for defining data of the types supported
   We use  testdata.asm for Nasm assembly language
3) Find sample code to do integer arithmetic
   We use  intarith.asm
4) Find sample code to do floating point arithmetic
   We use  fltarith.asm
5) Find sample code to write a function and call a function
   We use  fib.asm or test_factorial.asm
Along the way, you will see the structure of typical code,
and initialization and termination of typical programs,
creating "if" and "loop" constructs. Then you can
"cut-and-paste" existing code, modify for your program.

Computer access for this course
NASM is installed on  linux.gl.umbc.edu  and can be used there.

From anywhere that you can reach the internet, log onto your
UMBC account using:

    ssh  your-user-id@linux.gl.umbc.edu
    your-password


You should set up a directory for CMPE 310 and keep all your
course work in one directory.

  e.g.    mkdir  cmpe310  # only once
          cd  cmpe310   # change directory each time for CMPE 310

Copy over a sample program to your directory using:

 cp /afs/umbc.edu/users/s/q/squire/pub/download/hello.asm  .


Assemble hello.asm using:

  nasm -f elf64 hello.asm

Link to create an executable using:

  gcc -m64 -o hello  hello.o

Execute the program using:

  ./hello     (plain  hello  works only if  .  is in your PATH)

Assembly Language
Assembly Language is written as lines, rather than statements.
A semicolon makes the rest of a line a comment.
A line may be blank, a comment, a machine instruction or
an assembler directive called a pseudo instruction.
An optional label, followed by a colon, may start a line.
An assembly language program can run on a bare computer,
can run directly on an operating system, or can run using
a compiler and associated libraries. We will use a C compiler
and libraries for convenience.

A big difference between assembly language and compiler code
is that a label for a variable in assembly language is an address
while a name of a variable in compiler code is the value.

Assembly language programmers are very frugal.
They typically minimize storage space and time.
e.g. the instructions  xor rax,rax  and  mov rax,0  do the same
thing, zero register rax, yet the xor has a shorter encoding and is a little faster.
e.g. many variables are never stored in RAM, they keep values in registers.
I will avoid many of these "tricks" for a while.
(Files are available as hello.asm and hello_64.asm,
 the  _64  is to emphasize we will use all 64-bit values and registers.
 Usually there is also a C language file, e.g. hello.c )

First example  hello.asm
Now look at the file hello.asm

; hello.asm       print a string using printf
; Assemble:	  nasm -f elf64 -l hello.lst  hello.asm
; Link:		  gcc -m64 -o hello  hello.o
; Run:		  ./hello > hello.out
; Output:	  cat hello.out

; Equivalent C code
; // hello.c
; #include <stdio.h>
; int main()
; {
;   char msg[] = "Hello world";
;   printf("%s\n",msg);
;   return 0;
; }
	
; Declare needed C  functions
        extern	printf		; the C function, to be called

        section .data		; Data section, initialized variables
msg:	db "Hello world", 0	; C string needs 0
fmt:    db "%s", 10, 0          ; The printf format "%s\n", then the terminating 0

        section .text           ; Code section.

        global main		; the standard gcc entry point
main:				; the program label for the entry point
        push    rbp		; set up stack frame, must be aligned
	
	mov	rdi,fmt		; pass format
	mov	rsi,msg		; pass first parameter
	mov	rax,0		; or can be  xor  rax,rax
        call    printf		; Call C function

        pop     rbp		; restore stack 

	mov	rax,0		; normal, no error, return value
	ret			; return


Makefile_nasm	
Now, to save yourself typing, download  Makefile_nasm into
your cmpe310 directory. There will be more sample files to download.

cp /afs/umbc.edu/users/s/q/squire/pub/download/Makefile_nasm  Makefile
make  # look in Makefile to see how to add more files to run

Type  make  # to run Makefile, only changed stuff gets run
Type  make -f Makefile_nasm  # only changed stuff gets run


Variable Data and Storage allocation, sections 
There can be many types of data in the  ".data" section:
Look at the file testdata.asm
and see the results in testdata.lst

; testdata.asm  a program to demonstrate data types and values
; assemble:	nasm -f elf64 -l testdata.lst  testdata.asm
; link:		gcc -m64 -o testdata  testdata.o
; run:	        ./testdata
; Look at the list file, testdata.lst
; no output
; Note! nasm ignores the type of data and type of reserved
; space when used as memory addresses.
; You may have to use qualifiers BYTE, WORD, DWORD or QWORD

	section .data		; data section
				; initialized, writeable

				; db for data byte, 8-bit 
db01:	db	255,1,17	; decimal values for bytes
db02:	db	0xff,0ABh	; hexadecimal values for bytes
db03:	db	'a','b','c'	; character values for bytes
db04:	db	"abc"		; string value as bytes 'a','b','c'
db05:	db	'abc'		; same as "abc" three bytes
db06:	db	"hello",13,10,0 ; "C" string including cr and lf

				; dw for data word, 16-bit
dw01:	dw	12345,-17,32	; decimal values for words
dw02:	dw	0xFFFF,0abcdH	; hexadecimal values for words
dw03:	dw	'a','ab','abc'	; character values for words
dw04:	dw	"hello"		; three words, 6-bytes allocated

				; dd for data double word, 32-bit
dd01:	dd	123456789,-7	; decimal values for double words
dd02:	dd	0xFFFFFFFF	; hexadecimal value for double words
dd03:	dd	'a'		; character value in double word
dd04:	dd	"hello"		; string in two double words
dd05:	dd	13.27E30	; floating point value 32-bit IEEE

				; dq for data quad word, 64-bit
dq01:	dq	123456789012,-7	; decimal values for quad words
dq02:	dq	0xFFFFFFFFFFFFFFFF ; hexadecimal value for quad words
dq03:	dq	'a'		; character value in quad word
dq04:	dq	"hello_world"	; string in two quad words
dq05:	dq	13.27E300	; floating point value 64-bit IEEE

				; dt for data ten of 80-bit floating point
dt01:	dt	13.270E3000	; floating point value 80-bit in register


	section .bss		; reserve storage space
				; uninitialized, writeable
	
s01:	resb	10		; 10 8-bit bytes reserved
s02:	resw	20		; 20 16-bit words reserved
s03:	resd	30		; 30 32-bit double words reserved
s04:	resq	40		; 40 64-bit quad words reserved
s05:	resb	1		; one more byte
	
	SECTION .text		; code section
        global main		; make label available to linker 
main:				; standard  gcc  entry point

	push	rbp		; initialize stack
	
	mov	al,[db01]	; correct to load a byte
	mov	ah,[db01]	; correct to load a byte
	mov	ax,[dw01]	; correct to load a word
	mov	eax,[dd01]	; correct to load a double word
	mov	rax,[dq01]	; correct to load a quad word

	mov	al,BYTE [db01]	; redundant, yet allowed
	mov	ax,[db01]	; no warning, loads two bytes
	mov	eax,[dw01]	; no warning, loads two words
	mov	rax,[dd01]	; no warning, loads two double words

;	mov	ax,BYTE [db01]	; error, size mismatch
;	mov	eax,WORD [dw01]	; error, size mismatch
;	mov	rax,WORD [dd01]	; error, size mismatch

;	push	BYTE [db01]	; error, can not push a byte
	push	WORD [dw01]	; "push" needs to know size 2-byte
;	push	DWORD [dd01]	; error, can not push a 4-byte
	push	QWORD [dq01]	; correct in 64-bit mode, pushes a quad word

;	push	eax		; error, wrong size, need 64-bit
	push	rax
	
	fld	DWORD [dd05]	; floating load 32-bit
	fld	QWORD [dq05]	; floating load 64-bit
		
	mov	rbx,0		; exit code, 0=normal
	mov	rax,1		; exit command to kernel
	int	0x80		; interrupt 80 hex, call kernel

; end testdata.asm


Part of testdata.lst, showing addresses and data values in hexadecimal.
The lines are wide.

     1                                  ; testdata.asm  a program to demonstrate data types and values
     2                                  ; assemble:	nasm -f elf64 -l testdata.lst  testdata.asm
     3                                  ; link:		gcc -m64 -o testdata  testdata.o
     4                                  ; run:	        ./testdata
     5                                  ; Look at the list file, testdata.lst
     6                                  ; no output
     7                                  ; Note! nasm ignores the type of data and type of reserved
     8                                  ; space when used as memory addresses.
     9                                  ; You may have to use qualifiers BYTE, WORD, DWORD or QWORD
    10                                  
    11                                  	section .data		; data section
    12                                  				; initialized, writeable
    13                                  
    14                                  				; db for data byte, 8-bit 
    15 00000000 FF0111                  db01:	db	255,1,17	; decimal values for bytes
    16 00000003 FFAB                    db02:	db	0xff,0ABh	; hexadecimal values for bytes
    17 00000005 616263                  db03:	db	'a','b','c'	; character values for bytes
    18 00000008 616263                  db04:	db	"abc"		; string value as bytes 'a','b','c'
    19 0000000B 616263                  db05:	db	'abc'		; same as "abc" three bytes
    20 0000000E 68656C6C6F0D0A00        db06:	db	"hello",13,10,0 ; "C" string including cr and lf
    21                                  
    22                                  				; dw for data word, 16-bit
    23 00000016 3930EFFF2000            dw01:	dw	12345,-17,32	; decimal values for words
    24 0000001C FFFFCDAB                dw02:	dw	0xFFFF,0abcdH	; hexadecimal values for words
    25 00000020 6100616261626300        dw03:	dw	'a','ab','abc'	; character values for words
    26 00000028 68656C6C6F00            dw04:	dw	"hello"		; three words, 6-bytes allocated
    27                                  
    28                                  				; dd for data double word, 32-bit
    29 0000002E 15CD5B07F9FFFFFF        dd01:	dd	123456789,-7	; decimal values for double words
    30 00000036 FFFFFFFF                dd02:	dd	0xFFFFFFFF	; hexadecimal value for double words
    31 0000003A 61000000                dd03:	dd	'a'		; character value in double word
    32 0000003E 68656C6C6F000000        dd04:	dd	"hello"		; string in two double words
    33 00000046 AF7D2773                dd05:	dd	13.27E30	; floating point value 32-bit IEEE
    34                                  
    35                                  				; dq for data quad word, 64-bit
    36 0000004A 141A99BE1C000000F9-     dq01:	dq	123456789012,-7	; decimal values for quad words
    37 00000053 FFFFFFFFFFFFFF     
    38 0000005A FFFFFFFFFFFFFFFF        dq02:	dq	0xFFFFFFFFFFFFFFFF ; hexadecimal value for quad words
    39 00000062 6100000000000000        dq03:	dq	'a'		; character value in quad word
    40 0000006A 68656C6C6F5F776F72-     dq04:	dq	"hello_world"	; string in two quad words
    41 00000073 6C640000000000     
    42 0000007A C86BB752A7D0737E        dq05:	dq	13.27E300	; floating point value 64-bit IEEE
    43                                  
    44                                  				; dt for data ten of 80-bit floating point
    45 00000082 4011E5A59932D5B6F0-     dt01:	dt	13.270E3000	; floating point value 80-bit in register
    46 0000008B 66                 
    47                                  
    48                                  
    49                                  	section .bss		; reserve storage space
    50                                  				; uninitialized, writeable
    51                                  	
    52 00000000           s01:	resb	10		; 10 8-bit bytes reserved
    53 0000000A           s02:	resw	20		; 20 16-bit words reserved
    54 00000032           s03:	resd	30		; 30 32-bit double words reserved
    55 000000AA           s04:	resq	40		; 40 64-bit quad words reserved
    56 000001EA           s05:	resb	1		; one more byte
    57                                  	
    58                                  	SECTION .text		; code section
    59                                          global main		; make label available to linker 
    60                                  main:				; standard  gcc  entry point
    61                                  
    62 00000000 55                      	push	rbp		; initialize stack
    63                                  	
    64 00000001 8A0425[00000000]        	mov	al,[db01]	; correct to load a byte
    65 00000008 8A2425[00000000]        	mov	ah,[db01]	; correct to load a byte
    66 0000000F 668B0425[16000000]      	mov	ax,[dw01]	; correct to load a word
    67 00000017 8B0425[2E000000]        	mov	eax,[dd01]	; correct to load a double word
    68 0000001E 488B0425[4A000000]      	mov	rax,[dq01]	; correct to load a quad word
    69                                  
    70 00000026 8A0425[00000000]        	mov	al,BYTE [db01]	; redundant, yet allowed
    71 0000002D 668B0425[00000000]      	mov	ax,[db01]	; no warning, loads two bytes
    72 00000035 8B0425[16000000]        	mov	eax,[dw01]	; no warning, loads two words
    73 0000003C 488B0425[2E000000]      	mov	rax,[dd01]	; no warning, loads two double words
    74                                  
    75                                  ;	mov	ax,BYTE [db01]	; error, size mismatch
    76                                  ;	mov	eax,WORD [dw01]	; error, size mismatch
    77                                  ;	mov	rax,WORD [dd01]	; error, size mismatch
    78                                  
    79                                  ;	push	BYTE [db01]	; error, can not push a byte
    80 00000044 66FF3425[16000000]      	push	WORD [dw01]	; "push" needs to know size 2-byte
    81                                  ;	push	DWORD [dd01]	; error, can not push a 4-byte
    82 0000004C FF3425[4A000000]        	push	QWORD [dq01]	; OK in 64-bit mode, pushes a quad word
    83                                  
    84                                  ;	push	eax		; error, wrong size, need 64-bit
    85 00000053 50                      	push	rax
    86                                  	
    87 00000054 D90425[46000000]        	fld	DWORD [dd05]	; floating load 32-bit
    88 0000005B DD0425[7A000000]        	fld	QWORD [dq05]	; floating load 64-bit
    89                                  		
    90 00000062 BB00000000              	mov	rbx,0		; exit code, 0=normal
    91 00000067 B801000000              	mov	rax,1		; exit command to kernel
    92 0000006C CD80                    	int	0x80		; interrupt 80 hex, call kernel
    93                                  
    94                                  ; end testdata.asm

Lecture 3 Registers, syntax, sections

The Intel x86-64 has many registers and named sub-registers.
The older 32-bit, 16-bit and 8-bit registers are the low parts of
the 64-bit registers. This is why your 16-bit Intel programs will still run.
Here are some that are used in assembly language programming
and debugging (the "dash number" gives the number of bits).
Register names are typically typed in lower case.

+---------------------------------+  A register
|RAX-64                           |                          
|    +---------------------------+|  RAX really extended accumulator
|    | EAX-32 +-----------------+||  EAX extended accumulator
|    |        |       AX-16     |||  (lower part of dividend)
|    |        |+--------+------+|||  (quotient after division)
|    |        ||  AH-8  | AL-8 ||||  (lower part of product)
|    |        |+--------+------+|||  (H for high, L for low)
|    |        +-----------------+||
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  B register
|RBX-64                           |
|    +---------------------------+|  RBX 64-bit base register
|    | EBX-32 +-----------------+||  (EBX is the 32-bit base register)
|    |        |       BX-16     |||  (BX addresses the DS segment by default)
|    |        |+--------+------+|||
|    |        ||  BH-8  | BL-8 ||||
|    |        |+--------+------+|||
|    |        +-----------------+||
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  C register
|RCX-64                           |
|    +---------------------------+|  RCX 64-bit counter
|    | ECX-32 +-----------------+||  (string and loop operations)
|    |        |       CX-16     |||  (ECX is a 32 bit counter)
|    |        |+--------+------+|||  (CX is a 16 bit counter)
|    |        ||  CH-8  | CL-8 ||||
|    |        |+--------+------+|||
|    |        +-----------------+||
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  D register
|RDX-64                           |
|    +---------------------------+|  RDX 64-bit data register
|    | EDX-32 +-----------------+||  (DX holds the port number for IN and OUT)
|    |        |       DX-16     |||  (remainder after divide)
|    |        |+--------+------+|||  (upper part of dividend)
|    |        ||  DH-8  | DL-8 ||||  (upper part of product)
|    |        |+--------+------+|||
|    |        +-----------------+||
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  Stack Pointer
|RSP-64                           |
|    +---------------------------+|  RSP 64-bit stack pointer
|    | ESP-32     +-------------+||  ESP extended stack pointer
|    |            | SP-16       |||  SP  stack pointer
|    |            +-------------+||  (used by PUSH and POP)
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  Base Pointer
|RBP-64                           |
|    +---------------------------+|  RBP 64-bit base pointer
|    | EBP-32     +-------------+||  EBP extended base pointer
|    |            | BP-16       |||  (by convention, caller's stack frame)
|    |            +-------------+||  (BP in SS segment)
|    +---------------------------+|  We save it, push then pop
+---------------------------------+

+---------------------------------+  Source Index
|RSI-64                           |
|    +---------------------------+|  RSI 64-bit source index
|    | ESI-32     +-------------+||  ESI extended source index
|    |            | SI-16       |||  SI  source index
|    |            +-------------+||  (SI in DS segment)
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  Destination Index
|RDI-64                           |
|    +---------------------------+|  RDI 64-bit destination index
|    | EDI-32     +-------------+||  EDI extended destination index
|    |            | DI-16       |||  DI  destination index
|    |            +-------------+||  (DI in ES segment)
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  Instruction Pointer
|RIP-64                           |
|    +---------------------------+|  RIP 64-bit instruction pointer
|    | EIP-32     +-------------+||  EIP extended instruction pointer
|    |            | IP-16       |||  IP  instruction pointer
|    |            +-------------+||  set by jump and call
|    +---------------------------+|
+---------------------------------+

+---------------------------------+  Flags (status and control bits)
|RFLAGS-64                        |
|    +---------------------------+|   RFLAGS 64-bit flags
|    | EFLAGS-32  +-------------+||   EFLAGS extended flags
|    |            | FLAGS-16    |||   FLAGS
|    |            +-------------+||   (not a register name!)
|    +---------------------------+|   (must use PUSHF and POPF)
+---------------------------------+

Additional 64-bit registers are R8, R9, R10, R11, R12, R13, R14, R15

128-bit registers for SSE instructions are xmm0, ..., xmm15
They are also used to pass floating point arguments, for example to printf.

Use of registers and little endian
see  testreg_64.asm for register syntax 
see  testreg_64.lst for binary encoding

Just a snippet of testreg_64.asm :
	section .data  		; preset constants, writeable
aa8:	db	8		; 8-bit
aa16:	dw	16		; 16-bit
aa32:	dd	32		; 32-bit
aa64:	dq	64		; 64-bit
		
	section .text		; instructions, code segment
	mov	rax,[aa64]	; five registers in RAX
	mov	eax,[aa32]	; four registers in EAX
	mov	ax,[aa16]
	mov	ah,[aa8]
	mov	al,[aa8]

Just a snippet of testreg_64.lst
(line number, hex address in segment, hex data, assembly language)
(note: byte 10 hex is 16 decimal, 20 hex is 32 decimal, etc.)

     8 00000000 08                      aa8:	db	8
     9 00000001 1000                    aa16:	dw	16
    10 00000003 20000000                aa32:	dd	32
    11 00000007 4000000000000000        aa64:	dq	64

    24 00000001 488B0425[07000000]      	mov	rax,[aa64]
    25 00000009 8B0425[03000000]        	mov	eax,[aa32]
    26 00000010 668B0425[01000000]      	mov	ax,[aa16]
    27 00000018 8A2425[00000000]        	mov	ah,[aa8]
    28 0000001F 8A0425[00000000]        	mov	al,[aa8]

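OH! Did I forget to mention that Intel is a "little endian" machine?
The bytes of a multi-byte value are stored backwards to the way
we write numbers in English.
The little end, the least significant byte, is first, at the smallest address.
That is why  dd 32  appears as 20000000 in the listing above.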


Other registers, which are not extended and remain 16-bit, include:
              +-------------+   CS code segment
              | CS-16       |
              +-------------+

              +-------------+   SS stack segment
              | SS-16       |
              +-------------+

              +-------------+   DS data segment
              | DS-16       |   (current module)
              +-------------+

              +-------------+   ES data segment
              | ES-16       |   (calling module, destination string)
              +-------------+

              +-------------+   FS heap segment
              | FS-16       |
              +-------------+

              +-------------+   GS global segment
              | GS-16       |   (shared)
              +-------------+

There are also 80-bit floating point registers ST0, ..., ST7
(These are actually a stack, note FST vs FSTP etc)
There are also control registers CR0, ..., CR4
There are also debug registers DR0, DR1, DR2, DR3, DR6, DR7
There are also test registers TR3, ..., TR7



Basic NASM syntax
The basic syntax for a line in NASM is:

label:  opcode  operand(s) ; comment

The "label" is a case sensitive user name, followed by a colon.
The label is optional and when not present, indent the opcode.
The label should start in column one of the line.
The label may be on a line with nothing else or a comment.
In assembly language the "label" is an address,
not a value as it is in a compiled language.
Use brackets, [label], to get the value stored at that address.

The "opcode" is not case sensitive and may be a machine instruction
or an assembler directive (pseudo operation) or a macro call.
Typically, all "opcode" fields are neatly lined up starting in the
same column. Use of "tab" is OK.
Machine instructions may be preceded by a "prefix" such as:
a16, a32, o16, o32, and others.

"operand(s)" depend on the choice of "opcode".
An operand may have several parts separated by commas.
The parts may be a combination of register names, constants,
memory references in brackets [ ] or empty.

Comments are optional, yet encouraged.
Everything from the semicolon to the end of the line is
a comment, ignored by the assembler.
The semicolon may be in column one, making the entire line
a comment. Some editors put in two semicolons, no difference.

Sections or segments:
One specific assembler directive is the "section" or "SECTION"
directive. Four types of section are predefined for ELF format:

        section  .data    ; initialized data
                          ; writeable, not executable
                          ; default alignment 8 bytes

        section  .bss     ; uninitialized space for data
                          ; writeable, not executable
                          ; default alignment 8 bytes

        section  .rodata  ; initialized data
                          ; read only, not executable
                          ; default alignment 8 bytes

        section  .text    ; instructions (code)
                          ; not writeable, executable
                          ; default alignment 16 bytes

        section  other    ; any name other than .data, .bss,
                          ; .rodata, .text
                          ; your stuff
                          ; not executable, not writeable
                          ; default alignment 1 byte

Efficiency and samples
A few comments on efficiency:
My experience is that a good assembly language programmer
can make a small (about 100 lines) "C" program more
efficient than the  gcc  compiler. But, for larger
programs, the compiler will be more efficient.

Exceptions are, for example, the SGI IRIX  cc  compiler
that has super optimization for that specific machine.

For the Intel x86-64 here are some samples in nasm and from gcc
(different syntax but you should be able to recognize the instructions)
Focus on the loop, there is prologue and epilogue code that should
be included, yet was omitted. Note the test has "check" values
at each end of the array. There is no range testing in
either "C" or assembly language.

A simple loop loopint_64.asm

; loopint_64.asm  code loopint.c for nasm 
; /* loopint_64.c a very simple loop that will be coded for nasm */
; #include <stdio.h>
; int main()
; {
;   long int dd1[100]; // 100 could be 3 gigabytes
;   long int i;        // must be long for more than 2 gigabytes
;   dd1[0]=5; /* be sure loop stays 1..98 */
;   dd1[99]=9;
;   for(i=1; i<99; i++) dd1[i]=7;
;   printf("dd1[0]=%ld, dd1[1]=%ld, dd1[98]=%ld, dd1[99]=%ld\n",
;           dd1[0], dd1[1], dd1[98],dd1[99]);
;   return 0;
;}
; execution output is dd1[0]=5, dd1[1]=7, dd1[98]=7, dd1[99]=9
 
	section	.bss
dd1:	resq	100			; reserve 100 long int
i:	resq	1			; actually unused, kept in register

        section .data			; Data section, initialized variables
fmt:    db "dd1[0]=%ld, dd1[1]=%ld, dd1[98]=%ld, dd1[99]=%ld",10,0
	
        extern	printf			; the C function, to be called

	section .text
	global	main
main:	push	rbp			; set up stack

	mov	qword [dd1],5	   	; dd1[0]=5;  immediate to memory
	mov	qword [dd1+99*8],9 	; dd1[99]=9; indexed 99 qword

	mov 	rdi, 1*8		; i=1; index, will move by 8 bytes
loop1:	mov 	qword [dd1+rdi],7	; dd1[i]=7;
	add	rdi, 8			; i++;  8 bytes 
	cmp	rdi, 8*99		; i<99
	jne	loop1			; loop until incremented i=99
	
	mov	rdi, fmt		; pass address of format
	mov	rsi, qword [dd1]	; dd1[0]   first list parameter
	mov	rdx, qword [dd1+1*8]	; dd1[1]   second list parameter
	mov	rcx, qword [dd1+98*8]	; dd1[98]  third list parameter
	mov	r8,  qword [dd1+99*8]	; dd1[99]  fourth list parameter
	mov	rax, 0			; no xmm used
        call    printf			; Call C function

	pop	rbp			; restore stack
	mov	rax,0			; normal, no error, return value
	ret				; return 0;



Speed considerations must take into account cache and virtual memory
performance, the number of bytes transferred from RAM, and clock cycles.
On modern computer architectures, predicting speed by counting
instructions is almost impossible. For example,
the Pentium 4 translates the 80x86 code into RISC-like micro-operations and
is actually executing instructions that are different from the
assembly language. Carefully benchmarking complete applications is
about the only conclusive measure of efficiency.

"C" and other programming languages may call subroutines, functions,
procedures written in assembly language. Here is a small sample
using floating point just to show use of ST registers, mentioned in comments.
Main C program test_callf1_64.c
// test_callf1_64.c   test  callf1_64.asm 
// nasm -f elf64 -l callf1_64.lst callf1_64.asm
// gcc -m64 -o test_callf1_64 test_callf1_64.c callf1_64.o
// ./test_callf1_64 > test_callf1_64.out
  #include "callf1_64.h"
  #include <stdio.h>
int main()
{
  double L[2];
  printf("test_callf1_64.c using callf1_64.asm\n");
  L[0]=1.0;
  L[1]=2.0;
  callf1_64(L); // add 3.0 to L[0], add 4.0 to L[1]
  printf("L[0]=%e, L[1]=%e \n", L[0], L[1]);
  return 0;
}

Full with debug callf1_64.asm

Stripped down  callf1_64.asm  with no demo, no debug:
; callf1_64.asm  a basic structure for a subroutine to be called from "C"
; Parameter:   double *L
; Result: L[0]=L[0]+3.0  L[1]=L[1]+4.0

        global callf1_64	; linker must know name of subroutine

        SECTION .data		; Data section, initialized variables
a3:	dq	3.0		; 64-bit variable a3 initialized to 3.0
a4:	dq	4.0		; 64-bit variable a4 initialized to 4.0

	SECTION .text           ; Code section.

callf1_64:			; name must appear as a nasm label
        push	rbp		; save rbp

	mov	rax,rdi		; first, only, in parameter, address
				; add 3.0 to L[0]
	fld	qword [rax] 	; load L[0] (pushed on flt pt stack, st0)
	fadd	qword [a3]	; floating add 3.0 (to st0)
	fstp	qword [rax]	; store into L[0] (pop flt pt stack)

	fld	qword [rax+8] 	; load L[1] (pushed on flt pt stack, st0)
	fadd	qword [a4]	; floating add 4.0 (to st0)
	fstp	qword [rax+8]	; store into L[1] (pop flt pt stack)

	pop	rbp	        ; restore callers stack frame
        ret			; return

We did not need to save the floating point stack, we left it unchanged.
We could have used dt and tword for 80-bit floating point.
Calling printf passes floating point arguments in xmm registers.


Homework 2 is assigned

Over the years I have kept snippets of computer related news.
Latest tidbit on Intel

If time: A computer can fetch a new instruction every clock time
while each small part of the computer works on an earlier
instruction. The easiest way to understand
this is a pipeline. Think of water coming into a pipe, flowing
through and finally out the end.

Simple computer pipeline:
___________________________________________________________________
address -> instruction -> decode -> arithmetic -> memory -> finish 
___________________________________________________________________

We use registers that are all driven by the system clock. On each clock the
instruction moves to the next register (stage of the pipeline),
as shown in the following pipeline that takes 5 clocks per instruction:

  IF instruction fetch, IP is address into memory fetching instruction
  ID instruction decode and register read out of two values
  EX execute instruction or compute data memory address
  M  data memory access to store or fetch a data word
  WB write back value into general register


         IF       ID          EX        M       WB
    +--+     +--+        +--+     +--+     +--+
    |  |     |  |        | A|-|\  |  |     |  |
    |  |     |  |    /---|  | \ \_|  |     |  |
    |IP|-(I)-|IR|-(R)  = |  | / / |  |-(D)-|  |--+
    |  |     |  |  ^ \---| B|-|/  |  |     |  |  |
    +--+     +--+  |     +--+     +--+     +--+  |
     ^        ^    |      ^   ALU  ^        ^    |
     |        |    |      |        |        |    |
 clk-+--------+-----------+--------+--------+    |
                   |                             |
                   +-----------------------------+

Lecture 4 Arithmetic and shifting

Both integer and floating point arithmetic are demonstrated.
In order to make the source code smaller, a macro is defined
to print out results. The equivalent "C" program is given as
comments.

First, see how to call the "C" library function, printf, to make
it easier to print values:
Look at the file printf1_64.asm

; printf1_64.asm   print an integer from storage and from a register
; Assemble:	nasm -f elf64 -l printf1_64.lst  printf1_64.asm
; Link:		gcc -m64 -o printf1_64  printf1_64.o
; Run:		./printf1_64 > printf1_64.out
; Output:	a=5, rax=7

; Equivalent C code
; /* printf1.c  print a long int, 64-bit, and an expression */
; #include <stdio.h>
; int main()
; {
;   long int a=5;
;   printf("a=%ld, rax=%ld\n", a, a+2);
;   return 0;
; }

; Declare external function
        extern	printf		; the C function, to be called

        SECTION .data		; Data section, initialized variables

	a:	dq	5	; long int a=5;
fmt:    db "a=%ld, rax=%ld", 10, 0	; The printf format, "\n", then 0 terminator


        SECTION .text           ; Code section.

        global main		; the standard gcc entry point
main:				; the program label for the entry point
        push    rbp		; set up stack frame
	
	mov	rax,[a]		; put "a" from store into register
	add	rax,2		; a+2  add constant 2
	mov	rdi,fmt		; format for printf
	mov	rsi,[a]         ; first parameter for printf
	mov	rdx,rax         ; second parameter for printf
	mov	rax,0		; no xmm registers
        call    printf		; Call C function

	pop	rbp		; restore stack

	mov	rax,0		; normal, no error, return value
	ret			; return
	
Printing floating point
Now, we may need to print "float" and "double" and calling  printf
gets more complicated. Still easier than doing your own conversion.
Look at the file printf2_64.asm
Output is printf2_64.out

; printf2_64.asm  use "C" printf on char, string, int, long int, float, double
; 
; Assemble:	nasm -f elf64 -l printf2_64.lst  printf2_64.asm
; Link:		gcc -m64 -o printf2_64  printf2_64.o
; Run:		./printf2_64 > printf2_64.out
; Output:	cat printf2_64.out
; 
; A similar "C" program   printf2_64.c 
; #include <stdio.h>
; int main()
; {
;   char      char1='a';            /* sample character */
;   char      str1[]="mystring";    /* sample string */
;   int       len=9;                /* string length, counts the 0 */
;   int       inta1=12345678;       /* sample integer 32-bit */
;   long int  inta2=12345678900;    /* sample long integer 64-bit */
;   long int  hex1=0x123456789ABCD; /* sample hexadecimal 64-bit*/
;   float     flt1=5.327e-30;       /* sample float 32-bit */
;   double    flt2=-123.4e300;      /* sample double 64-bit*/
; 
;   printf("printf2_64: flt2=%e\n", flt2);
;   printf("char1=%c, str1=%s, len=%d\n", char1, str1, len);
;   printf("char1=%c, str1=%s, len=%d, inta1=%d, inta2=%ld\n",
;          char1, str1, len, inta1, inta2);
;   printf("hex1=%lX, flt1=%e, flt2=%e\n", hex1, flt1, flt2);
;   return 0;
; }
        extern printf                   ; the C function to be called

        SECTION .data                   ; Data section
					; format strings for printf
fmt2:   db "printf2: flt2=%e", 10, 0
fmt3:	db "char1=%c, str1=%s, len=%d", 10, 0
fmt4:	db "char1=%c, str1=%s, len=%d, inta1=%d, inta2=%ld", 10, 0
fmt5:	db "hex1=%lX, flt1=%e, flt2=%e", 10, 0
	
char1:	db	'a'			; a character 
str1:	db	"mystring",0	        ; a C string, "string" needs 0
len:	equ	$-str1			; len has value, not an address
inta1:	dd	12345678		; integer 12345678, note dd
inta2:	dq	12345678900		; long integer 12345678900, note dq
hex1:	dq	0x123456789ABCD	        ; long hex constant, note dq
flt1:	dd	5.327e-30		; 32-bit floating point, note dd
flt2:	dq	-123.456789e300	        ; 64-bit floating point, note dq

	SECTION .bss
		
flttmp:	resq 1			        ; 64-bit temporary for printing flt1
	
        SECTION .text                   ; Code section.

        global	main		        ; "C" main program 
main:				        ; label, start of main program
	push    rbp			; set up stack frame 
	fld	dword [flt1]	        ; need to convert 32-bit to 64-bit
	fstp	qword [flttmp]          ; floating load makes 80-bit,
	                                ; store as 64-bit
	mov	rdi,fmt2
	movq	xmm0, qword [flt2]
	mov	rax, 1			; 1 xmm register
	call	printf

	mov	rdi, fmt3		; first arg, format
	movzx	esi, byte [char1]	; second arg, char, zero extended
	mov	rdx, str1		; third arg, string
	mov	rcx, len		; fourth arg, int
	mov	rax, 0			; no xmm used
	call	printf

	mov	rdi, fmt4		; first arg, format
	movzx	esi, byte [char1]	; second arg, char, zero extended
	mov	rdx, str1		; third arg, string
	mov	rcx, len		; fourth arg, int
	movsxd	r8, dword [inta1]	; fifth arg, inta1 sign extended 32->64
	mov	r9, [inta2]		; sixth arg, inta2
	mov	rax, 0			; no xmm used
	call	printf

	mov	rdi, fmt5		; first arg, format
	mov	rsi, [hex1]		; second arg, long int printed as hex
	movq	xmm0, qword [flttmp]    ; first double
	movq	xmm1, qword [flt2]	; second double
	mov	rax, 2			; 2 xmm used
	call	printf
	
	pop	rbp			; restore stack	
        mov     rax, 0			; exit code, 0=normal
        ret				; main returns to operating system


Integer arithmetic	
Now, for integer arithmetic, look at the file intarith_64.asm
Output is intarith_64.out
C version is intarith_64.c
Since all the lines use the same format, a macro was created
to do the call on printf.

; intarith_64.asm    show some simple C code and corresponding nasm code
;                    the nasm code is one sample, not unique
;
; compile:	nasm -f elf64 -l intarith_64.lst  intarith_64.asm
; link:		gcc -m64 -o intarith_64  intarith_64.o
; run:		./intarith_64 > intarith_64.out
;
; the output from running intarith_64.asm and intarith.c is:	
; c=5  , a=3, b=4, c=5
; c=a+b, a=3, b=4, c=7
; c=a-b, a=3, b=4, c=-1
; c=a*b, a=3, b=4, c=12
; c=c/a, a=3, b=4, c=4
;
;The file  intarith.c  is:
;  /* intarith.c */
;  #include <stdio.h>
;  int main()
;  { 
;    long int a=3, b=4, c;
;    c=5;
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=5  ", a, b, c);
;    c=a+b;
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=a+b", a, b, c);
;    c=a-b;
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=a-b", a, b, c);
;    c=a*b;
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=a*b", a, b, c);
;    c=c/a;
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=c/a", a, b, c);
;    return 0;
; }
        extern printf		; the C function to be called

%macro	pabc 1			; a "simple" print macro
	section .data
.str	db	%1,0		; %1 is first actual in macro call
	section .text
        mov     rdi, fmt4	; first arg, format
	mov	rsi, .str	; second arg
	mov     rdx, [a]        ; third arg
	mov     rcx, [b]        ; fourth arg
	mov     r8, [c]         ; fifth arg
	mov     rax, 0	        ; no xmm used
	call    printf		; Call C function
%endmacro
	
	section .data  		; preset constants, writeable
a:	dq	3		; 64-bit variable a initialized to 3
b:	dq	4		; 64-bit variable b initialized to 4
fmt4:	db "%s, a=%ld, b=%ld, c=%ld",10,0	; format string for printf
	
	section .bss 		; uninitialized space
c:	resq	1		; reserve a 64-bit word

	section .text		; instructions, code segment
	global	 main		; for gcc standard linking
main:				; label
	push 	rbp		; set up stack
lit5:				; c=5;
	mov	rax,5	 	; 5 is a literal constant
	mov	[c],rax		; store into c
	pabc	"c=5  "		; invoke the print macro
	
addb:				; c=a+b;
	mov	rax,[a]	 	; load a
	add	rax,[b]		; add b
	mov	[c],rax		; store into c
	pabc	"c=a+b"		; invoke the print macro
	
subb:				; c=a-b;
	mov	rax,[a]	 	; load a
	sub	rax,[b]		; subtract b
	mov	[c],rax		; store into c
	pabc	"c=a-b"		; invoke the print macro
	
mulb:				; c=a*b;
	mov	rax,[a]	 	; load a (must be rax for multiply)
	imul	qword [b]	; signed integer multiply by b
	mov	[c],rax		; store bottom half of product into c
	pabc	"c=a*b"		; invoke the print macro
	
diva:				; c=c/a;
	mov	rax,[c]	 	; load c
	cqo			; sign extend rax into rdx, upper half of dividend
	idiv	qword [a]	; divide double register rdx:rax by a
	mov	[c],rax		; store quotient into c
	pabc	"c=c/a"		; invoke the print macro

	pop	rbp		; pop stack
        mov     rax,0           ; exit code, 0=normal
	ret			; main returns to operating system


Note that two registers are used for general multiply and divide.

        bbbb  [mem] a product of 64-bits times 64-bits is 128-bits
 imul   bbbb  rax
   ---------
rdx bbbbbbbb  rax   the upper part of the product is in rdx
                    the lower part of the product is in rax


rdx bbbbbbbb  rax  before divide, the upper part of dividend is in rdx
                                  the lower part of dividend is in rax
 idiv   bbbb  [mem] the divisor
    --------
                   after divide,  the quotient is in rax
                                  the remainder is in rdx

Floating point arithmetic
Now, for floating point arithmetic, look at the file fltarith_64.asm
Output is fltarith_64.out
C version is fltarith_64.c
Since all the lines use the same format, a macro was created
to do the call on printf.

Note the many similarities to integer arithmetic, yet some basic differences.

; fltarith_64.asm   show some simple C code and corresponding nasm code
;                   the nasm code is one sample, not unique
;
; compile  nasm -f elf64 -l fltarith_64.lst  fltarith_64.asm
; link     gcc -m64 -o fltarith_64  fltarith_64.o
; run      ./fltarith_64 > fltarith_64.out
;
; the output from running fltarith and fltarithc is:	
; c=5.0, a=3.000000e+00, b=4.000000e+00, c=5.000000e+00
; c=a+b, a=3.000000e+00, b=4.000000e+00, c=7.000000e+00
; c=a-b, a=3.000000e+00, b=4.000000e+00, c=-1.000000e+00
; c=a*b, a=3.000000e+00, b=4.000000e+00, c=1.200000e+01
; c=c/a, a=3.000000e+00, b=4.000000e+00, c=4.000000e+00
; a=i  , a=8.000000e+00, b=1.600000e+01, c=1.600000e+01
; a<=b , a=8.000000e+00, b=1.600000e+01, c=1.600000e+01
; b==c , a=8.000000e+00, b=1.600000e+01, c=1.600000e+01
;The file  fltarith.c  is:
;  #include <stdio.h>
;  int main()
;  { 
;    double a=3.0, b=4.0, c;
;    long int i=8;
;
;    c=5.0;
;    printf("%s, a=%e, b=%e, c=%e\n","c=5.0", a, b, c);
;    c=a+b;
;    printf("%s, a=%e, b=%e, c=%e\n","c=a+b", a, b, c);
;    c=a-b;
;    printf("%s, a=%e, b=%e, c=%e\n","c=a-b", a, b, c);
;    c=a*b;
;    printf("%s, a=%e, b=%e, c=%e\n","c=a*b", a, b, c);
;    c=c/a;
;    printf("%s, a=%e, b=%e, c=%e\n","c=c/a", a, b, c);
;    a=i;
;    b=a+i;
;    i=b;
;    c=i;
;    printf("%s, a=%e, b=%e, c=%e\n","a=i  ", a, b, c);
;    if(a<b) printf("%s, a=%e, b=%e, c=%e\n","a<=b ", a, b, c);
;    else    printf("%s, a=%e, b=%e, c=%e\n","a>b  ", a, b, c);
;    if(b==c)printf("%s, a=%e, b=%e, c=%e\n","b==c ", a, b, c);
;    else    printf("%s, a=%e, b=%e, c=%e\n","b!=c ", a, b, c);
;    return 0;
; }

        extern printf		; the C function to be called

%macro	pabc 1			; a "simple" print macro
	section	.data
.str	db	%1,0		; %1 is macro call first actual parameter
	section .text
				; args go in registers: rdi, rsi, xmm0..xmm2
        mov	rdi, fmt	; address of format string
	mov	rsi, .str	; string passed to macro
	movq	xmm0, qword [a]	; first floating point in fmt
	movq	xmm1, qword [b]	; second floating point
	movq	xmm2, qword [c]	; third floating point
	mov	rax, 3		; 3 floating point arguments to printf
        call    printf          ; Call C function
%endmacro
	
	section	.data  		; preset constants, writeable
a:	dq	3.0		; 64-bit variable a initialized to 3.0
b:	dq	4.0		; 64-bit variable b initialized to 4.0
i:	dq	8		; a 64 bit integer
five:	dq	5.0		; constant 5.0
fmt:    db "%s, a=%e, b=%e, c=%e",10,0	; format string for printf
	
	section .bss 		; uninitialized space
c:	resq	1		; reserve a 64-bit word

	section .text		; instructions, code segment
	global	main		; for gcc standard linking
main:				; label

	push	rbp		; set up stack
lit5:				; c=5.0;
	fld	qword [five]	; 5.0 constant
	fstp	qword [c]	; store into c
	pabc	"c=5.0"		; invoke the print macro
	
addb:				; c=a+b;
	fld	qword [a] 	; load a (pushed on flt pt stack, st0)
	fadd	qword [b]	; floating add b (to st0)
	fstp	qword [c]	; store into c (pop flt pt stack)
	pabc	"c=a+b"		; invoke the print macro
	
subb:				; c=a-b;
	fld	qword [a] 	; load a (pushed on flt pt stack, st0)
	fsub	qword [b]	; floating subtract b (to st0)
	fstp	qword [c]	; store into c (pop flt pt stack)
	pabc	"c=a-b"		; invoke the print macro
	
mulb:				; c=a*b;
	fld	qword [a]	; load a (pushed on flt pt stack, st0)
	fmul	qword [b]	; floating multiply by b (to st0)
	fstp	qword [c]	; store product into c (pop flt pt stack)
	pabc	"c=a*b"		; invoke the print macro
	
diva:				; c=c/a;
	fld	qword [c] 	; load c (pushed on flt pt stack, st0)
	fdiv	qword [a]	; floating divide by a (to st0)
	fstp	qword [c]	; store quotient into c (pop flt pt stack)
	pabc	"c=c/a"		; invoke the print macro

intflt:				; a=i;
	fild	qword [i]	; load integer as floating point
	fst	qword [a]	; store the floating point (no pop)
	fadd	st0		; b=a+i; 'a' as 'i'  already on flt stack
	fst	qword [b]	; store sum (no pop) 'b' still on stack
	fistp	qword [i]	; i=b; store floating point as integer
	fild	qword [i]	; c=i; load again from ram (redundant)
	fstp	qword [c]
	pabc	"a=i  "		; invoke the print macro

cmpflt:	fld	qword [b]	; into st0, then pushed to st1
	fld	qword [a]	; in st0
	fcomip	st0,st1		; a compare b, pop a
	ja	cmpfl2		; fcomip sets CF and ZF, use unsigned jumps
	pabc	"a<=b "
	jmp	cmpfl3
cmpfl2:	
	pabc	"a>b  "
cmpfl3:
	fld	qword [c]	; should equal [b]
	fcomip  st0,st1
	jne	cmpfl4
	pabc	"b==c "
	jmp	cmpfl5
cmpfl4:
	pabc	"b!=c "
cmpfl5:

	pop	rbp		; pop stack
        mov     rax,0           ; exit code, 0=normal
	ret			; main returns to operating system

Shift data in a register
Refer to nasmdoc.txt for details.
A brief summary is provided here.
"reg" is an 8-bit, 16-bit, 32-bit or 64-bit register
"count" is the number of bits to shift, a constant or the CL register
"right" moves contents of the register to the right, makes it smaller
"left" moves contents of the register to the left, makes it bigger

  SAL   reg,count   shift arithmetic left
  SAR   reg,count   shift arithmetic right (sign extension)
  SHL   reg,count   shift left (logical, zero fill)
  SHR   reg,count   shift right (logical, zero fill)
  ROL   reg,count   rotate left
  ROR   reg,count   rotate right
  SHLD  reg1,reg2,count  shift left double-register 
  SHRD  reg1,reg2,count  shift right double-register

An example of using the various shifts is in: shift_64.asm
Output is shift_64.out
Just to make it easy to check, we keep all shift amounts a multiple
of 4, 4 bits per hex digit in output.

; shift_64.asm    the nasm code is one sample, not unique
;
; compile:	nasm -f elf64 -l shift_64.lst  shift_64.asm
; link:		gcc -m64 -o shift_64  shift_64.o
; run:		./shift_64 > shift_64.out
;
; the output from running shift.asm (zero filled) is:	
; shl rax,4, old rax=ABCDEF0987654321, new rax=BCDEF09876543210, 
; shl rax,8, old rax=ABCDEF0987654321, new rax=CDEF098765432100, 
; shr rax,4, old rax=ABCDEF0987654321, new rax= ABCDEF098765432, 
; sal rax,8, old rax=ABCDEF0987654321, new rax=CDEF098765432100, 
; sar rax,4, old rax=ABCDEF0987654321, new rax=FABCDEF098765432, 
; rol rax,4, old rax=ABCDEF0987654321, new rax=BCDEF0987654321A, 
; ror rax,4, old rax=ABCDEF0987654321, new rax=1ABCDEF098765432, 
; shld rdx,rax,8, old rdx:rax=0,ABCDEF0987654321,
;                 new rax=ABCDEF0987654321 rdx=              AB, 
; shl rax,8     , old rdx:rax=0,ABCDEF0987654321,
;                 new rax=CDEF098765432100 rdx=              AB, 
; shrd rdx,rax,8, old rdx:rax=0,ABCDEF0987654321,
;                 new rax=ABCDEF0987654321 rdx=2100000000000000, 
; shr rax,8     , old rdx:rax=0,ABCDEF0987654321,
;                 new rax=  ABCDEF09876543 rdx=2100000000000000, 

        extern printf		; the C function to be called

%macro	prt	1		; old and new rax
	section .data
.str	db	%1,0		; %1 is which shift string
	section .text
        mov	rdi, fmt	; address of format string
	mov	rsi, .str 	; callers string
	mov	rdx,rax		; new value
	mov	rax, 0		; no floating point
        call    printf          ; Call C function
%endmacro

%macro	prt2	1		; old and new rax,rdx
	section .data
.str	db	%1,0		; %1 is which shift
	section .text
        mov	rdi, fmt2	; address of format string
	mov	rsi, .str 	; callers string
	mov	rcx, rdx	; new rdx, copied before the next line overwrites rdx
	mov	rdx, rax	; new rax
	mov	rax, 0		; no floating point
        call    printf          ; Call C function
%endmacro

	 section .bss
raxsave: resq	1		; save rax while calling a function 
rdxsave: resq	1		; save rdx while calling a function 
	
	section .data  		; preset constants, writeable
b64:	dq	0xABCDEF0987654321	; data to shift
fmt:    db "%s, old rax=ABCDEF0987654321, new rax=%16lX, ",10,0	; format string
fmt2:   db "%s, old rdx:rax=0,ABCDEF0987654321,",10,"                new rax=%16lX rdx=%16lX, ",10,0
	
	section .text		; instructions, code segment
	global	 main		; for gcc standard linking
main:	push	rbp		; set up stack
	
shl1:	mov	rax, [b64]	; data to shift
	shl	rax, 4		; shift rax 4 bits, one hex position left
	prt	"shl rax,4 "	; invoke the print macro

shl4:	mov	rax, [b64]	; data to shift
	shl	rax,8		; shift rax 8 bits, two hex positions left
	prt	"shl rax,8 "	; invoke the print macro

shr4:	mov	rax, [b64]	; data to shift
	shr	rax,4		; shift
	prt	"shr rax,4 "	; invoke the print macro

sal4:	mov	rax, [b64]	; data to shift
	sal	rax,8		; shift
	prt	"sal rax,8 "	; invoke the print macro

sar4:	mov	rax, [b64]	; data to shift
	sar	rax,4		; shift
	prt	"sar rax,4 "	; invoke the print macro

rol4:	mov	rax, [b64]	; data to shift
	rol	rax,4		; shift
	prt	"rol rax,4 "	; invoke the print macro

ror4:	mov	rax, [b64]	; data to shift
	ror	rax,4		; shift
	prt	"ror rax,4 "	; invoke the print macro

shld4:	mov	rax, [b64]	; data to shift
	mov	rdx,0		; register receiving bits
	shld	rdx,rax,8	; shift
	mov	[raxsave],rax	; save, destroyed by function
	mov	[rdxsave],rdx	; save, destroyed by function
	prt2	"shld rdx,rax,8"; invoke the print macro

shla:	mov	rax,[raxsave]	; restore, destroyed by function
	mov	rdx,[rdxsave]	; restore, destroyed by function
	shl	rax,8		; finish double shift, both registers
	prt2	"shl rax,8     "; invoke the print macro

shrd4:	mov	rax, [b64]	; data to shift
	mov	rdx,0		; register receiving bits
	shrd	rdx,rax,8	; shift
	mov	[raxsave],rax	; save, destroyed by function
	mov	[rdxsave],rdx	; save, destroyed by function
	prt2	"shrd rdx,rax,8"; invoke the print macro

shra:	mov	rax,[raxsave]	; restore, destroyed by function
	mov	rdx,[rdxsave]	; restore, destroyed by function
	shr	rax,8		; finish double shift, both registers
	prt2	"shr rax,8     "; invoke the print macro

	pop	rbp		; restore stack
	mov     rax,0           ; exit code, 0=normal
	ret			; main returns to operating system

First project is assigned.
You may want to do this in Lab this Friday.
www.cs.umbc.edu/~squire/cmpe310_proj.shtml


Instructions and data come from the cache

The "cache" is very high speed memory on the CPU chip.
Typical CPUs can get a word out of the cache every clock.
In order to be as fast as the logic on the CPU, the cache
cannot be as large as the main memory. Typical cache sizes
are hundreds of kilobytes to a few megabytes.

There is typically a level 1 instruction cache and a level 1
data cache. These would be in the blocks on our project
schematic labeled instruction memory and data memory.

Then, there is typically a level 2 unified cache that is
larger and may be slower than the level 1 caches. Unified
means it is used for both instructions and data.

Some computers have a level 3 cache that is larger and
slower than the level 2 cache. Multi core computers
have at least an L1 instruction cache and an L1 data cache
for every core. Some have an L3 unified cache that is
available to all cores. Thus data can go from one core
to another without going through RAM.


     +-----------+   +-----------+
     | L1 Icache |   | L1 Dcache |
     +-----------+   +-----------+
           |               |
     +---------------------------+
     | L2 unified cache          |
     +---------------------------+
              |
           +------+
           | RAM  |
           +------+
              |
           +------+
           | Disc |  or Solid State Drive, SSD
           +------+

The goal of the computer system is to use the cache for instructions
and data in order to execute instructions as fast as possible.
Typical RAM requires 5 to 10 clocks to get an instruction or
data word. A typical CPU does prefetching and branch prediction
to bring instructions into the cache in order to minimize
stalls waiting for instructions. You will simulate a cache and
the associated stalls in part 3 of your project.

Intel IA-64 cache structure, page 3
IA-64 Itanium


An approximate hierarchy is:

                size    response
     CPU                  0.5 ns  2 GHz clock
     L1 cache  .032MB     0.5 ns  one for instructions, another for data
     L2 cache     4MB     1.0 ns
     RAM       4000MB     4.0 ns
     disk    500000MB     4.0 ms = 4,000,000 ns

A program is loaded from disk, into RAM, then as needed
into L2 cache, then as needed into L1 cache, then as needed
into the CPU pipelines.
1)  The CPU initiates the request by sending the L1 cache an address.
    If the L1 cache has the value at that address, the value is quickly
    sent to the CPU.
2)  If the L1 cache does not have the value, the address is passed to
    the L2 cache. If the L2 cache has the value, the value is quickly
    passed to the L1 cache. The L1 cache passes the value to the CPU.
3)  If the L2 cache does not have the value at the address, the
    address is passed to a memory controller that must access RAM
    in order to get the value. The value passes from RAM, through
    the memory controller to the L2 cache then to the L1 cache then
    to the CPU.

This may seem tedious yet each level is optimized to provide good
performance for the total system. One reason the system is fast is
because of wide data paths. The RAM data path may be 128-bits or
256-bits wide. This wide data path may continue through the
L2 cache and L1 cache. The cache is organized in blocks
(the terms "lines" or "entries" may be used in place of "blocks")
that provide for many bytes of data to be accessed in parallel.
Reading from a cache behaves like combinational logic; it
is not clocked. Writing into a cache must happen on
a clock edge.

A cache receives a computer address, which is a binary number.
The sizes of the parts of a cache are all powers of two. The basic unit of
an address is a byte. For our study, four bytes, one word, will
always be fetched from the cache. When working the homework
problems be sure to read the problem carefully to determine if
the addresses given are byte addresses or word addresses.
It will be easiest and less error prone if all addresses are
converted to binary for working the homework.

The basic elements of a cache are:
  A valid bit: This is a 1 if values are in the cache block
  A tag field: This is the upper part of the address for
               the values in the cache block.
  Cache block: The values that may be instructions or data

Here is the absolutely simplest cache with one word blocks

Lecture 5 Using debugger, options

See www.csee.umbc.edu/help/nasm/nasm_64.shtml for notes on using debugger.

A program that prints where its sections are allocated
(in virtual memory) is where_64.asm
My output, yours should be different, is
where_64.out

; where_64.asm   print addresses of sections
; Assemble:	nasm -g -f elf64 -l where_64.lst  where_64.asm
; Link:		gcc -g3 -m64 -o where_64  where_64.o
; Run:		./where_64 > where_64.out
; Output:	you need to run it, on my computer
; data    a: at 601034
; bss     b: at 60108C
; rodata  c: at 400640
; code main: at 400530
;
; to debug, typically after  segfault
; gdb where_64
; run
; break main
; disassemble main
; backtrace
;             hopefully this will point to where the problem is in source

        extern	printf		; the C function, to be called
        section .data		; Data section, initialized variables
a:	db	0,1,2,3,4,5,6,7
fmt:    db "data    a: at %lX",10
	db "bss     b: at %lX",10
	db "rodata  c: at %lX",10
	db "code main: at %lX",10,0 

	section .bss		; reserved storage, uninitialized
b:	resq	8

	section	.rodata		; read only initialized storage
c:	db	7,6,5,4,3,2,1,0
	
        section .text           ; Code section.
        global main		; the standard gcc entry point
main:				; the program label for the entry point
	push	rbp
	mov	rbp,rsp
	push	rbx		; save callers registers
	
	mov	rdi,fmt		; pass address of fmt to printf
	lea	rsi,[a]		; using load effective address
	lea	rdx,[b]		; using load effective address
	lea	rcx,[c]		; using load effective address
	lea	r8,[main]	; using load effective address
	mov	rax,0		; no float
        call    printf		; Call C function

	mov	rdi,fmt		; pass address of fmt to printf
	mov	rsi,a		; just loading address
	mov	rdx,b		; just loading address
	mov	rcx,c		; just loading address
	mov	r8,main		; just loading address
	mov	rax,0		; no float
        call    printf		; Call C function

	pop	rbx		; restore callers registers
	mov	rsp,rbp
	pop	rbp
	mov	rax,0		; normal, no error, return value
	ret			; return

gdb  disassemble main produces:
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000000400530 <+0>:	push   %rbp
0x0000000000400531 <+1>:	mov    %rsp,%rbp
0x0000000000400534 <+4>:	push   %rbx
0x0000000000400535 <+5>:	movabs $0x60103c,%rdi
0x000000000040053f <+15>:	lea    0x601034,%rsi
0x0000000000400547 <+23>:	lea    0x60108c,%rdx
0x000000000040054f <+31>:	lea    0x400640,%rcx
0x0000000000400557 <+39>:	lea    0x400530,%r8
0x000000000040055f <+47>:	mov    $0x0,%eax
0x0000000000400564 <+52>:	callq  0x400410
0x0000000000400569 <+57>:	movabs $0x60103c,%rdi
0x0000000000400573 <+67>:	movabs $0x601034,%rsi
0x000000000040057d <+77>:	movabs $0x60108c,%rdx
0x0000000000400587 <+87>:	movabs $0x400640,%rcx
0x0000000000400591 <+97>:	movabs $0x400530,%r8
0x000000000040059b <+107>:	mov    $0x0,%eax
0x00000000004005a0 <+112>:	callq  0x400410
0x00000000004005a5 <+117>:	pop    %rbx
0x00000000004005a6 <+118>:	mov    %rbp,%rsp
0x00000000004005a9 <+121>:	pop    %rbp
0x00000000004005aa <+122>:	mov    $0x0,%eax
0x00000000004005af <+127>:	retq
End of assembler dump.

part of where_64.lst
Note address of each section starts at zero
.data
    18 00000000 0001020304050607        a:	db	0,1,2,3,4,5,6,7
    19 00000008 646174612020202061-     fmt:    db "data    a: at %lX",10
    20 00000011 3A20617420256C580A 
    21 0000001A 627373202020202062-     	db "bss     b: at %lX",10
    22 00000023 3A20617420256C580A 
    23 0000002C 726F64617461202063-     	db "rodata  c: at %lX",10
    24 00000035 3A20617420256C580A 
    25 0000003E 636F6465206D61696E-     	db "code main: at %lX",10,0 
    26 00000047 3A20617420256C580A-
    27 00000050 00                 
    28                                  
.bss
    30 00000000           b:	resq	8
.rodata
    33 00000000 0706050403020100        c:	db	7,6,5,4,3,2,1,0
.text
    38 00000000 55                      	push	rbp
    39 00000001 4889E5                  	mov	rbp,rsp
    40 00000004 53                      	push	rbx
    42 00000005 48BF-                   	mov	rdi,fmt
    43 00000007 [0800000000000000] 
    44 0000000F 488D3425[00000000]      	lea	rsi,[a]
    45 00000017 488D1425[00000000]      	lea	rdx,[b]
    46 0000001F 488D0C25[00000000]      	lea	rcx,[c]

Options that may allow you to debug
Typical assembly language programs may use only registers,
or may keep most variables just in registers.
Storing variables in memory may be needed for debugging.

This example starts with a small C program, fib.c,
then codes efficient assembly language, fib_64l.asm.
The output, fib_64l.out, shows overflow.
A second version, fib_64m.asm, keeps the variables in memory.

// fib.c  same computation as fib_64l.asm
#include <stdio.h>
int main(int argc, char *argv[])
{
  long int c = 95;  // loop counter
  long int a = 1;   // current number, becomes next
  long int b = 2;   // next number, becomes sum a+b
  long int d;       // temp

  printf("fibinachi numbers\n");
  for(c=c; c!=0; c--)
  {
    printf("%21ld\n",a);
    d = a;
    a = b;
    b = d+b;
  }
}

implement fib.c using registers
; fib_64l.asm  using 64 bit registers to implement fib.c
	global main
	extern printf

	section .data
format:	db '%21ld', 10, 0
title:	db 'fibinachi numbers', 10, 0
	
	section .text
main:
	push rbp 		; set up stack
	mov rdi, title 		; arg 1 is a pointer
	mov rax, 0 		; no vector registers in use
	call printf

	mov rcx, 95 		; rcx will countdown from 95 to 0
	mov rax, 1 		; rax will hold the current number
	mov rbx, 2 		; rbx will hold the next number
print:
	;  We need to call printf, but we are using rax, rbx, and rcx.
	;  printf may destroy rax and rcx so we will save these before
	;  the call and restore them afterwards. rbx is callee-saved,
	;  so printf preserves it.
	push rax 		; 32-bit stack operands are not encodable
	push rcx 		; in 64-bit mode, so we use the "r" names
	mov rdi, format 	; arg 1 is a pointer
	mov rsi, rax 		; arg 2 is the current number
	mov rax, 0 		; no vector registers in use
	call printf
	pop rcx
	pop rax
	mov rdx, rax 		; save the current number
	mov rax, rbx 		; next number is now current
	add rbx, rdx 		; get the new next number
	dec rcx 		; count down
	jnz print 		; if not done counting, do some more

	pop rbp 		; restore stack
	mov rax, 0		; normal exit
	ret


implement fib.c  using memory
; fib_64m.asm  using 64 bit memory more like C code
; // fib.c  same computation as fib_64m.asm
; #include <stdio.h>
; int main(int argc, char *argv[])
; {
;   long int c = 95;  // loop counter
;   long int a = 1;   // current number, becomes next
;   long int b = 2;   // next number, becomes sum a+b
;   long int d;       // temp
;   printf("fibinachi numbers\n");
;   for(c=c; c!=0; c--)
;   {
;     printf("%21ld\n",a);
;     d = a;
;     a = b;
;     b = d+b;
;   }
; }
	global main
	extern printf

	section .bss
d:	resq	1		; temp  unused, kept in register rdx
	
	section .data
c:	dq	95		; loop counter
a:	dq	1		; current number, becomes next
b:	dq	2		; next number, becomes sum a+b


format:	db '%21ld', 10, 0
title:	db 'fibinachi numbers', 10, 0
	
	section .text
main:
	push rbp 		; set up stack
	mov rdi, title 		; arg 1 is a pointer
	mov rax, 0 		; no vector registers in use
	call printf

print:
	;  The variables live in memory, so no registers need saving across printf.
	mov rdi, format 	; arg 1 is a pointer
	mov rsi,[a] 		; arg 2 is the current number
	mov rax, 0 		; no vector registers in use
	call printf

	mov rdx,[a] 		; save the current number, in register
	mov rbx,[b] 		; load the next number
	mov [a],rbx		; next number is now current, in ram
	add rbx, rdx 		; get the new next number
	mov [b],rbx		; store in ram
	mov rcx,[c]		; get loop count
	dec rcx 		; count down, sets the zero flag
	mov [c],rcx		; save in ram, mov does not change flags
	jnz print 		; if not done counting, do some more

	pop rbp 		; restore stack
	mov rax, 0		; normal exit
	ret			; return to operating system

Homework 3 is assigned

Operating Systems use pages
Not a joke.
Operating systems run many processes. See Windows Task Manager,
or on Linux run  top . Typically there are more than 50 processes,
using lots of RAM.
Hard drives, discs, are read and written by sector, not by byte or word.
Operating systems allocate RAM to processes by page, not by byte or word.
Consider a subset of three processes, P1, P2, P3 running in the OS.
Even if only one byte were needed by a .data or .bss segment,
the OS would allocate a page. Large segments may take many pages.
A possible RAM layout, scattered rather than sequential:
Page    user
0-56    OS
57      P3 .data
58      P2 .bss
90-91   P3 .text
92      P1 .bss
93      P3 .bss
94-96   P2 .data
8250    P2 .text
9470    P1 .data
9480    P1 .text

Remember all addresses may start with zero in every segment of
every process. They may be relocated during linking to very
large addresses, yet many processes may get linked to
the same address. OH! How can these processes run at the
same time in the same RAM?
Yes, virtual memory! It has been around for many years.
Efficient virtual memory uses a TLB, Translation Lookaside Buffer.
In the OS there is a Page Table for every process. The OS
keeps a free page list, the pages available to be assigned to
a process when it starts. We saw a cache in the previous
lecture; technically, the TLB is a cache of page table
entries.

The addresses you see in a load map or debugger are virtual
addresses, not the real RAM address.
The virtual address and physical address do not necessarily
have to be the same number of bits. The operation of virtual
memory is to convert a virtual address to a physical address:

          Programmers Virtual Address
  +----------------------------+-------------+
  |    Virtual Page Number VPN | page offset |
  +----------------------------+-------------+
               |                    |
               v                    |
              TLB                   |
               |                    |
               v                    v
    +--------------------------+-------------+
    | Physical Page Number PPN | page offset |
    +--------------------------+-------------+
                RAM Physical Address




Follow the virtual address to ultimately a physical address.

One obvious fact: The page size must be a power of two bytes,
e.g. 4KB. A disc sector is also a power of two bytes; it may be a
different size, as small as 256 bytes.

This is a very simplified example. Actual hardware is much
more complicated. Note the TLB in our first lecture.
Lecture 1 architecture

Lecture 6 Branching and loops

UGH! Note that < and > are interpreted by HTML,
thus source code, physically included, has & gt ; rather than symbol.
Be sure to download from link, not from HTML.

The basic integer compare instruction is  "cmp"
Following this instruction is typically one of:
  JL  label  ; jump on less than  "<"
  JLE label  ; jump on less than or equal "<="
  JG  label  ; jump on greater than ">"
  JGE label  ; jump on greater than or equal ">="
  JE  label  ; jump on equal "=="
  JNE label  ; jump on not equal "!="

After many integer arithmetic instructions
  JZ  label  ; jump on zero
  JNZ label  ; jump on non zero
  JS  label  ; jump on sign set (result negative)
  JNS label  ; jump on sign not set (result zero or positive)

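Note: Use 'cmp' rather than 'sub' for comparison. 'cmp' sets the
flags exactly as 'sub' does but leaves both operands unchanged.
Overflow can occur on the subtraction, resulting in sign inversion,
so follow a signed compare with JL, JLE, JG or JGE (which test the sign
and overflow flags together) rather than with JS or JNS.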

if-then-else in assembly language
Convert a "C" 'if' statement to nasm assembly ifint_64.asm
The significant features are:
1) use a compare instruction for the test
2) put a label on the start of the false branch (e.g. false1:)
3) put a label after the end of the 'if' statement (e.g. exit1:)
4) choose a conditional jump that goes to the false part
5) put an unconditional jump to (e.g. exit1:) at the end of the true part

; ifint_64.asm  code ifint_64.c for nasm 
; /* ifint_64.c an 'if' statement that will be coded for nasm */
; #include <stdio.h>
; int main()
; {
;   long int a=1;
;   long int b=2;
;   long int c=3;
;   if(a<b)
;     printf("true a < b \n");
;   else
;     printf("wrong on a < b \n");
;   if(b>c)
;     printf("wrong on b > c \n");
;   else
;     printf("false b > c \n");
;   return 0;
;}
; result of executing both "C" and assembly is:
; true a < b
; false b > c 
	
	global	main		; define for linker
        extern	printf		; tell linker we need this C function
        section .data		; Data section, initialized variables
a:	dq 1
b:	dq 2
c:	dq 3
fmt1:   db "true a < b ",10,0
fmt2:   db "wrong on a < b ",10,0
fmt3:   db "wrong on b > c ",10,0
fmt4:   db "false b > c ",10,0

	section .text
main:	push	rbp		; set up stack
	mov	rax,[a]		; a
	cmp	rax,[b]		; compare a to b
	jge	false1		; choose jump to false part
	; a < b sign is set
        mov	rdi, fmt1	; printf("true a < b \n"); 
        call    printf	
        jmp	exit1		; jump over false part
false1:	;  a < b is false 
        mov	rdi, fmt2	; printf("wrong on a < b \n");
        call    printf
exit1:				; finished 'if' statement

	mov	rax,[b]		; b
	cmp	rax,[c]		; compare b to c
	jle	false2		; choose jump to false part
	; b > c sign is not set
        mov	rdi, fmt3	; printf("wrong on b > c \n");
        call    printf	
        jmp	exit2		; jump over false part
false2:	;  b > c is false 
        mov	rdi, fmt4	; printf("false b > c \n");
        call    printf
exit2:				; finished 'if' statement

	pop	rbp		; restore stack
	mov	rax,0		; normal, no error, return value
	ret			; return 0;



loop in assembly language
Convert a "C" loop to nasm assembly  loopint_64.asm
The significant features are:
1) "C" long int  is 8-bytes, thus  dd1[1] becomes  qword [dd1+8]
                              dd1[99] becomes  qword [dd1+8*99]

2) "C" long int  is 8-bytes, thus  dd1[i]; i++; becomes  add rdi,8
   since "i" is never stored, the register  rdi  holds "i" scaled by 8

3) the 'cmp' instruction sets flags that control the jump instruction.
   cmp  rdi,8*99   is like  i<99 in "C"
   jne  loop1      jumps if register  rdi  is not  8*99

; loopint_64.asm  code loopint.c for nasm 
; /* loopint_64.c a very simple loop that will be coded for nasm */
; #include <stdio.h>
; int main()
; {
;   long int dd1[100]; // 100 could be 3 gigabytes
;   long int i;        // must be long for more than 2 gigabytes
;   dd1[0]=5; /* be sure loop stays 1..98 */
;   dd1[99]=9;
;   for(i=1; i<99; i++) dd1[i]=7;
;   printf("dd1[0]=%ld, dd1[1]=%ld, dd1[98]=%ld, dd1[99]=%ld\n",
;           dd1[0], dd1[1], dd1[98],dd1[99]);
;   return 0;
;}
; execution output is dd1[0]=5, dd1[1]=7, dd1[98]=7, dd1[99]=9
 
	section	.bss
dd1:	resq	100			; reserve 100 long int
i:	resq	1			; actually unused, kept in register

        section .data			; Data section, initialized variables
fmt:    db "dd1[0]=%ld, dd1[1]=%ld, dd1[98]=%ld, dd1[99]=%ld",10,0
	
        extern	printf			; the C function, to be called

	section .text
	global	main
main:	push	rbp			; set up stack

	mov	qword [dd1],5	   	; dd1[0]=5;  immediate to memory
	mov	qword [dd1+99*8],9 	; dd1[99]=9; indexed 99 qword

	mov 	rdi, 1*8		; i=1; index, will move by 8 bytes
loop1:	mov 	qword [dd1+rdi],7	; dd1[i]=7;
	add	rdi, 8			; i++;  8 bytes 
	cmp	rdi, 8*99		; i<99
	jne	loop1			; loop until incremented i=99
	
	mov	rdi, fmt		; pass address of format
	mov	rsi, qword [dd1]	; dd1[0]   first list parameter
	mov	rdx, qword [dd1+1*8]	; dd1[1]   second list parameter
	mov	rcx, qword [dd1+98*8]	; dd1[98]  third list parameter
	mov	r8,  qword [dd1+99*8]	; dd1[99]  fourth list parameter
	mov	rax, 0			; no xmm used
        call    printf			; Call C function

	pop	rbp			; restore stack
	mov	rax,0			; normal, no error, return value
	ret				; return 0;

	
logic operations in assembly language
Previously, integer arithmetic in "C" was converted to
NASM assembly language. The following is very similar,
a cut and paste of intarith_64.asm to intlogic_64.asm, and
shows the "C" operators "&" and, "|" or, "^" xor, "~" not.

intlogic_64.asm

; intlogic_64.asm    show some simple C code and corresponding nasm code
;                    the nasm code is one sample, not unique
;
; compile:	nasm -f elf64 -l intlogic_64.lst  intlogic_64.asm
; link:		gcc -m64 -o intlogic_64  intlogic_64.o
; run:		./intlogic_64 > intlogic_64.out
;
; the output from running intlogic_64.asm and intlogic.c is
; c=5  , a=3, b=5, c=15
; c=a&b, a=3, b=5, c=1
; c=a|b, a=3, b=5, c=7
; c=a^b, a=3, b=5, c=6
; c=~a , a=3, b=5, c=-4
;
;The file  intlogic.c  is:
;  #include <stdio.h>
;  int main()
;  { 
;    long int a=3, b=5, c;
;
;    c=15;
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=5  ", a, b, c);
;    c=a&b; /* and */
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=a&b", a, b, c);
;    c=a|b; /* or */
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=a|b", a, b, c);
;    c=a^b; /* xor */
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=a^b", a, b, c);
;    c=~a;  /* not */
;    printf("%s, a=%ld, b=%ld, c=%ld\n","c=~a ", a, b, c);
;    return 0;
; }

        extern printf		; the C function to be called

%macro	pabc 1			; a "simple" print macro
	section .data
.str	db	%1,0		; %1 is first actual in macro call
	section .text
        mov	rdi, fmt        ; address of format string
	mov	rsi, .str 	; users string
	mov	rdx, [a]	; long int a
	mov	rcx, [b]	; long int b 
	mov	r8, [c]		; long int c
	mov     rax, 0	        ; no xmm used
        call    printf          ; Call C function
%endmacro
	
	section .data  		; preset constants, writeable
a:	dq	3		; 64-bit variable a initialized to 3
b:	dq	5		; 64-bit variable b initialized to 5
fmt:    db "%s, a=%ld, b=%ld, c=%ld",10,0 ; format string for printf
	
	section .bss 		; uninitialized space
c:	resq	1		; reserve a 64-bit word

	section .text		; instructions, code segment
	global	 main		; for gcc standard linking
main:				; label
	push	rbp		; set up stack
	
lit5:				; c=15;  (the printed label says "c=5  ")
	mov	rax,15	 	; 15 is a literal constant
	mov	[c],rax		; store into c
	pabc	"c=5  "		; invoke the print macro
	
andb:				; c=a&b;
	mov	rax,[a]	 	; load a
	and	rax,[b]		; and with b
	mov	[c],rax		; store into c
	pabc	"c=a&b"		; invoke the print macro
	
orw:				; c=a|b;
	mov	rax,[a]	 	; load a
	or	rax,[b]		; logical or with b
	mov	[c],rax		; store into c
	pabc	"c=a|b"		; invoke the print macro
	
xorw:				; c=a^b;
	mov	rax,[a]	 	; load a
	xor	rax,[b] 	; exclusive or with b
	mov	[c],rax		; store result in c
	pabc	"c=a^b"		; invoke the print macro
	
notw:				; c=~a;
	mov	rax,[a]	 	; load a
	not	rax	 	; not, complement
	mov	[c],rax		; store result into c
	pabc	"c=~a "		; invoke the print macro

	pop	rbp		; restore stack
	mov     rax,0           ; exit code, 0=normal
	ret			; main returns to operating system


loops in assembly language
One significant use of loops is to evaluate polynomials and
convert numbers from one base to another.
(Yes, this is related to project 1 for CMPE 310)

The following program has three loops.

Loop3 (h3loop) uses Horner's method to evaluate a polynomial,
       using 'rdi' as an index, 'rcx' and 'loop' to do the loop.
       a_n is first in the array, n=4.

Loop4 (h4loop) uses Horner's method, with data order optimized,
      using 'rcx' as both index and loop counter, to get a
      three instruction loop.
      a_0 is first in the array, n=4.

Loop5 (h5loop) uses Horner's method to evaluate a polynomial
      using double precision floating point. Note the 8 byte
      increment and the quad word moved to xmm0 for printf.


Horner's method to evaluate polynomials in assembly language
Study horner_64.asm to understand
the NASM coding of the loops.

; horner_64.asm  Horners method of evaluating polynomials
;
; given a polynomial  Y = a_n X^n + a_n-1 X^n-1 + ... a_1 X + a_0
; a_n is the coefficient 'a' with subscript n. X^n is X to nth power
; compute y_1 = a_n * X + a_n-1
; compute y_2 = y_1 * X + a_n-2
; compute y_i = y_i-1 * X + a_n-i   i=3..n
; thus    y_n = Y = value of polynomial 
;
; in assembly language:
;   load some register with a_n, multiply by X
;   add a_n-1, multiply by X, add a_n-2, multiply by X, ...
;   finishing with the add  a_0
;
; output from execution:
; a  6319
; aa 6319
; af 6.319000e+03

	extern	printf
	section	.data
	global	main

	section	.data
fmta:	db	"a  %ld",10,0
fmtaa:	db	"aa %ld",10,0
fmtflt:	db	"af %e",10,0

	section	.text
main:	push	rbp		; set up stack

; evaluate an integer polynomial, X=7, using a count

	section	.data
a:	dq	2,5,-7,22,-9	; coefficients of polynomial, a_n first
X:	dq	7		; X = 7
				; n=4, 8 bytes per coefficient
	section	.text
	mov	rax,[a]		; accumulate value here, get coefficient a_n
	mov	rdi,1		; subscript initialization
	mov	rcx,4		; loop iteration count initialization, n
h3loop:	imul	rax,[X]		; * X     (two-operand imul leaves rdx alone)
	add	rax,[a+8*rdi]	; + a_n-i
	inc	rdi		; increment subscript
	loop	h3loop		; decrement rcx, jump on non zero

	mov	rsi, rax	; print rax
	mov	rdi, fmta	; format
	mov	rax, 0		; no float
	call	printf


; evaluate an integer polynomial, X=7, using a count as index
; optimal organization of data allows a three instruction loop
	
	section	.data
aa:	dq	-9,22,-7,5,2	; coefficients of polynomial, a_0 first
n:	dq	4		; n=4, 8 bytes per coefficient
	section	.text
	mov	rax,[aa+4*8]	; accumulate value here, get coefficient a_n
	mov	rcx,[n]		; loop iteration count initialization, n
h4loop:	imul	rax,[X]		; * X     (two-operand imul leaves rdx alone)
	add	rax,[aa+8*rcx-8]; + aa_n-i
	loop	h4loop		; decrement rcx, jump on non zero

	mov	rsi, rax	; print rax
	mov	rdi, fmtaa	; format
	mov	rax, 0		; no float
	call	printf

; evaluate a double floating polynomial, X=7.0, using a count as index
; optimal organization of data allows a three instruction loop
	
	section	.data
af:	dq	-9.0,22.0,-7.0,5.0,2.0	; coefficients of polynomial, a_0 first
XF:	dq	7.0
Y:	dq	0.0
N:	dq	4		; 64-bit, because it is loaded into rcx

	section	.text
	mov	rcx,[N]		; loop iteration count initialization, n
	fld	qword [af+8*rcx]; accumulate value here, get coefficient a_n
h5loop:	fmul	qword [XF]	; * XF
	fadd	qword [af+8*rcx-8] ; + af_n-i
	loop	h5loop		; decrement rcx, jump on non zero

	fstp	qword [Y]	; store Y in order to print Y
	movq	xmm0, qword [Y]	; printf takes its first double in xmm0
	mov	rdi, fmtflt	; format
	mov	rax, 1		; one float
	call	printf

	pop	rbp		; restore stack
	mov	rax,0		; normal return
	ret			; return

A "C" version with same data, slightly different code sequence.

// horner_64.c long integer and double Horners method of evaluating polynomials
//             everything 64-bit
// given a polynomial  Y = a_n X^n + a_n-1 X^n-1 + ... a_1 X + a_0
// a_n is the coefficient 'a' with subscript n. X^n is X to nth power
// compute y_1 = a_n * X + a_n-1
// compute y_2 = y_1 * X + a_n-2
// compute y_i = y_i-1 * X + a_n-i   i=3..n
// thus    y_n = Y = value of polynomial 

 #include <stdio.h>
int main(int argc, char *argv[])
{
  long int a[]  = {2, 5, -7, 22, -9}; // a_n first
  long int aa[] = {-9, 22, -7, 5, 2}; // aa_0 first
  double af[]   = {-9.0, 22.0, -7.0, 5.0, 2.0}; // af_0 first
  long int n    = 4;
  long int X, Y;
  double XF, YF; 
  long int i;

  // evaluate an integer polynomial a, X=7, using a_n first, count n
  X = 7;
  Y = a[0]*X + a[1];
  for(i=2; i<=n; i++) Y = Y*X + a[i];
  printf("a  %ld\n", Y);

  // evaluate an integer polynomial aa , X=7, using a_0 first, count n
  X = 7;
  Y = aa[n]*X + aa[n-1];
  for(i=n-2; i>=0; i--) Y = Y*X + aa[i];
  printf("aa %ld\n", Y);

  // evaluate a double floating polynomial, X=7.0, using af_0 first, n
  XF = 7.0;
  YF = af[n]*XF + af[n-1];
  for(i=n-2; i>=0; i--) YF = YF*XF + af[i];
  printf("af %e\n", YF);

  return 0;
}

Same output:
a  6319
aa 6319
af 6.319000e+03

serial vs parallel, slow vs fast
Multiply hardware, serial

Multiply hardware, parallel

Then for wiring ground and power

Possibly many mask layers

Many complete chips are baked on a wafer

Lecture 7 Subroutines

Pass an array and change the array in assembly language.
Be safe, use a header file, .h, in the "C" code.
test_call1_64.c
test_call1_64.s
call1_64.h
call1_64.c
call1_64.s
call1_64.asm
test_call1_64.out

// test_call1_64.c   test  call1_64.asm 
// nasm -f elf64 -l call1_64.lst call1_64.asm
// gcc -m64 -o test_call1_64 test_call1_64.c call1_64.o
// ./test_call1_64 > test_call1_64.out

 #include "call1_64.h"
 #include <stdio.h>
 int main()
 {
   long int L[2];
   printf("test_call1_64.c using call1_64.asm\n");
   L[0]=1;
   L[1]=2;
   printf("address of L=&L[0]=%ld, &L[1]=%ld \n", (long int)&L[0], (long int)&L[1]);
   call1_64(L); // add 3 to L[0], add 4 to L[1]
   printf("L[0]=%ld, L[1]=%ld \n", L[0], L[1]);
   return 0;
 }

; call1_64.asm  a basic structure for a subroutine to be called from "C"
;
; Parameter:   long int *L
; Result: L[0]=L[0]+3  L[1]=L[1]+4

        global call1_64		; linker must know name of subroutine

        extern	printf		; the C function, to be called for demo

        SECTION .data		; Data section, initialized variables
fmt1:    db "rdi=%ld, L[0]=%ld", 10, 0	; The printf format, 10 is "\n", 0 ends string
fmt2:    db "rdi=%ld, L[1]=%ld", 10, 0	; The printf format, 10 is "\n", 0 ends string

	SECTION .bss
a:	resq	1		; temp for printing

	SECTION .text           ; Code section.

call1_64:			; name must appear as a nasm label
        push	rbp		; save rbp
        mov	rbp, rsp	; rbp is callers stack
        sub	rsp, 8		; keep rsp 16-byte aligned at the printf calls
        push	rdx		; save registers
        push	rdi
	push	rsi

	mov	rax,rdi		; first, only, in parameter
	mov	[a],rdi		; save for later use

	mov	rdi,fmt1	; format for printf debug, demo
	mov	rsi,rax         ; first parameter for printf
	mov	rdx,[rax]	; second parameter for printf
	mov	rax,0		; no xmm registers
        call    printf		; Call C function	

	mov	rax,[a]		; first, only, in parameter, demo
	mov	rdi,fmt2	; format for printf
	mov	rsi,rax         ; first parameter for printf
	mov	rdx,[rax+8]	; second parameter for printf
	mov	rax,0		; no xmm registers
        call    printf		; Call C function	

	mov	rax,[a]		; add 3 to L[0]
	mov	rdx,[rax]	; get L[0]
	add	rdx,3		; add
	mov	[rax],rdx	; store sum for caller

	mov	rdx,[rax+8]	; get L[1]
	add	rdx,4		; add
	mov	[rax+8],rdx	; store sum for caller

        pop	rsi		; restore registers
	pop	rdi		; in reverse order
        pop	rdx
        mov	rsp,rbp		; restore callers stack frame
        pop	rbp
        ret			; return
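
The files call1_64.h and call1_64.c are linked above but not listed here.
As a rough guide only, a "C" function with the same effect as call1_64.asm,
without the debug printing, could look like the sketch below.
The name call1_sketch is an assumption, chosen so it does not clash
with the real call1_64.

```c
// Sketch of a "C" equivalent of call1_64.asm, without the debug printf calls
// A header file would hold just the prototype:  void call1_sketch(long int *L);
void call1_sketch(long int *L)
{
  L[0] = L[0] + 3;   // same as  add rdx,3  on [rax]
  L[1] = L[1] + 4;   // same as  add rdx,4  on [rax+8], a long int is 8 bytes
}
```

Starting from L[0]=1 and L[1]=2 as in test_call1_64.c, the array should
hold 4 and 6 after the call.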


A small change to use a double array, floating point
test_callf1_64.c
callf1_64.h
callf1_64.c
callf1_64.asm
test_callf1_64.out

; callf1_64.asm  a basic structure for a subroutine to be called from "C"
;
; Parameters:   double *L
; Result: L[0]=L[0]+3.0  L[1]=L[1]+4.0

        global callf1_64	; linker must know name of subroutine

        extern	printf		; the C function, to be called for demo

        SECTION .data		; Data section, initialized variables
fmt1:	db "rdi=%ld, L[0]=%e", 10, 0	; The printf format, 10 is "\n", 0 ends string
fmt2:	db "rdi=%ld, L[1]=%e", 10, 0	; The printf format, 10 is "\n", 0 ends string
a3:	dq	3.0		; 64-bit variable a3 initialized to 3.0
a4:	dq	4.0		; 64-bit variable a4 initialized to 4.0

	SECTION .bss
a:	resq	1		; temp for saving address

	SECTION .text           ; Code section.

callf1_64:			; name must appear as a nasm label
        push	rbp		; save rbp
        mov	rbp, rsp	; rbp is callers stack
        push	rdx		; save registers
        push	rdi
	push	rsi

	mov	rax,rdi		; first, only, in parameter
	mov	[a],rdi		; save for later use

;	mov	rdi,fmt1	; format for printf debug, demo
;	mov	rsi,rax         ; first parameter for printf
;       movq    xmm0, qword [rax] ; second parameter for printf
;	mov	rax,1		; one xmm registers
;	call    printf		; Call C function	

;	mov	rax,[a]		; first, only, in parameter, demo
;	mov	rdi,fmt2	; format for printf
;	mov	rsi,rax         ; first parameter for printf
;	movq    xmm0, qword [rax+8] ; second parameter for printf
;	mov	rax,1		; one xmm registers
;	call    printf		; Call C function	

	mov	rax,[a]		; add 3.0 to L[0]
	fld	qword [rax] 	; load L[0] (pushed on flt pt stack, st0)
	fadd	qword [a3]	; floating add 3.0 (to st0)
	fstp	qword [rax]	; store into L[0] (pop flt pt stack)

	fld	qword [rax+8] 	; load L[1] (pushed on flt pt stack, st0)
	fadd	qword [a4]	; floating add 4.0 (to st0)
	fstp	qword [rax+8]	; store into L[1] (pop flt pt stack)

        pop	rsi		; restore registers
	pop	rdi		; in reverse order
        pop	rdx
        mov	rsp,rbp		; restore callers stack frame
        pop	rbp
        ret			; return


	



Here is another basic subroutine (function, procedure, etc)
Note passing parameters.
Note saving and restoring the callers registers.
(Yes, this is needed for CMPE 310 project 2)

Now, to pass more arguments, call2_64.c
can be implemented as call2_64.asm
Both tested using test_call2_64.c
Using prototype call2_64.h
Output is test_call2_64.out

Note passing arrays including strings is via address,
     passing scalar values is via passing values.

; call2_64.asm  code call2_64.c for nasm for test_call2_64.c 
; // call2_64.c a very simple loop that will be coded for nasm
; void call2_64(long int *A, long int start, long int end, long int value)
; {
;   long int i;
;
;   for(i=start; i<=end; i++) A[i]=value;
;   return;
; }
;
; execution output is dd1[0]=5, dd1[1]=7, dd1[98]=7, dd1[99]=9
 
	section	.bss
i:	resd	1		; actually unused, kept in register rax

	section .text
	global	call2_64	; linker must know name of subroutine
call2_64:			; name must appear as a nasm label
        push	rbp		; save rbp
        mov	rbp, rsp	; rbp is callers stack
        push	rax		; save registers (overkill)
	push	rbx
	push	rcx
	push	rdx
				; know address or value from prototype
;	mov	rdi,rdi		; get address of A into rdi (default)
        mov	rax,rsi		; get value of start
        mov	rbx,rdx		; get value of end
        mov	rdx,rcx		; get value of value
	
loop1:	mov	[8*rax+rdi],rdx	; A[i]=value;
	add	rax,1		; i++;
	cmp	rax,rbx		; i<=end
	jle	loop1		; loop while i<=end

	pop	rdx		; in reverse order
	pop	rcx
	pop	rbx
	pop	rax
        mov	rsp,rbp		; restore callers stack frame
        pop	rbp
        ret			; return
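
To see the address versus value rule at work, here is the "C" body from
the comment above together with a small driver. The array name dd1 and the
values 5, 7 and 9 come from the output line in that comment. The file
test_call2_64.c is not listed here, so start=1, end=98 and the helper name
call2_demo are assumptions that merely reproduce that output.

```c
// call2_64.c body as given in the comment of call2_64.asm
void call2_64(long int *A, long int start, long int end, long int value)
{
  long int i;
  for(i=start; i<=end; i++) A[i]=value;
}

// Hypothetical driver: returns dd1[0]+dd1[1]+dd1[98]+dd1[99]
long int call2_demo(void)
{
  long int dd1[100];
  long int value = 7;            // scalar, passed by value
  dd1[0]  = 5;
  dd1[99] = 9;
  call2_64(dd1, 1, 98, value);   // array, passed by address, gets changed
  return dd1[0] + dd1[1] + dd1[98] + dd1[99];  // 5+7+7+9
}
```

The array is changed by the call because only its address is passed.
The scalar value in the caller is untouched, the function gets a copy.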




A simple program with a simple function, 
called and written in the same .asm file
intfunct_64.asm

; intfunct_64.asm  this is a main and a function in one file
;                  call integer function  long int sum(long int x, long int y) 
; compile:	nasm -f elf64 -l intfunct_64.lst intfunct_64.asm 
; link:		gcc -m64 -o intfunct_64 intfunct_64.o
; run:		./intfunct_64 > intfunct_64.out
; view:         cat intfunct_64.out
; result:	5 = sum(2,3)

	extern	printf
	section .data
x:	dq	2
y:	dq	3
fmt:	db	"%ld = sum(%ld,%ld)",10,0

	section .bss
z:	resq	1
	
	section .text
	global	main
main:	push	rbp		; set up stack

	mov	rdi, [x]	; pass arguments for sum
	mov	rsi, [y]
	call	sum		; coded below
	mov	[z],rax		; save result from sum

	mov	rdi, fmt	; print
	mov	rsi, [z]
	mov	rdx, [x]	; yes, rdx comes before rcx
	mov	rcx, [y]
	mov	rax, 0		; no float or double
	call	printf

	pop	rbp		; restore stack
	mov	rax,0
	ret
; end main

sum:	; function long int sum(long int x, long int y)
	; so simple, do not need to save anything

	mov	rax,rdi		; get argument x
	add	rax,rsi		; add argument y, x+y result in rax
	ret			; return value in rax
; end of function sum
; end intfunct_64.asm


A simple demonstration of using a double sin(double x) function
from the "C" math.h  fltfunct_64.asm
Note xmm 128-bit registers are used to pass parameters.

; fltfunct_64.asm  call math routine  double sin(double x) 
; compile:	nasm -f elf64 fltfunct_64.asm 
; link:		gcc -m64 -o fltfunct_64 fltfunct_64.o -lm # needs math library
; run:		./fltfunct_64 > fltfunct_64.out
; view:		cat fltfunct_64.out
	
	extern	sin		; must extern all library functions
	extern	printf
	section .data
x:	dq	0.7853975	; Pi/4 = 45 degrees
y:	dq	0.0		; should be about 7.07E-1
fmt:	db	"y= %e = sin(%e)",10,0

	section .text
	global	main
main:	push	rbp		; set up stack

	movq	xmm0, qword [x]	; pass argument to sin()
	call	sin		; all "C" math uses double
	movq	qword [y], xmm0	; save result

	mov	rdi, fmt	; print
	movq	xmm0, qword [y]
	movq	xmm1, qword [x]
	mov	rax,2		; 2 doubles
	call	printf

	pop	rbp		; restore stack
	mov	rax,0		; no error return
	ret			; return to operating system

; result: y= 7.071063e-01 = sin(7.853975e-01)

Now, if you want to see why I teach NASM rather than gas,
see fltfunct_64.c
// fltfunct_64.c  call math routine  double sin(double x) 
 #include <stdio.h>
 #include <math.h>
 int main()
 {
   double x = 0.7853975;	// Pi/4 = 45 degrees
   double y;
   y = sin(x);
   printf("y= %e = sin(%e)\n", y, x);
   return 0;
 }

See gas assembly language, gcc -m64 -S fltfunct_64.c makes fltfunct_64.s
	.file	"fltfunct_64.c"
	.section	.rodata
.LC1:
	.string	"y= %e = sin(%e)\n"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$32, %rsp
	movabsq	$4605249451321951854, %rax
	movq	%rax, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, -24(%rbp)
	movsd	-24(%rbp), %xmm0
	call	sin
	movsd	%xmm0, -24(%rbp)
	movq	-24(%rbp), %rax
	movq	%rax, -16(%rbp)
	movq	-8(%rbp), %rdx
	movq	-16(%rbp), %rax
	movq	%rdx, -24(%rbp)
	movsd	-24(%rbp), %xmm1
	movq	%rax, -24(%rbp)
	movsd	-24(%rbp), %xmm0
	movl	$.LC1, %edi
	movl	$2, %eax
	call	printf
	movl	$0, %eax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (GNU) 4.8.2 20140120 (Red Hat 4.8.2-16)"
	.section	.note.GNU-stack,"",@progbits


And a final example of a simple recursive function, factorial,
written in optimized assembly language following the "C" code.
main        test_factorial_long.c
"C" version factorial_long.c
"C" header  factorial_long.h
nasm code   factorial_long.asm
output      test_factorial_long.out

// test_factorial_long.c  the simplest example of a recursive function
//                   a recursive function is a function that calls itself
// external
// long int factorial_long(long int n) // n! is n factorial = 1*2*...*(n-1)*n
// {
//   if( n <= 1 ) return 1;            // must have a way to stop recursion
//   return n * factorial_long(n-1);   // factorial calls factorial with n-1
// }                                   // n * (n-1) * (n-2) * ... * (1) 
 #include "factorial_long.h"
 #include <stdio.h>

int main()
{
  printf("test_factorial_long.c using long int, note overflow\n");
  printf(" 0!=%ld \n", factorial_long(0));   // Yes, 0! is one
  printf(" 1!=%ld \n", factorial_long(1l));  // Yes, "C" would convert 1 to 1l
  printf(" 2!=%ld \n", factorial_long(2l));  // because of function prototype
  printf(" 3!=%ld \n", factorial_long(3l));  // coming from factorial_long.h
  printf(" 4!=%ld \n", factorial_long(4l));
  printf(" 5!=%ld \n", factorial_long(5l));
  printf(" 6!=%ld \n", factorial_long(6l));
  printf(" 7!=%ld \n", factorial_long(7l));
  printf(" 8!=%ld \n", factorial_long(8l));
  printf(" 9!=%ld \n", factorial_long(9l));
  printf("10!=%ld \n", factorial_long(10l));
  printf("11!=%ld \n", factorial_long(11l));
  printf("12!=%ld \n", factorial_long(12l));
  printf("13!=%ld \n", factorial_long(13l));
  printf("14!=%ld \n", factorial_long(14l));
  printf("15!=%ld \n", factorial_long(15l));
  printf("16!=%ld \n", factorial_long(16l));
  printf("17!=%ld \n", factorial_long(17l));
  printf("18!=%ld \n", factorial_long(18l));
  printf("19!=%ld \n", factorial_long(19l));
  printf("20!=%ld \n", factorial_long(20l));
  printf("21!=%ld \n", factorial_long(21l)); /* expect a problem with */
  printf("22!=%ld \n", factorial_long(22l)); /* integer overflow      */

  return 0;
}

/* output of execution, with comments, is:
test_factorial_long.c using long int, note overflow
 0!=1 
 1!=1 
 2!=2 
 3!=6 
 4!=24 
 5!=120 
 6!=720 
 7!=5040 
 8!=40320 
 9!=362880 
10!=3628800 
11!=39916800 
12!=479001600 
13!=6227020800 
14!=87178291200 
15!=1307674368000 
16!=20922789888000 
17!=355687428096000 
18!=6402373705728000 
19!=121645100408832000 
20!=2432902008176640000 
21!=-4249290049419214848 wrong and obvious if you check your results 
22!=-1250660718674968576 
*/
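
The negative values for 21! and 22! come from overflow of a 64-bit signed
integer. One way to catch this is to check before each multiply.
The sketch below is iterative rather than recursive, assumes long int is
64 bits as on the Linux systems used here, and uses the made-up name
factorial_checked; it is not one of the course files.

```c
#include <limits.h>

// Sketch: returns n! or -1 when the result would not fit in a long int
long int factorial_checked(long int n)
{
  long int result = 1;
  long int i;
  for(i=2; i<=n; i++)
  {
    if( result > LONG_MAX / i ) return -1;  // result*i would overflow
    result = result * i;
  }
  return result;
}
```

With a 64-bit long int, 20! is the last value that fits, so
factorial_checked(21) should return -1 instead of a wrong negative number.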

; factorial_long.asm  test with test_factorial_long.c main program
; // factorial_long.c  the simplest example of a recursive function
; //                   a recursive function is a function that calls itself
; long int factorial_long(long int n) // n! is n factorial = 1*2*...*(n-1)*n
; {
;   if( n <= 1 ) return 1;            // must have a way to stop recursion
;   return n * factorial_long(n-1);   // factorial calls factorial with n-1
; }				      // n * (n-1) * (n-2) * ... * (1)
;                                     // note: "C" makes 1 be 1l  long
	global	factorial_long
	section .text
factorial_long:			; extremely efficient version
	mov	rax, 1		; default return
	cmp	rdi, 1		; compare n to 1
	jle	done

	; normal case n * factorial_long(n-1)
	push	rdi		; save n for multiply
	sub	rdi, 1		; n-1
	call	factorial_long	; rax has factorial_long(n-1)
	pop	rdi
	imul	rax,rdi		; signed multiply by the saved n
				; n*factorial_long(n-1) in rax
done:	ret			; return with result in rax
; end factorial_long.asm


Project 2 is assigned

Need more RAM?
4GB RAM

Lecture 8 Boot programs and 16-bit


A sample of a basic stand alone bootable program for NASM is boot1.asm

; boot1.asm   stand alone program for floppy boot sector
; Compiled using            nasm -f bin boot1.asm
; Written to floppy with    sudo dd if=boot1 of=/dev/fd0
; Written to USB drive      sudo dd if=boot1 of=/dev/sdb

; Boot record is loaded at 0000:7C00,
	ORG 7C00h
; load message address into SI register:
	LEA SI,[msg]
; screen function:
	MOV AH,0Eh
print:  MOV AL,[SI]         
	CMP AL,0         
	JZ done		; zero byte at end of string
	INT 10h		; write character to screen.    
     	INC SI         
	JMP print

; wait for 'any key':
done:   MOV AH,0       
    	INT 16h		; waits for key press
			; AL is ASCII code or zero
			; AH is keyboard code

; store magic value at 0040h:0072h to reboot:
;		0000h - cold boot.
;		1234h - warm boot.
	MOV  AX,0040h
	MOV  DS,AX
	MOV  word[0072h],0000h   ; cold boot.
	JMP  0FFFFh:0000h	 ; reboot!

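msg 	DB  'Welcome, I have control of the computer.',13,10
	DB  'Press any key to reboot.',13,10
	DB  '(after removing the floppy)',13,10,0

	TIMES 510-($-$$) DB 0	; pad the boot sector to 510 bytes
	DW 0AA55h		; boot signature in last two bytes, checked by many BIOSes
; end boot1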

This program could be extended to find or verify the keycodes
that are input (not all keys have ASCII codes).

One keyboard has the following ASCII and keycodes ascii.txt

American Standard Code for Information Interchange, ASCII
(with keycodes for a particular 104 key keyboard)
  dec is decimal value
  hex is 8-bit hexadecimal value
  key is 104-key PC keyboard keycode in hexadecimal
  type means how to type character (shift not shown) C- for hold control down
  def  is control character definition, e.g. LF line feed, FF form feed,
       CR carriage return, BS back space,
                                          
dec hex key type def   dec hex key type   dec hex key type   dec hex key type
  0  00  13 C-@  NULL   32  20  5E space   64  40  13 @       96  60  11 `
  1  01  3C C-A  SOH    33  21  12 !       65  41  3C A       97  61  3C a
  2  02  50 C-B  STX    34  22  46 "       66  42  50 B       98  62  50 b
  3  03  4E C-C  ETX    35  23  14 #       67  43  4E C       99  63  4E c
  4  04  3E C-D  EOT    36  24  15 $       68  44  3E D      100  64  3E d
  5  05  29 C-E  ENQ    37  25  16 %       69  45  29 E      101  65  29 e
  6  06  3F C-F  ACK    38  26  18 &       70  46  3F F      102  66  3F f
  7  07  40 C-G  BEL    39  27  46 '       71  47  40 G      103  67  40 g
  8  08  41 C-H  BS     40  28  1A (       72  48  41 H      104  68  41 h
  9  09  2E C-I  HT     41  29  1B )       73  49  2E I      105  69  2E i
 10  0A  42 C-J  LF     42  2A  19 *       74  4A  42 J      106  6A  42 j
 11  0B  43 C-K  VT     43  2B  1D +       75  4B  43 K      107  6B  43 k
 12  0C  44 C-L  FF     44  2C  53 ,       76  4C  44 L      108  6C  44 l
 13  0D  52 C-M  CR     45  2D  1C -       77  4D  52 M      109  6D  52 m
 14  0E  51 C-N  SO     46  2E  54 .       78  4E  51 N      110  6E  51 n
 15  0F  2F C-O  SI     47  2F  55 /       79  4F  2F O      111  6F  2F o
 16  10  30 C-P  DLE    48  30  1B 0       80  50  30 P      112  70  30 p
 17  11  27 C-Q  DC1    49  31  12 1       81  51  27 Q      113  71  27 q
 18  12  2A C-R  DC2    50  32  13 2       82  52  2A R      114  72  2A r
 19  13  3D C-S  DC3    51  33  14 3       83  53  3D S      115  73  3D s
 20  14  2B C-T  DC4    52  34  15 4       84  54  2B T      116  74  2B t
 21  15  2D C-U  NAK    53  35  16 5       85  55  2D U      117  75  2D u
 22  16  4F C-V  SYN    54  36  17 6       86  56  4F V      118  76  4F v
 23  17  28 C-W  ETB    55  37  18 7       87  57  28 W      119  77  28 w
 24  18  4D C-X  CAN    56  38  19 8       88  58  4D X      120  78  4D x
 25  19  2C C-Y  EM     57  39  1A 9       89  59  2C Y      121  79  2C y
 26  1A  4C C-Z  SUB    58  3A  45 :       90  5A  4C Z      122  7A  4C z
 27  1B  31 C-[  ESC    59  3B  45 ;       91  5B  31 [      123  7B  31 {
 28  1C  33 C-\  FS     60  3C  53 <       92  5C  33 \      124  7C  33 |
 29  1D  32 C-]  GS     61  3D  1D =       93  5D  32 ]      125  7D  32 }
 30  1E  17 C-^  RS     62  3E  54 >       94  5E  17 ^      126  7E  11 ~
 31  1F  1C C-_  US     63  3F  55 ?       95  5F  1C _      127  7F  34 delete
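
The table shows bit patterns that are handy in assembly language:
a control character is the letter's code with the upper bits cleared
(C-A is 1, A is 65), and lower case is upper case plus 32 (hex 20).
A small sketch, with helper names invented for this example:

```c
// Sketch: bit patterns visible in the ASCII table (helper names are illustrative)
int ctrl_code(int c)   { return c & 0x1F; }   // 'A' (0x41) -> 0x01, '[' (0x5B) -> 0x1B ESC
int to_lower(int c)    { return c | 0x20; }   // 'A' (0x41) -> 'a' (0x61), letters only
int to_upper(int c)    { return c & ~0x20; }  // 'a' (0x61) -> 'A' (0x41), letters only
int digit_value(int c) { return c - '0'; }    // '7' (0x37) -> 7
```

The same operations are a single and, or, or sub instruction in NASM.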

Additional key codes (most have no ASCII)[must track shift-up, shift-down etc.]
  key type        key type          key type             key type
  01  ESCAPE      10  PAUSE          39 keypad 9 PAGE UP  5D LEFT ALT
  02  F1          1E  BACKSPACE      3A keypad +          5E SPACE
  03  F2          1F  INSERT         3B CAPS LOCK         5F RIGHT ALT
  04  F3          20  HOME           47 ENTER             60 RIGHT CTRL
  05  F4          21  PAGE UP        48 keypad 4 LEFT     61 LEFT ARROW
  06  F5          22  NUM LOCK       49 keypad 5          62 DOWN ARROW
  07  F6          23  keypad /       4A keypad 6 RIGHT    63 RIGHT ARROW
  08  F7          24  keypad *       4B LEFT SHIFT        64 keypad 0 INS
  09  F8          25  keypad -       56 RIGHT SHIFT       65 keypad . DEL
  0A  F9          26  TAB            57 UP ARROW          66 LEFT WINDOWS
  0B  F10         34  DELETE         58 keypad 1 END      67 RIGHT WINDOWS
  0C  F11         35  END            59 keypad 2 DOWN     68 APPLICATION
  0D  F12         36  PAGE DOWN      5A keypad 3 PAGE DN  7E SYS REQ
  0E  PRT SCRN    37  keypad 7 HOME  5B keypad ENTER      7F BREAK
  0F  SCROLL LOCK 38  keypad 8 UP    5C LEFT CTRL

Sample program to print key  called keysym_num
     test_keysym.py
from tkinter import *          # Python 3 name, Python 2 used Tkinter
root=Tk()
root.title('test_keysym.py')
print('test_keysym.py window gets and shows text')

def reportEvent(event):
  print('keysym=%s, keysym_num=%s' % (event.keysym, event.keysym_num))

text=Text(root,width=100,height=5,highlightthickness=2)
text.bind('<KeyPress>', reportEvent)

text.pack(expand=1, fill='both')
text.focus_set()
root.mainloop()

Output from typing some keys 
test_keysym.py window gets and shows text
keysym=a, keysym_num=97
keysym=Shift_L, keysym_num=65505
keysym=A, keysym_num=65
keysym=z, keysym_num=122
keysym=Shift_L, keysym_num=65505
keysym=Z, keysym_num=90
keysym=0, keysym_num=48
keysym=9, keysym_num=57
keysym=Return, keysym_num=65293

Additional information on keycodes and keysym are here

Now you may wish to download another self booting program,
memtest.bin, a binary program.
If you can get this file, undamaged, onto your computer running
Linux, then you can write it to a floppy disk or flash drive:

  dd if=memtest.bin of=/dev/fd0  # if you happen to have a floppy disc
  dd if=memtest.bin of=/dev/sdb  # or other device for flash drive

Then do a safe shutdown.
Reboot your computer from the power off state.
You should see information about your computer.
e.g. clock speed, type of CPU, cache sizes, RAM size,
and it will run a very thorough memory test on your RAM.

You will not be able to run a bootable floppy on a UMBC
Intel PC because the BIOS should be set to not boot from
a floppy and the BIOS should be password protected, so you
can not change the BIOS. The machine is probably secured
so you can not get in and change the BIOS chip. 

More on bootable floppies is at nasm boot info

For our lab, using assembler on Windows

; part1m.asm
; Make the LCD display my name
	
BITS 16
CPU 8086

section CONSTSEG USE16 ALIGN=16 CLASS=CONST
Data:
myName: 	db " my name "
nameLen:  	equ $ - myName

section PROGRAM USE16 ALIGN=16 CLASS=CODE

    ..start:

name:

	mov   ax, 1		; code for display
	mov   bx, myName   	; string address
	mov   cx, nameLen    	; string length
	mov   dx, 0    		; code
	mov   si, ds		; address space
	int   10H		; call BIOS


www.dosbox.com
DOSBox tutorial board.pdf

More on BIOS

Lecture 9 Syscall and BIOS calls

Assembly language can run with no C compiler and no OS
Special hardware may need assembly language

A first program that does not need a C compiler, that uses the
Operating System calls.

hellos_64.asm

;  ------------------------------------------------------------------------
;  hellos_64.asm
;  Writes "Hello World" to the console using only system calls.
;  Runs on 64-bit Linux only.
;  To assemble and run: using single Linux command
;
;  nasm -f elf64 hellos_64.asm && ld hellos_64.o && ./a.out   
;
;  -------------------------------------------------------------------------
	global  _start        ; standard ld main program

	section .data	      ; data section
msg:	db "Hello World",10   ; the string to print, newline 10
len:	equ $-msg	      ; "$" means "here"
			      ; len is a value, not an address

	section .text     
_start:
	;       write(1, msg, len)  equivalent system command
	mov  	rax, 1	      ; system call 1 is write
	mov  	rdi, 1	      ; file handle 1 is stdout
	mov  	rsi, msg      ; address of string to output
	mov	rdx, len      ; number of bytes, computed as 12
	syscall		      ; invoke operating system to do the write

	;       exit(0)         equivalent system command
	mov     rax, 60	      ; system call 60 is exit
	xor     rdi, rdi      ; exit code 0
	syscall		      ; invoke operating system to exit
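
For comparison, the same write system call can be made from "C" through
the standard POSIX wrapper in unistd.h. This is only a sketch and the
function name say_hello is invented for the example. A "C" main would
simply return 0 where the assembly language uses system call 60.

```c
#include <unistd.h>   // write(), the POSIX wrapper for system call 1

static const char msg[] = "Hello World\n";   // 11 characters plus newline

// Sketch: returns the number of bytes written, 12 on success
long int say_hello(void)
{
  // file handle 1 is stdout, sizeof counts the ending 0 byte so subtract 1
  return (long int)write(1, msg, sizeof(msg) - 1);
}
```

The count sizeof(msg) - 1 plays the same role as  len: equ $-msg  above.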



A few basic BIOS calls:
See BIOS info

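BIOS calls only work while the processor is in 16-bit real mode, before
an operating system has taken control, e.g. in a boot sector.
A Linux process, on UMBC computers or on your own system, can not use
BIOS calls. Running with "root" privilege,  sudo ./a.out , does not
change this: the int 10h is trapped and the process stops with a
segmentation fault.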

If you can cause a program to boot from a floppy, a CD, a DVD, or
a flash drive, you can assemble a program that will run without
an operating system. The boot process uses the BIOS and the BIOS
has functions that can be called via interrupts.

A sample bootable program is boot1.asm

Another  bootable program is boot1a.asm

Very small bios test bios1.asm
This is for 64 bit computer:
; bios1.asm  use BIOS interrupt for printing
; Compiled and run using one Linux command line   
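;  nasm -f elf64 bios1.asm && ld bios1.o && ./a.out   
; Expect a segmentation fault: a Linux process can not reach the BIOS
	global  _start        ; standard ld main program

	section .text     
_start:

print1: mov rax,[ahal]
	int 10h		; write character to screen, trapped by Linux
	mov rax,[nl]
	int 10h		; write new line '\n'
	mov rax,60	; system call 60 is exit, _start has no caller to ret to
	xor rdi,rdi	; exit code 0
	syscall
ahal:	dq 0x0E28	 ; output to screen ah has 0E, al has '('
nl:	dq 0x0E0A	 ; '\n'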
; end bios1.asm

Another small bios test bios1_32.asm
This is for 32 bit computer:
; bios1_32.asm  use BIOS interrupt for printing
; Compiled and run using one Linux command line   
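;  nasm -f elf32 bios1_32.asm && ld bios1_32.o && ./a.out   
; Expect a segmentation fault: a Linux process can not reach the BIOS
	global  _start        ; standard ld main program
	section .text     
_start:

print1: mov eax,[ahal]	; 32-bit code uses eax, there is no rax
	int 10h		; write character to screen, trapped by Linux
	mov eax,[nl]
	int 10h		; write new line '\n'
	mov eax,1	; system call 1 is exit on 32-bit Linux
	xor ebx,ebx	; exit code 0
	int 80h
ahal:	dd 0x0E28	 ; output to screen ah has 0E, al has '('
nl:	dd 0x0E0A	 ; '\n'
; end bios1_32.asm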



A more complete bootable program, with subroutines, that uses a
printer on lpt 0 is:
bootreg.asm

Project 3 is assigned
See Project 3

Lecture 10 Hardware interface

Several views of computers to follow:

rip->instruction->decode->registers->alu->ear->data RAM etc.

First, a very complex computer architecture, the Itanium, IA_64
Just look at all those registers!

Then at end, three level cache.

 cs411_IA_64.pdf
Probably need to do   firefox cs411_IA_64.pdf
or use Adobe Reader  /afs/umbc.edu/users/s/q/squire/pub/www/images/cs411_IA_64.pdf

Block diagram of typical Intel computer.



Modern Intel computers do not directly execute our assembly language
instructions. Decoders are used to make a sequence of RISC instructions,
executed in the "simple architecture" below.



The computer architecture has a TLB, translation lookaside buffer,
that translates the addresses you see in the debugger, "virtual addresses",
into "physical addresses", the actual RAM addresses.
The physical address then goes to a cache that may already hold the RAM data
or instruction, thus avoiding the slow RAM access.
One TLB with cache looks like the figure below.
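
The translation step can be sketched in "C". Everything specific here
is an assumption made for illustration: 4 KB pages, a TLB of only 4
entries, and the names tlb_entry and tlb_translate. Real hardware
compares all entries in parallel rather than in a loop.

```c
#include <stdint.h>

#define PAGE_BITS 12     // assumption: 4 KB pages
#define TLB_SIZE  4      // assumption: tiny TLB, real ones hold many more

struct tlb_entry { int valid; uint64_t vpage; uint64_t ppage; };

// Sketch: returns 1 and sets *paddr on a TLB hit, returns 0 on a miss
int tlb_translate(const struct tlb_entry tlb[TLB_SIZE],
                  uint64_t vaddr, uint64_t *paddr)
{
  uint64_t vpage  = vaddr >> PAGE_BITS;                 // virtual page number
  uint64_t offset = vaddr & ((1ULL << PAGE_BITS) - 1);  // not translated
  int i;
  for(i=0; i<TLB_SIZE; i++)
  {
    if( tlb[i].valid && tlb[i].vpage == vpage )
    {
      *paddr = (tlb[i].ppage << PAGE_BITS) | offset;    // physical address
      return 1;
    }
  }
  return 0;              // miss: the page tables in RAM must be read
}
```

Only the page number is translated. The low 12 bits, the offset inside
the page, are the same in the virtual and the physical address.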



Now, a very simple architecture in which to follow instruction execution.
This is after decoding instructions and after the TLB and cache.
PC, the program counter, is the rip instruction pointer, a RAM address.
Instruction Memory would be  section .text
Data Memory would be  section .data  and  section .bss
ALU executes instructions  mov, add, sub, imul, idiv, shift, etc.
This architecture is not Intel.



Intel 82C55 needs assembly language, book page 396
peripheral interface adapter details

Lecture 11 Privileged instructions

The Intel 80x86 processors have privilege levels.
There are instructions that can only be executed at the highest
privilege level, CPL = 0. This would be reserved for the
operating system in order to prevent the average user from
causing chaos. e.g. The average user could issue a HLT instruction
to halt the machine and thus every process would be dead.
Other CPL=0 only instructions include:
  CLTS  Clear Task Switching flag in cr0
  INVD  Invalidate cache (no write back)
  INVLPG Invalidate translation lookaside buffer, TLB
  WBINVD Write Back and Invalidate cache

It should be obvious that, when running a multiprocessing operating
system, there are many instructions that only the operating
system should use.

The operating system controls the resources of the computer,
including RAM, I/O and user processes. Some sample protections
are tested by the following sample programs:

A few simple tests to be sure protections are working.
These three programs result in segfault, intentionally.
safe_64.asm store into read only section
; safe_64.asm   for testing protections within sections
; Assemble:	nasm -f elf64  safe_64.asm
; Link:		gcc -o safe_64  safe_64.o
; Run:		./safe_64
; Output:
; it should stop with a system caught error

        global main		; the standard gcc entry point
        extern	printf		; the C function, to be called

        section .rodata		; read only data section, constants
a:	dq	5		; long int a=5;
fmt:    db "Bad, still running",10,0


        section .text           ; Code section. not writeable
main:				; the program label for the entry point
        push    rbp		; set up stack frame

	mov	rax,0x789abcde
	mov	[a],rax		; should be error, read only section  !!!!!!!!!!
        mov	rdi,fmt		; address of format string
	mov	rax,0
	call	printf

        pop     rbp
	mov	rax,0		; normal, no error, return value
	ret			; return
	

safe1_64.asm store into code section
; safe1_64.asm   for testing protections within sections
; Assemble:	nasm -f elf64   safe1_64.asm
; Link:		gcc -o safe1_64  safe1_64.o
; Run:		./safe1_64
; Output:
; it should stop with a system caught error

        global main		; the standard gcc entry point
        extern	printf		; the C function, to be called

        section .rodata		; read only data section, constants
a:	dq	5		; long int a=5;
fmt:    db "Bad, still running",10,0


        section .text           ; Code section. not writeable
main:				; the program label for the entry point
        push    rbp		; set up stack frame

	mov	rax,0x789abcde
	mov	[main],rax	; should be error, can not change code .text !!!!!!
        mov	rdi,fmt		; address of format string
	mov	rax,0
	call	printf

        pop     rbp
	mov	rax,0		; normal, no error, return value
	ret			; return
	

safe2_64.asm jump (execute) data
; safe2_64.asm   for testing protections within sections
; Assemble:	nasm -f elf64  safe2_64.asm
; Link:		gcc -o safe2_64  safe2_64.o
; Run:		./safe2_64
; Output:
; it should stop with a system caught error

        global main		; the standard gcc entry point
        extern	printf		; the C function, to be called

        section .rodata		; read only data section, constants
a:	dq	5		; long int a=5;
fmt:    db "Bad, still running",10,0


        section .text           ; Code section. not writeable
main:				; the program label for the entry point
        push    rbp		; set up stack frame

	mov	rax,0x789abcde
	jmp	a		; should be error, can not execute data !!!!!!!!
        mov	rdi,fmt		; address of format string
	mov	rax,0
	call	printf

        pop     rbp
	mov	rax,0		; normal, no error, return value
	ret			; return
	

A few simple tests to be sure privileged instructions can not execute.
priv_64.asm hlt instruction to halt computer
; priv_64.asm   for testing that average user
;               can not execute privileged instructions 
; Assemble:	nasm -f elf64 priv_64.asm
; Link:		gcc -o priv_64  priv_64.o
; Run:		./priv_64
; Output:
; it should stop with a system caught error

        global main		; the standard gcc entry point
        extern	printf		; the C function, to be called
fmt:    db "bad! Still running",10,0	; The printf format, 10 is "\n", 0 ends string


        section .text           ; try to halt the computer
main:				; the program label for the entry point
        push    rbp		; set up stack frame

	hlt			; should be error, only allowed in CPL=0  !!!!!!!

        mov	rdi,fmt		; address of format string
	mov	rax,0
	call	printf

        pop     rbp
	mov	rax,0		; normal, no error, return value
	ret			; return
	

	

priv1_64.asm other privileged instructions
; priv1_64.asm   for testing that average user
;                can not execute privileged instructions 
; Assemble:	nasm -f elf64 priv1_64.asm
; Link:		gcc -o priv1_64  priv1_64.o
; Run:		./priv1_64
; Output:
; it should stop with a system caught error

        global main		; the standard gcc entry point
        extern	printf		; the C function, to be called
fmt:    db "bad! Still running",10,0	; The printf format, 10 is "\n", 0 ends string


        section .text           ; try other privileged instructions
main:				; the program label for the entry point
        push    rbp		; set up stack frame

	clts			; should be error, only allowed in CPL=0  !!!!!!!
        wbinvd			; never gets here, would also be an error

	
        mov	rdi,fmt		; address of format string
	mov	rax,0
	call	printf

        pop     rbp
	mov	rax,0		; normal, no error, return value
	ret			; return
	

	

In order to allow the user some access, controlled access, to
system resources, an interface to the operating system, or kernel,
is provided. You will see in the next lecture that some BIOS
functions are also provided as Linux kernel calls.


Need for speed: Some Brief History:
  The ISA card slots were replaced by PCI card slots, which in turn
  are being replaced by external USB devices. The
  serial port for RS232 devices is replaced by the USB port.
  Floppy disks are disappearing along with that connector on
  the motherboard. RAM still uses DIMM's and the slots have
  grown to handle 4, 8 and 16 gigabytes of memory. ATA hard
  drives are replaced by SATA hard drives, 4TB becoming available.
  Some rotating hard drives are being replaced by SSD, solid
  state drives. The printer port will be going as will the
  AGP graphics connector. That expensive graphics card you
  bought will probably not work in your new computer.

A standard engineering statement is:
Fast, Cheap, Reliable - pick any two.

The best method of measuring a computer's performance
is to use benchmarks. Some suggestions from my
personal experience preparing a benchmark suite
and several updates and personal benchmark
experience are presented in pdf format.


Smaller time is better, higher clock frequency is better.
time = 1 / frequency   T = 1/F   and  F = 1/T
1 nanosecond = 1 / 1 GHz
1 microsecond = 1 / 1 MHz

Definitions:
CPI    Clocks Per Instruction
MHz    Megahertz, millions of cycles per second
MIPS   Millions of Instructions Per Second = MHz / CPI
MOPS   Millions of Operations Per Second
MFLOPS Millions of Floating point Operations Per Second
MIOPS  Millions of Integer Operations Per Second  


Do not trust your computer's clock or the software
that reads and processes the time.

First: Test the wall clock time against your watch.

time_test.c
time_test.java
time_test.f90


The program displays 0, 5, 10, 15 ... at 0 seconds,
5 seconds, 10 seconds etc.

demonstrate time_test if possible



Note the use of <time.h> and 'time()'

Beware, midnight is zero seconds.
Then 60 sec/min * 60 min/hr * 24 hr/day = 86,400 sec/day
Just before midnight is 86,399 seconds.
Running a benchmark across midnight may give a negative time.


Then: Test CPU time, this should be just the time
used by the program that is running. With only
this program running, checking against your watch
should work. On a busy day on GL this could take 10 seconds
to give the first 5 second printout. This would need 16 students
running compute intensive programs.

time_cpu.c


The program displays 0, 5, 10, 15 ... at 0 seconds,
5 seconds, 10 seconds etc.

Note the use of <time.h> and 
  '(double)clock()/(double)CLOCKS_PER_SEC'

I have found one machine with the constant
CLOCKS_PER_SEC completely wrong and
another machine with a value 64 that should
have been 100. A computer used for real time
applications could have a value of 1,000,000
or more.

A computer benchmark will typically be some code that is executed
and the running time measured. 

A few simple rules about benchmarks:

1) Do not believe or trust any person, any company, any data.

2) Expect the same code to give different times on:
   different operating systems,
   different compilers,
   different computers from various manufacturers
             (IBM, Sun, Intel, AMD) even at same clock speed,
             (IBM Power fastest, AMD next fastest with same memory, cache)
   different languages, even for line by line translation.

3) If you want to measure optimization, turn it on,
   otherwise prevent all optimization.
             (Most compilers provide optimization choices)
             (Add code to prevent inlining of functions, force store)

4) You will probably be using your computer's clock to measure time.
   Test that the clock is giving valid results for the language
   you are using. The constant CLOCKS_PER_SEC in the "C" header
   file  time.h  has been found to be wrong.
   One manufacturer put a capacitor across the clock circuitry
   on a motherboard and all time measurements were half the
   correct time. See sample test below.

5) For measuring short times you will need to use the
   "double difference method". This method can be used to
   measure the time of a single instruction. This method
   should be used for any benchmark where one iteration of
   the code runs in less than a second. See sample test below.

6) Some methods of measuring time on a computer are only
   accurate to one second. Generally run enough iterations of
   your code in loops to get a ten second measurement.
   Some computers provide a real time clock as accurate as
   one microsecond, others one millisecond and some poorer than
   a fiftieth of a second.

7) Turn off all networking and stop all software that might run
   periodically. If possible, run in single user mode. You want to
   measure your code, not a virus checker or operating system.
   I once did measurement on a Sun computer running Solaris. It 
   seemed to slow down periodically. I found that the operating
   system periodically checked to see if any disk files needed
   to be written.

8) If you are interested in how fast your application might run
   on a new computer, find reputable benchmarks that are for
   similar applications. I do a lot of numerical computation, thus
   all my benchmarks are heavily floating point. You may be
   more interested in disk performance or network performance.

9) Do not run on all-zero data. Some compilers are very smart and
   may precompute your result without running your code.
   Be sure to use every result. Compilers do "dead code elimination"
   that checks for code where the results are not used and just
   does not produce instructions for that "dead code." An "if" test
   or printing out the result is typically sufficient. For vectors
   and arrays, usually printing out one element is sufficient.

10) It helps to be paranoid. Check that you get the same results
    by running n iterations, then 2n iterations. If the time did
    not double, you do not have a stable measurement. Run 4n and 8n
    and check again. It may not be your benchmark code, it may be
    an operating system activity.

11) Do not run a benchmark across midnight. Most computers reset
    the seconds to zero at midnight.

12) Keep values of time as double precision numbers.

13) Given an algorithm where you can predict the time increase
    as the size of data increases: e.g. FFT is order  n log2 n,
    multiplying a matrix by a matrix is order n^3, expect
    non uniform results for some values of n.

    Consider the case where all your code and all your data fit
    in the level one caches. This will be the fastest.

    Consider when your data is much larger than the level one cache
    yet fits in the level two cache. You are now measuring the
    performance of the level two cache.

    Consider when your data fits in RAM but is much larger than
    your level two (or three) cache. You are now measuring the speed
    of your code running in RAM.

    Consider when your data is much larger than your RAM, you are
    now running in virtual memory from your disk drive. This will
    be very slow and you are measuring disk performance.


The "Double Difference Method" tries to get accurate measurement
for very small times. The code to time a single floating point
add instruction is shown below. The principle is:

  measure time, t1

  run a test harness with loops that has everything except the code
  that you want to time. Count the number of executions as a check.

  measure time, t2

  measure time, t3

  run exactly the same code from the test harness with only the
  feature you want to measure added. Count number of executions.

  measure time, t4

  check that the number of executions is the same.
  check that  t2-t1 was more than 10 seconds

  the time for the feature you wanted to measure is

  t5 = ((t4 - t3) - (t2 - t1))/ number of executions

  basically measured time minus test harness time divided by the
  number of executions.

 /* time_fadd.c  try to measure time of double  A = A + B;         */
 /*              roughly time of one floating point add            */
 /*              using double difference and minimum and stability */

 #include <time.h>
 #include <stdio.h>
 #include <math.h>

 #define dabs(a) ((a)<0.0?(-(a)):(a))
 void do_count(int * count_check, int rep, double * B);

 int main(int argc, char * argv[])
 {
   double t1, t2, t3, t4, tmeas, t_min, t_prev, ts, tavg;
   double A, B, Q;
   int stable;
   int i, j;
   int count_check, outer;
   int rep, min_rep;


   t_min = 10.0;    /* 10.0 seconds typical minimum measurement time */
   Q  = 5.0;        /* 5.0 typical approximate percentage stability */
   min_rep = 32;    /* minimum of 32 typical */
   outer = 100000;  /* some big number */

   printf("time_fadd.c \n");
   printf("min time %g seconds, min stability %g percent, outer loop=%d\n",
          t_min, Q, outer);


   stable = 5; /* max tries */
   t_prev = 0.0;
   ts = 0.0;   /* defined even if no stable measurement is reached */
   tavg = 0.0;
   for(rep=min_rep; rep<100000; rep=rep+rep) /* increase until good measure */
   {
     A = 0.0;
     B = 0.1;
     t1 = (double)clock()/(double)CLOCKS_PER_SEC;
     for(i=0; i<outer; i++) /* outer control loop */
     {
       count_check = 0;
       for(j=0; j<rep; j++)   /* inner control loop */
       {
          do_count(&count_check, rep, &B);
       }
     }
     t2 = (double)clock()/(double)CLOCKS_PER_SEC;
     if(count_check != rep) printf("bad count_check_1 %d \n", count_check);

     A = 0.0;
     t3 = (double)clock()/(double)CLOCKS_PER_SEC;
     for(i=0; i<outer; i++) /* outer measurement loop */
     {
       count_check = 0;
       for(j=0; j<rep; j++)   /* inner measurement loop */
       {
         do_count(&count_check, rep, &B);
         A = A + B;   /* item being measured, approximately FADD time */
       }
     }
     t4 = (double)clock()/(double)CLOCKS_PER_SEC;
     if(count_check != rep) printf("bad count_check_2 %d \n", count_check);

     tmeas = (t4-t3) - (t2-t1); /* the double difference */
     printf("rep=%d, t measured=%g \n", rep, tmeas);

     if((t4-t3)<t_min) continue; /* need more rep */

     if(t_prev==0.0)
     {
       printf("tmeas=%g, t_prev=%g, rep=%d \n", tmeas, t_prev, rep);
       t_prev = tmeas;
     } 
     else /* compare to previous */
     {
       printf("tmeas=%g, t_prev=%g, rep=%d \n", tmeas, t_prev, rep);
       ts = 2.0*(dabs(tmeas-t_prev)/(tmeas+t_prev));
       tavg = 0.5*(tmeas+t_prev);
       if(100.0*ts < Q)  break; /* above minimum and stable */
       t_prev = tmeas;
     }
     stable--;
     if(stable==0) break;
     rep = rep/2; /* hold rep constant */
   } /* end loop increasing rep */

   /* stable? and over minimum */
   if(stable==0) printf("rep=%d  unstable \n", rep);
   if(tmeas<t_min) printf("time measured=%g, under minimum \n", tmeas);
   printf("raw time=%g, fadd time=%g, rep=%d, stable=%g%% \n\n", tmeas, 
          (tavg/(double)outer)/(double)rep, rep, 100.0*ts);
   return 0;
 } /* end time_fadd.c */

 /* do_count to prevent dead code elimination  */
 void do_count(int * count_check, int rep, double * B)
 {
   (*count_check)++;
   /* could change B but probably don't have to. */
 }

time_fadd_sgi.out

Lecture 12 Linux kernel calls


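For modern 64-bit computers:
When using  int 0x80  the system call numbers are the same as on 32-bit Linux,
%eax becomes %rax, %ebx becomes %rbx, etc.
Caution: int 0x80 is the 32-bit compatibility interface. The kernel uses only
the low 32 bits of each register, so addresses passed must fit in 32 bits.
The native 64-bit interface is the  syscall  instruction, which uses
different call numbers and different registers.
"unsigned int" becomes "unsigned long int", size_t is 64-bit, etc.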
System Call Table

Load registers then   int  0x80   
Another web site http://asm.sourceforge.net/syscall.html

When making Linux kernel calls from a "C" program, you will need
#include <unistd.h>

A tiny sample, using only system calls, that prints a heading
syscall0_64.asm
; syscall0_64.asm   demonstrate system, kernel, calls
; Compile:	nasm -f elf64 syscall0_64.asm
; Link		gcc -o syscall0_64 syscall0_64.o
; Run:		./syscall0_64
;
	section	.data
msg:	db  "syscall0_64.asm running",10	; the string to print, 10=newline (LF)
len:	equ $-msg     ; "$" means here, len is a value, not an address
	global	main
	section .text
main:

; header msg			; these 5 lines are like  printf
	mov	rdx,len		; arg3, length of string to print
	mov	rcx,msg		; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel
	
	mov	rbx,0		; exit code, 0=normal
	mov	rax,1		; exit command to kernel
	int	0x80		; interrupt 80 hex, call kernel




A sample syscall1_64.asm
demonstrates file open, file read (in hunks of 8192)
and file write (the whole file!)
This program reads and prints itself:

; syscall1_64.asm   demonstrate system, kernel, calls
; Compile:	nasm -f elf64 syscall1_64.asm
; Link		gcc -o syscall1_64 syscall1_64.o
; Run:		./syscall1_64
;

	section	.data
msg:	db  "syscall1_64.asm running",10	; the string to print, 10=newline (LF)
len:	equ $-msg     ; "$" means here, len is a value, not an address
msg2:	db  "syscall1_64.asm finished",10
len2:	equ $-msg2
msg3:	db  "syscall1_64.asm opened",10
len3:	equ $-msg3
msg4:	db  "syscall1_64.asm read",10
len4:	equ $-msg4
msg5:	db  "syscall1_64.asm open fail",10
len5:	equ $-msg5
msg6:	db  "syscall1_64.asm another open fail",10
len6:	equ $-msg6
msg7:	db  "syscall1_64.asm read fail",10
len7:	equ $-msg7

name:	db  "syscall1_64.asm",0 ; "C" string also used by OS
fd:	dq  0			; file descriptor
flags:	dq  0			; hopefully read-only
	section .bss
line:	resb  8193		; read/write buffer 16 sectors of 512
lenbuf:	resq     1		; number of bytes read

	extern  open
	global	main
	section .text
main:
        push    rbp		; set up stack frame

; header msg
	mov	rdx,len		; arg3, length of string to print
	mov	rcx,msg		; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel

open1:	
	mov	rdx,0		; mode
	mov	rcx,0		; flags, 'r' equivalent O_RDONLY
	mov	rbx,name	; file name to open
	mov	rax,5		; open command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel
        mov	[fd],rax	; save fd
	cmp	rax,2		; test for fail
	jg	read		; file open

; file open failed msg5
	mov	rdx,len5	; arg3, length of string to print
	mov	rcx,msg5	; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel
	jmp	fail		; no file, do not try to read

read:	
; file opened msg3
	mov	rdx,len3	; arg3, length of string to print
	mov	rcx,msg3	; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel

doread:	
	mov	rdx,8192	; max to read
	mov	rcx,line	; buffer
	mov	rbx,[fd]	; fd
	mov	rax,3		; read command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel
	mov	[lenbuf],rax	; number of characters read
	cmp	rax,0		; test for fail
	jg	readok		; some read

; read failed msg7
	mov	rdx,len7	; arg3, length of string to print
	mov	rcx,msg7	; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel
	jmp	fail		; nothing read

; file read msg4
readok:	
	mov	rdx,len4	; arg3, length of string to print
	mov	rcx,msg4	; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel

write:	
	mov	rdx,[lenbuf]	; length of string to print
	mov	rcx,line	; pointer to string
	mov	rbx,1		; where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel

fail:	
; finished msg2
	mov	rdx,len2	; arg3, length of string to print
	mov	rcx,msg2	; arg2, pointer to string
	mov	rbx,1		; arg1, where to write, screen
	mov	rax,4		; write command to int 80 hex
	int	0x80		; interrupt 80 hex, call kernel
	
	mov	rbx,0		; exit code, 0=normal
	mov	rax,1		; exit command to kernel
	int	0x80		; interrupt 80 hex, call kernel


Now,  a little help with project 2

cmpe310_proj.shtml

Lecture 13 Review

Go over lectures 1 through 12.
Go over your project 1 solution

Sample exam will be handed out in class.

Almost all questions will be from web pages.

Closed book. Closed notes. No internet or cell phones.

Multiple choice, true-false, one number  questions.

You will not be asked to do any programming on exam.

There will be code, asking if it will assemble and
asking if it will work correctly.

Lecture 14 mid-term exam

33 questions: some true-false, some multiple choice,
some short answer, e.g. convert decimal to binary or binary to decimal
A few one line to three line assembly language questions.

Lecture 15 Memory hardware organization


8086 chip

Memory hardware organization



DDR4 available


some review:
For these notes:
  1 = true = high = value of a digital signal on a wire
  0 = false = low = value of a digital signal on a wire
  X = unknown or indeterminate to people, not on a wire

A digital logic gate can be represented at least three ways,
we will interchangeably use: schematic symbol, truth table or equation.
The equations may be from languages such as mathematics, VHDL or Verilog.

Digital logic gates are connected by wires. A wire or a group of
wires can be given a name, called a signal name. From an electronic
view the digital logic wire has a high or a low (voltage) but we
will always consider the wire to have a one (1) or a zero (0).

The basic logic gates are shown below.

The basic  "and"  gate:
truth table   equation      symbol   
 a b | c
 ----+--     c <= a and b;         
 0 0 | 0
 0 1 | 0     c = a & b;
 1 0 | 0
 1 1 | 1     c = and(a,b)


The basic three-input  "and"  gate:
truth table   equation      symbol    
 a b c | d   d = and(a, b, c)  
 ------+--
 0 0 0 | 0  notice how a truth table has the inputs
 0 0 1 | 0  counting 0, 1, 2, ... in binary.
 0 1 0 | 0
 0 1 1 | 0  the output (may be more than one bit) is
 1 0 0 | 0  after the vertical line, on the right.
 1 0 1 | 0
 1 1 0 | 0
 1 1 1 | 1


The basic  "or"  gate:
truth table   equation      symbol   
 a b | c
 ----+--     c <= a or b;         
 0 0 | 0
 0 1 | 1     c = a | b;
 1 0 | 1
 1 1 | 1     c = or(a,b)


The basic  "nand"  gate:
truth table   equation      symbol   
 a b | c
 ----+--     c <= a nand b;         
 0 0 | 1
 0 1 | 1     c = ~ (a & b);
 1 0 | 1
 1 1 | 0     c = nand(a,b)


The basic  "nor"  gate:
truth table   equation      symbol   
 a b | c
 ----+--     c <= a nor b;         
 0 0 | 1
 0 1 | 0     c = ~ (a | b);
 1 0 | 0
 1 1 | 0     c = nor(a,b)


The basic  "xor"  gate:
truth table   equation      symbol   
 a b | c
 ----+--     c <= a xor b;         
 0 0 | 0
 0 1 | 1     c = a ^ b;
 1 0 | 1
 1 1 | 0     c = xor(a,b)


The basic  "xnor"  gate:
truth table   equation      symbol   
 a b | c
 ----+--     c <= a xnor b;         
 0 0 | 1
 0 1 | 0     c = ~ (a ^ b);
 1 0 | 0
 1 1 | 1     c = xnor(a,b)


The basic  "not"  gate:
truth table   equation      symbol   
 a | b
 --+--     b <= not a;         
 0 | 1
 1 | 0     b = ~ a;

           b = not(a)


It is known that there are 16 Boolean functions with two inputs.
In fact, for any number of inputs, n, there are  2^(2^n)  Boolean
functions ( two to the power of two to the nth).
For  n=2      16 functions  2^4
     n=3     256 functions  2^8
     n=4  65,536 functions  2^16
     n=5  over four billion functions  2^32

The truth table for all Boolean functions of two inputs is

                     n   x
         n         x a a n
         o   _   _ o n n o   1   1 o
 a b | 0 r 2 a 4 b r d d r b 1 a 3 r 1
 ----+--------------------------------
 0 0 | 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
 0 1 | 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
 1 0 | 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
 1 1 | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

Notice that for two input variables, a  b, there are 2^2 = 4 rows
Notice that for four rows there are 2^4 = 16 columns.

A question is: Which are "universal" functions from which all
other functions can be obtained?

The answer is that either "nand" or "nor" can be used to create
all other functions (when having 0 and 1 available). It turns out
that electric circuits rather naturally create "nand" or "nor"
gates. No more than five "nand" gates or five "nor" gates are
needed in creating any of the 16 Boolean functions of two inputs.

Here are the circuits using only "nand" to get all 16 functions.

A signal may have more values than just 0 and 1:
H = 1     high, weak
L = 0     low, weak
X =       unknown
Z =       high impedance, usually open circuit, or bus
W =       unknown, weak
U =       uninitialized
- =       don't care = X = unknown

Truth tables for  and, or, ...  using all VHDL std_logic signal values:
t_table.out

Generated by program, digital logic simulator, VHDL
t_table.vhdl

Similarly, for verilog
t_table_v.out

Generated by program, digital logic simulator, Verilog
t_table.v

Some of these values only become useful when two gate outputs
are connected together, this may be a "wired and" or
a "wired or" depending on how the circuit is implemented
in silicon. If two wires are connected together and one
is high impedance, Z, then the other wire would control.
Also, weak will be controlled by the other wire.

Lecture 16 Memory decoding and wiring


Suppose you had a processor with 32 memory address bits,
capable of addressing 4GB of RAM, yet you had four old
1GB RAM chips lying around:
OK, wire processor address bits A0..A29 to all four 1GB RAM chips.
Add gates to decode A30, A31 two bits into four signals
CS00, CS01, CS10, CS11 and wire one of these to each RAM chip CS,
Chip Select. Now your processor has 4GB of RAM.
(Note: if RAM outputs 16 bits, two bytes, A0 is unused,
 for 32 bit output A0,A1 unused, for 64 bit output A0,A1,A2 unused)

Memory decoding and wiring
Book chapter 10

16L8 PAL used in some above memory
Book page 347.



some review
Combinational digital logic uses Boolean Algebra.
The basic relations are well known, yet several notations
are used.

Notation A: use words "and" "or" "not" etc.
Notation B: use characters  & for "and", | for "or", ~ for "not"
Notation C: use characters  * for "and", + for "or", - for "not"
Notation D: use symbols  "dot" for "and", + for "or", bar for "not"
Notation E: use symbols  "blank" for "and", + for "or", bar for "not"

Generally, the symbols for "and" are like the symbols for multiply,
           the symbols for  "or" are like the symbols for addition.
           In mathematics, multiplication always has precedence over addition,
           do not expect "and" to always have precedence over "or".

Here are 19 basic identities that can be used to simplify
or convert one Boolean equation to another.

1.   X + 0 = X   "or" anything with zero gives anything
2.   X * 1 = X   "and" anything with one gives anything
3.   X + 1 = 1   "or" anything with one gives one
4.   X * 0 = 0   "and" anything with zero gives zero
5.   X + X = X   "or" with self gives self
6.   X * X = X   "and" with self gives self
         _
7.   X + X = 1   "or" with complement gives one
         _
8.   X * X = 0   "and" with complement gives zero
9.   not(not(X)) = X  any even number of complements cancel
10.  X + Y = Y + X   "or" is commutative
11.  X * Y = Y * X   "and" is commutative
12.  X + (Y + Z) = (X + Y) + Z  "or" is associative
13.  X * (Y * Z) = (X * Y) * Z  "and" is associative
14.  X * (Y + Z) = X * Y + X * Z     distributive law
15.  X + Y * Z   = (X + Y) * (X + Z) distributive law
             _________
               _   _
16.  X + Y = ( X * Y )   DeMorgan's theorem
     _________   _   _
17.  ( X + Y ) = X * Y   DeMorgan's theorem
     _________   _   _
18.  ( X * Y ) = X + Y   DeMorgan's theorem
             _________
               _   _
19.  X * Y = ( X + Y )   DeMorgan's theorem

Basically, DeMorgan's theorem says:
Convert "and" to "or", negate the variables and negate the entire expression.

Convert "or" to "and", negate the variables and negate the entire expression.

Any truth table can be converted to an equation or schematic.

Given any truth table, there is a simple procedure for generating
a Boolean equation that uses "and", "or" and "not" (any representation).

First an example:
Given truth table
      a b | c     for each row where 'c' is 1,              _
      ----+--     create an "and" with 'a' if 'a' is 1, or 'a' if 'a' is 0 
      0 0 | 1                                               _
      0 1 | 0                     with 'b' if 'b' is 1, or 'b' if 'b' is 0
      1 0 | 1                       _   _
      1 1 | 1     thus, first row   a * b
                                        _
                        third row   a * b

                        fourth row  a * b

                  now, "or" the "and's" to form the final equation
                       _   _         _
                  c = (a * b) + (a * b) + (a * b)

                  c <= (not a and not b) or (a and not b) or (a and b);

The schematic can be drawn directly,
  one "and" gate for each row where 'c' is 1
    with a bubble for each variable that is 0


 
The general process to convert a truth table (or partial truth table)
to a Boolean equation using "and" "or" "not" is:
 For each output
   For each row where the output is 1
     create a minterm that is the "and" of the input variables with
     the input variable complemented when the input variable is 0.
   The output is the "or" of the above minterms.

Another example with three input variables and two outputs.

 a b c | s co
 ------+-----
 0 0 0 | 0 0         _ _       _   _       _ _
 0 0 1 | 1 0    s = (a*b*c) + (a*b*c) + (a*b*c) + (a*b*c)
 0 1 0 | 1 0
 0 1 1 | 0 1          _           _           _
 1 0 0 | 1 0    co = (a*b*c) + (a*b*c) + (a*b*c) + (a*b*c)
 1 0 1 | 0 1
 1 1 0 | 0 1
 1 1 1 | 1 1

The exact same information is presented by the schematic:



Note that this is not a minimum representation, we will talk
about minimizing digital logic in a few lectures.

Any equation can be converted to a truth table.

Example, convert  c <= (a and b) or (not a and b);
                                 _
                  c = (a * b) + (a * b)  to a truth table

We can immediately construct the truth table structure.
We see the input variables are 'a' and 'b' and the output is 'c'
We generate all possible values for input by counting in binary.

 a b | c
 ----+--
 0 0 |
 0 1 |
 1 0 |
 1 1 |

The only step that remains is to fill in the 'c' column.
For the first row, substitute 0 for 'a' and 0 for 'b' in the
equation and evaluate to find 'c'
                       _
        c = (0 * 0) + (0 * 0) = (0) + ( 1 * 0) = 0  (using identities above)
For the second row, substitute 0 for 'a' and 1 for 'b' in the
equation and evaluate to find 'c'
                       _
        c = (0 * 1) + (0 * 1) = (0) + ( 1 * 1) = 1  (using identities above)

For the third row, substitute 1 for 'a' and 0 for 'b' in the
equation and evaluate to find 'c'
                       _
        c = (1 * 0) + (1 * 0) = (0) + ( 0 * 0) = 0  (using identities above)

For the fourth row, substitute 1 for 'a' and 1 for 'b' in the
equation and evaluate to find 'c'
                       _
        c = (1 * 1) + (1 * 1) = (1) + ( 0 * 1) = 1  (using identities above)

Filling in the values for 'c' gives the completed truth table:
 a b | c
 ----+--
 0 0 | 0
 0 1 | 1
 1 0 | 0
 1 1 | 1

Any digital logic schematic can be converted to a truth table.

Any equation can be converted to a schematic and any schematic
can be converted to an equation. When converting a schematic to
a truth table directly, you are simulating the actual behavior
of the digital logic.

From schematic below we can immediately construct the truth table structure.
We see the input variables are 'a' and 'b' and the output is 'c'
We generate all possible values for input by counting in binary.

 a b | c
 ----+--
 0 0 |
 0 1 |
 1 0 |
 1 1 |


Now, start by placing the values of input signals on the
input wires. a=0, b=0.

Note that signals other than inputs are labeled X for unknown.

Then, as shown on the sequence of figures, propagate the signals.
For each gate, use the gate input to compute the gate output.
This is actually how the hardware works. Each gate is continually
using the inputs to produce the output, with a small delay.
All gates operate in parallel. All gates operate all the time.



Working a little faster, apply truth table inputs
a=0, b=1, then a=1, b=0, and finally a=1, b=1.




Filling in the values for 'c' gives the completed truth table:
 a b | c
 ----+--
 0 0 | 0
 0 1 | 1
 1 0 | 0
 1 1 | 1


The previous lecture shows generating truth tables
with VHDL or Verilog logic simulators.

Lecture 17 Memory RAM, DRAM


Memory RAM, DRAM

Static RAM using 6 transistors per bit. One for set, one for reset,
four in a basic cross coupled nand gate flipflop.




A mosfet dynamic ram, one transistor and one capacitor.
The capacitor must be refreshed often. RAS mode:


Mosfet dynamic ram CAS mode:


Recent generations of DRAM chips contain an integral refresh counter,
and the memory control circuitry can either use this counter or provide 
a row address from an external counter. These chips have three standard 
ways to provide refresh, selected by different patterns of signals on the 
"column select" (CAS) and "row select" (RAS) lines.

"RAS only refresh" - In this mode the address of the row to refresh 
is provided by the address bus lines, so it is used with external 
counters in the memory controller.

"CAS before RAS refresh" (CBR) - In this mode the on-chip counter keeps 
track of the row to be refreshed and the external circuit merely 
initiates the refresh cycles. This mode uses less power because the memory 
address bus buffers don't have to be powered up. It is used in most modern 
computers.

"Hidden refresh" - This is an alternate version of the CBR refresh cycle 
which can be combined with a preceding read or write cycle. The refresh 
is done in parallel during the data transfer, saving time.

In the latest (2012) generation of chips the "RAS only" mode has been 
eliminated, and the internal counter is used to generate refresh. 
The chip has an additional "sleep mode", for use when the computer 
is in hibernation, in which an on-chip oscillator generates internal 
refresh cycles so that the external clock can be shut down.



some review
"Combinational logic" means gates connected together without feedback.
There is no storage of information. Inputs are applied and outputs
are produced. By convention, we draw combinational logic from
inputs on the left to outputs on the right. For large schematic
diagrams this convention is often violated.

When no constraints are given, any of the gates previously
defined can be connected to design a circuit that performs
the stated function.

Example: Design a circuit that has:
  an input for tail lights both on
  an input for right turn that lets the signal "osc" control right tail light.
  an input for left turn that lets the signal "osc" control left tail light.
  ("osc" will make the light flash on and off as a turn indicator.)

  Constraint: use "and" and "or" gates with inversion bubbles allowed

Solution: There are four inputs "tail" "right" "left" and "osc"
          There are two outputs "right_light" and "left_light"

  The general strategy in design is to work backward from an output.
  Yet, as usual, some work from input toward output is also used.

  "right_light" must select between "tail" and "osc". Selection
  can typically be implemented by "and" gates feeding an "or" gate
  with a control signal into one "and" gate and its complement into
  the other "and" gate.
  
  Analyzing this circuit, if "right" is off, "tail" controls
  the "right_light". If "right" is on, "osc" controls the "right_light".

A common symbol for this circuit is a multiplexor, mux for short.
The same circuit as above is usually drawn as the schematic diagram:



  Now we can use the first schematic with new labeling for
  the "left_light", combining the circuits yields:
  

Now a new requirement is added, the flashers must override all
other signals and make "osc" drive both right and left tail lights.

A typical design technique is to build on existing designs,
thus note that "flash" only needs to be able to turn on both
the old "right" and old "left". This is two "or" functions
that are easily added to the previous circuit.
  

In general a multiplexor can have any number of inputs.
Typically the number of inputs is a power of two and the
control signal, ctl, has the number of bits in the power.
  

 ctl | out  Note that "ctl" is a two bit signal, shown by the "2"
 ----+----
 0 0 |  a   The truth table does not have to expand
 0 1 |  b   a, b, c and d  because the mux just passes
 1 0 |  c   the values through to "out" based on the
 1 1 |  d   value of "ctl"
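
A four input mux can be modeled the same way as the two input mux.
The sketch below is an assumed gate-level form (four "and" gates into
one "or" gate), with the two bits of "ctl" passed as ctl1 and ctl0;
the function name mux4 here is just for illustration.

```python
def mux4(ctl1, ctl0, a, b, c, d):
    """4-input mux on 0/1 signals; (ctl1, ctl0) selects a, b, c or d."""
    n1, n0 = 1 - ctl1, 1 - ctl0          # inverted control bits
    return ((a & n1 & n0) | (b & n1 & ctl0) |
            (c & ctl1 & n0) | (d & ctl1 & ctl0))

# select input "c" (ctl = 1 0) from a=0, b=0, c=1, d=0
print(mux4(1, 0, 0, 0, 1, 0))   # 1
```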


For a general circuit that has some type of description, we use
a rectangle with some notation indicating the function of the
circuit. The inputs and outputs are given signal names.
  

Homework 4 is assigned

Project 4 is assigned

Lecture 18 Memory DRAM, DDR, Flash


Memory DRAM, DDR, Flash
Based on Chapter 10, p328 in textbook, DDR comment on p373

some review:
There are many simulation and design tools available for digital logic.

There are major commercial Electronic Design Automation, EDA, systems
for today's digital logic. Cadence is one of today's major suppliers and
UMBC has Cadence software available on GL computers.
Mentor Graphics, Synopsys and others provide large tool sets.

Altera and Xilinx are major providers of software for making custom
integrated circuits using Field Programmable Gate Arrays, FPGA.
 www.altera.com 
Altera has a downloadable student version.

 www.xilinx.com 
See video ZYNQ relates to DDR4

 free download xilinx


The best WEB site to find free EDA tools is www.geda.seul.org


FPGA and other CAD information 
You can get working chips from VHDL using synthesis tools.
No connecting wires between components.

One of the quickest ways to get chips is to use FPGA's,
Field Programmable Gate Arrays.
The two companies listed below provide the software and the
foundry for you to design your own integrated circuit chips:

 www.altera.com 

 www.xilinx.com 

Complete Computer Aided Design, CAD, packages are available from
companies such as Cadence, Mentor Graphics and Synopsys.


For projects for this section of CMPE 310 we will not be using Cadence VHDL
and Cadence Verilog that are available on linux.gl.umbc.edu.
You have probably had Verilog and will get VHDL in CS411.



Using Cadence VHDL on Linux.GL machines
  First: You must have an account on a GL machine. Every student
         and faculty should have this, you can use Lab ITE 375.
         Either log in directly to linux.gl.umbc.edu or
         Use:   
         ssh  linux.gl.umbc.edu  or  log in at ITE 375
         cd cmpe310  # or replace this with the directory you are using

         Then do your own thing with Makefile for other VHDL files

         Remember each time you log on to do simulations:
         source vhdl_cshrc
         make -f Makefile_vhdl1             # or do your own thing.





"Hello World" sample programs in VHDL
As usual, learn a language by starting with a simple "hello" program:

VHDL  hello.vhdl
hello.run used in simulation
hello_vhdl.out output of simulation

Note two major parts of a VHDL program:
The "entity" is the interface, and the "architecture" is the implementation.
     hello                              circuits

-- hello.vhdl  Just output to the screen
-- compile and run commands
-- 	ncvhdl -v93 hello.vhdl
--	ncelab -v93 hello:circuits
--	ncsim -batch  -logfile hello_vhdl.out -input hello.run hello


entity hello is  -- test bench (top level like "main")
end entity hello;

library STD;
use STD.textio.all;                     -- basic I/O
library IEEE;
use IEEE.std_logic_1164.all;            -- basic logic types
use IEEE.std_logic_textio.all;          -- I/O for logic types

architecture circuits of hello is -- where declarations are placed
  subtype word_32 is std_logic_vector(31 downto 0);
  signal four_32 : word_32 := x"00000004";    -- just four
  signal counter : integer := 1;              -- initialized counter
  alias swrite is write [line, string, side, width] ;
begin  -- where code is placed
  my_print : process is
               variable my_line : line;  -- type 'line' comes from textio
             begin
               write(my_line, string'("Hello VHDL"));  -- formatting
               writeline(output, my_line);              -- write to "output"
               swrite(my_line, "four_32 = ");     -- formatting with alias
               hwrite(my_line, four_32); -- format type std_logic_vector as hex
               swrite(my_line, "  counter= ");
               write(my_line, counter);  -- format 'counter' as integer
               swrite(my_line, " at time ");
               write(my_line, now);                     -- format time
               writeline(output, my_line);              -- write to display
               wait;
             end process my_print;
end architecture circuits;


 Verilog hello.v
hello_v.out output of simulation

// hello.v   First Verilog program
// command to compile and run
//         verilog -q -l hello_v.out hello.v

module hello;
  reg [31:0] four_32;
  integer counter;
   
  initial
    begin
      four_32 = 32'b00000000000000000000000000000100; 
      counter = 1;
      $display("Hello Verilog");
      $display("%b", four_32);
      $display("counter = %d", counter); 
    end
endmodule // hello

// output
// Hello Verilog
// 00000000000000000000000000000100
// counter =           1

Lecture 19 Input Output wiring


8255 data sheet used in HW 5, Proj 4

Input Output wiring


some review:
Basic decimal addition (with carry digit shown)
  101  <- carry (note that three numbers are added after first digit)

   567
 + 526
 -----
  1093

Binary addition (with carry bit shown)
  1011  <- carry (note that three bits are added after first bit)
           for future reference c(3)=1, c(2)=0, c(1)=1, c(0)=1
   1011    bits are numbered from zero, right to left
 + 1001
  -----
  10100    for future reference s(3)=0, s(2)=1, s(1)=0, s(0)=0
           the leftmost '1' is cout

Since three bits must be added, a truth table for a full adder
needs three inputs and thus eight entries.

 a b c | s co
 ------+-----        _ _       _   _       _ _
 0 0 0 | 0 0    s = (a*b*c) + (a*b*c) + (a*b*c) + (a*b*c) 
 0 0 1 | 1 0        simplifies to
 0 1 0 | 1 0    s = a xor b xor c
 0 1 1 | 0 1    s <= a xor b xor c;
 1 0 0 | 1 0          _           _           _
 1 0 1 | 0 1    co = (a*b*c) + (a*b*c) + (a*b*c) + (a*b*c)
 1 1 0 | 0 1         simplifies to
 1 1 1 | 1 1    co = (a*b)+(a*c)+(b*c)
                co <= (a and b) or (a and c) or (b and c);
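
The simplification can be checked by brute force over all eight rows.
The Python sketch below (function names are illustrative only)
compares the unsimplified sum of products against the simplified
"xor" and majority forms.

```python
from itertools import product

def fadd(a, b, c):
    """Full adder on 0/1 inputs, simplified equations; returns (s, co)."""
    s = a ^ b ^ c
    co = (a & b) | (a & c) | (b & c)
    return s, co

def fadd_sop(a, b, c):
    """Same adder written directly from the four minterms of each output."""
    na, nb, nc = 1 - a, 1 - b, 1 - c
    s = (na & nb & c) | (na & b & nc) | (a & nb & nc) | (a & b & c)
    co = (na & b & c) | (a & nb & c) | (a & b & nc) | (a & b & c)
    return s, co

for a, b, c in product((0, 1), repeat=3):
    assert fadd(a, b, c) == fadd_sop(a, b, c)   # both forms agree on every row
    print(a, b, c, "|", *fadd(a, b, c))
```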

This can be drawn as a box for use on larger schematics

      +-------+
      | a b c |  The inputs are shown at the top (or left)
      |       |
      | fadd  |
      |       |
      | co  s |  The outputs are shown at the bottom (or right)
      +-------+

The full adder can be written as an entity in VHDL

library IEEE;
use IEEE.std_logic_1164.all; -- defines std_logic

entity fadd is               -- full stage adder, interface
  port(a  : in  std_logic;
       b  : in  std_logic;
       c  : in  std_logic;
       s  : out std_logic;
       co : out std_logic);
end entity fadd;

architecture circuits of fadd is  -- full adder stage, body
begin  -- circuits of fadd
  s <= a xor b xor c after 1 ns;
  co <= (a and b) or (a and c) or (b and c) after 1 ns;
end architecture circuits; -- of fadd

The full adder can be written as a module in Verilog

module fadd(a, b, cin, sum, cout); // from truth table
  input  a;     // a input
  input  b;     // b input
  input  cin;   // carry-in
  output sum;   // sum output
  output cout;  // carry-out
  assign sum = (~a&~b&cin)|(~a&b&~cin)|(a&~b&~cin)|(a&b&cin);
  assign cout = (a&b)|(a&cin)|(b&cin); // carry when two or more inputs are 1
endmodule // fadd



Connecting four full adders, four fadd's, to make a 4-bit adder

 

The connections are written for VHDL as 

  a0: entity WORK.fadd port map(a(0), b(0),  cin, s(0), c(0));
  a1: entity WORK.fadd port map(a(1), b(1), c(0), s(1), c(1));
  a2: entity WORK.fadd port map(a(2), b(2), c(1), s(2), c(2));
  a3: entity WORK.fadd port map(a(3), b(3), c(2), s(3), cout);

The connections are written for Verilog as 

  // instantiate modules
  fadd bit0(a[0], b[0], cin,  sum[0], c[0]);
  fadd bit1(a[1], b[1], c[0], sum[1], c[1]);
  fadd bit2(a[2], b[2], c[1], sum[2], c[2]);
  fadd bit3(a[3], b[3], c[2], sum[3], cout);


Note that the carry out of the previous stage is wired into
the carry input of the next higher stage. In a computer,
four bits are added to four bits and this produces four bits of sum.
The last carry bit, c(3) here, is usually called 'cout' and is
not called a 'sum' bit.
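
The same wiring can be traced in software. This Python sketch (add4
here is a model written for these notes, not the add4.v file) chains
four full adders and keeps the intermediate carries c(0) .. c(3).

```python
def fadd(a, b, c):
    """Full adder on 0/1 inputs; returns (s, co)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def add4(a, b, cin=0):
    """a, b are 4-bit integers; returns (sum, cout, carries), carries[i] = c(i)."""
    carries = []
    total = 0
    c = cin
    for i in range(4):                       # bit 0 first, like a0 .. a3
        s, c = fadd((a >> i) & 1, (b >> i) & 1, c)
        total |= s << i
        carries.append(c)                    # carry out of stage i feeds stage i+1
    return total, c, carries

s, cout, c = add4(0b1011, 0b1001)
print(format(s, "04b"), cout, c)   # 0100 1 [1, 1, 0, 1]
```

This is the binary addition example 1011 + 1001 from the start of
the lecture: four sum bits 0100 and cout = 1.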


The VHDL circuit was simulated with
a(3)=0, a(2)=0, a(1)=0, a(0)=1   cin=0
b(3)=1, b(2)=1, b(1)=1, b(0)=1

There is a small delay time from the input to the output.
When a circuit is simulated, the initial values of signals
are shown as 'U' for uninitialized. As the circuit simulation
proceeds, the 'U' are computed and become '0' or '1'.
Partial output from the VHDL simulation shows this propagation.
(the upper line is logic '1', the lower line is logic '0')

s(0)  UU_____________________________
                                     
s(1)  UUUUUU_________________________
                                     
s(2)  UUUUUUUUUU_____________________
                                     
s(3)  UUUUUUUUUUUUUU_________________
         ____________________________
c(0)  UU                             
             ________________________
c(1)  UUUUUU                         
                 ____________________
c(2)  UUUUUUUUUU                     
                     ________________
c(3)  UUUUUUUUUUUUUU                 

At the end of the simulation the values are:
s(0)=0, s(1)=0, s(2)=0, s(3)=0, c(0)=1, c(1)=1, c(2)=1, c(3)=1
 

The full VHDL code is   add_trace.vhdl

The run file is         add_trace.run

The full output file is add_trace.out

A fragment of the Makefile is Makefile.add_trace

The Verilog code is
add4.v
add4_v.out
The Verilog output ran three cases:
add4.v running
a=1011, b=1000, cin=1, sum=0100, cout=1
a=0000, b=0000, cin=0, sum=0000, cout=0
a=1111, b=1111, cin=1, sum=1111, cout=1
L52 "add4.v": $finish at simulation time 15

Subtract
Given that the computer can "add" it now has to be able to "subtract."
Thus, a representation has to be chosen for negative numbers.
All computers have chosen the left most bit (also called the
high-order bit) to be the sign bit. The convention is that a '1'
in the sign bit means negative, a '0' in the sign bit means positive.
Within these conventions, three representations have been used
in computers: two's complement, one's complement and sign magnitude.
All bits are shown for 4-bit words in the table below.

 decimal   twos complement  ones complement  sign magnitude
       0      0000            0000             0000
       1      0001            0001             0001
       2      0010            0010             0010
       3      0011            0011             0011
       4      0100            0100             0100
       5      0101            0101             0101
       6      0110            0110             0110
       7      0111            0111             0111
      -8      1000             -                -
      -7      1001            1000             1111
      -6      1010            1001             1110
      -5      1011            1010             1101
      -4      1100            1011             1100
      -3      1101            1100             1011
      -2      1110            1101             1010
      -1      1111            1110             1001
      -0       -              1111             1000
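
The rows of the table can be regenerated with a few lines of Python.
This is a sketch; encode4 is a name made up here, and it returns None
where the table shows "-". Negative zero is not a Python integer, so
the -0 row is not produced.

```python
def encode4(n):
    """Return 4-bit strings (twos, ones, sign_magnitude) for integer n."""
    def bits(v):
        return format(v & 0xF, "04b")
    twos = bits(n) if -8 <= n <= 7 else None
    if -7 <= n <= 7:
        ones = bits(n) if n >= 0 else bits(~(-n))            # invert all bits of |n|
        sign_mag = bits(n) if n >= 0 else bits(0b1000 | -n)  # sign bit plus |n|
    else:
        ones = sign_mag = None
    return twos, ones, sign_mag

for n in (5, -5, -1, -8):
    print(n, encode4(n))
```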

We could choose to build a subtractor that uses a borrow, yet
this would require as many gates as were needed for the adder.
By choosing the two's complement representation of negative
numbers, the existing adder plus a relatively low gate count
multiplexor and inverter can become a subtractor. The implementation
follows the definition of a negative number in two's complement
representation: invert the bits and add one.


Given a new symbol for an adder, the complete circuit for
doing 4-bit add and subtract becomes:



When the signal "subtract" is '1' the circuit subtracts 'b' from 'a'.
When the signal "subtract" is '0' the circuit adds 'a' to 'b'.

The basic circuit is written for VHDL as:

  a4: entity work.add4 port map(a, b_mux, subtract, sum, cout);
  i4: b_bar <= not b;
  m4: entity work.mux4 port map(b, b_bar, subtract, b_mux);

The general rule is that each circuit component symbol on
a schematic diagram will become one VHDL statement.
There are many other VHDL statements needed to run a complete
simulation.

The annotated output of the simulation is:

subtract=0, a=0100, b=0010, sum=0110  4+2=6
subtract=1, a=0100, b=0010, sum=0010  4-2=2
subtract=0, a=1100, b=0010, sum=1110  (-4)+2=(-2)
subtract=1, a=1100, b=0010, sum=1010  (-4)-2=(-6)
subtract=0, a=1100, b=1110, sum=1010  (-4)+(-2)=(-6)
subtract=1, a=1100, b=1110, sum=1110  (-4)-(-2)=(-2)
subtract=0, a=0011, b=1110, sum=0001, 3+(-2)=1
subtract=1, a=0011, b=1110, sum=0101, 3-(-2)=5
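
The add/subtract circuit can be modeled in Python to reproduce the
sums listed above. This is a behavioral sketch (addsub4 is an
illustrative name), using integer addition in place of the four
full adders.

```python
def addsub4(a, b, subtract):
    """4-bit add (subtract=0) or subtract (subtract=1); returns (sum, cout)."""
    b_mux = (~b & 0xF) if subtract else b    # mux picks b or "not b"
    total = a + b_mux + subtract             # "subtract" is also carry-in: the "add one"
    return total & 0xF, (total >> 4) & 1

cases = [(0b0100, 0b0010), (0b1100, 0b0010), (0b1100, 0b1110), (0b0011, 0b1110)]
for a, b in cases:
    for subtract in (0, 1):
        s, _ = addsub4(a, b, subtract)
        print(f"subtract={subtract}, a={a:04b}, b={b:04b}, sum={s:04b}")
```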


The full VHDL code is   sub4.vhdl

The run file is         sub4.run

The full output file is sub4.out

A fragment of the Makefile is Makefile.sub4

Somewhat similar Verilog code, using 4 bit mux
sub4.v
sub4_v.out

Checking both add and subtract:
sub4.v running
add
a=1011, b=1000, cin=1, sum=0100, cout=1
in0=1000, in1=0111, ctl=0, b=1000
subtract
a=1011, b=0111, cin=1, sum=0011, cout=1
in0=1000, in1=0111, ctl=1, b=0111
add
a=0000, b=0000, cin=0, sum=0000, cout=0
in0=0000, in1=1111, ctl=0, b=0000
subtract
a=0000, b=1111, cin=0, sum=1111, cout=0
in0=0000, in1=1111, ctl=1, b=1111
add
a=1111, b=1111, cin=1, sum=1111, cout=1
in0=1111, in1=0000, ctl=0, b=1111
subtract
a=1111, b=0000, cin=1, sum=0000, cout=1
in0=1111, in1=0000, ctl=1, b=0000
L113 "sub4.v": $finish at simulation time 30


Homework 5 is assigned

Lecture 20 Input Output devices


Keyboard control
display from typing some keys
test_keysym.py window gets and shows text and number, 16 bits 
keysym=a, keysym_num=97    
keysym=Shift_L, keysym_num=65505  
keysym=A, keysym_num=65  
keysym=z, keysym_num=122  
keysym=Shift_L, keysym_num=65505  
keysym=Z, keysym_num=90  
keysym=0, keysym_num=48  
keysym=9, keysym_num=57  
keysym=Return, keysym_num=65293
Printable keys report their character code (a is 97, A is 65).
Non-printing keys such as Shift_L and Return have 65xxx values (0xFFxx in hex).

Additional information on keycodes and keysym are here  


Input Output devices


some review:
Multiplication and division are taught in elementary school, yet
they are still being worked on for computer applications.

The earliest computers just provided add and subtract with
conditional Branch, leaving the programmer to write multiply
and divide subroutines.

Early computers used bit-serial methods that required about
N squared clock times for multiplying or dividing N-bit numbers.

With a parallel adder, the time for multiply was reduced to
N/2 clock times (Booth algorithm) and division N clock times.

Today's computers use parallel, combinational, circuits for
multiply and divide. These circuits still take too long for
signals to propagate in one clock time. The combinational
circuits are "pipelined" so that a multiply or divide can be
completed every clock time.

Consider multiplying unsigned numbers  1010 * 1100  (10 times 12)
Using a hand method would produce:
      1010
    * 1100
 ---------
      0000  <- think of the multiplier bit being "anded" with
     0000      the multiplicand. A 1-bit "and" in digital logic
    1010       is like a 1-bit "multiply". 
   1010
 ---------
  01111000  4-bits times 4-bits produces an 8-bit product
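
The hand method is a shift-and-add loop. A minimal Python sketch
(mul4 here is just a model for these notes, not the mul4.v file):

```python
def mul4(multiplicand, multiplier):
    """Unsigned 4-bit by 4-bit multiply; returns the 8-bit product."""
    product = 0
    for i in range(4):
        bit = (multiplier >> i) & 1
        partial = multiplicand if bit else 0   # multiplier bit "anded" with multiplicand
        product += partial << i                # each row is shifted one more place left
    return product & 0xFF

print(format(mul4(0b1010, 0b1100), "08b"))   # 01111000
```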

When adding by hand, we can add the middle column's four bits and
produce a sum bit and possibly a carry. In hardware the number
of input bits is fixed. From the previous lecture, we could use
four 4-bit adders with additional "and" gates to do the multiply.
A better design incorporates the "and" gate to do a 1-bit multiply
inside the previous lecture's full adder. With this single building
block, that is easy to replicate many times, we get the following
parallel multiplier design.
 
  The 4-bit by 4-bit multiply to produce an 8-bit unsigned product is

  
  

  The component  madd  circuit is

   

VHDL implementation is
   The VHDL source code is pmul4.vhdl
   The VHDL test driver is pmul4_test.vhdl
   The VHDL output is pmul4_test.out

   The Cadence run file is pmul4_test.run

   The partial Makefile is Makefile.pmul4_test


Verilog implementation is
   The Verilog source code is mul4.v
   The Verilog output is mul4_v.out


  Notice that the only component used to build the multiplier
  is "madd" and some uses of "madd" have constants as inputs.
  It is technology dependent whether the same circuit is used
  or specialized, minimized, circuits are substituted.

Division is performed by using subtraction. A sample unsigned binary
division of an 8-bit dividend by a 4-bit divisor that produces
a 4-bit quotient and 4-bit remainder is:

                1010  <- quotient
          /---------
     1100/  01111011  <- dividend
            -1100
            -----
              0110
             -0000
             ------
               1101
              -1100
             ------
                0011  
               -0000
               -----
                0011 <- remainder
 
With a parallel adder and a double length register, serial division
can be performed. Conventional division requires a trial subtraction
and possibly a restore of the partial remainder. A non restoring
serial division requires N clock times for a N-bit divisor.
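
The hand division above is the restoring method: try a subtraction
and keep the old partial remainder when the result would go negative.
A Python sketch, assuming an 8-bit dividend and a 4-bit divisor
(div8by4 is an illustrative name, and raising OverflowError is a
choice made for this sketch, not something the hardware does):

```python
def div8by4(dividend, divisor):
    """Unsigned 8-bit / 4-bit restoring division -> (quotient, remainder)."""
    if divisor == 0 or (dividend >> 4) >= divisor:
        raise OverflowError("quotient does not fit in 4 bits")
    remainder = dividend >> 4                # start with the high half of the dividend
    quotient = 0
    for i in range(3, -1, -1):               # one step per quotient bit
        remainder = (remainder << 1) | ((dividend >> i) & 1)   # bring down next bit
        trial = remainder - divisor          # trial subtraction
        if trial >= 0:
            remainder = trial                # keep the result, quotient bit is 1
            quotient |= 1 << i
        # else: "restore" by leaving remainder unchanged, quotient bit is 0
    return quotient, remainder

q, r = div8by4(0b01111011, 0b1100)
print(format(q, "04b"), format(r, "04b"))   # 1010 0011
```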

  The schematic for a parallel 8-bit dividend divided by 4-bit divisor
  to produce a 4-bit quotient and 4-bit remainder is:

  


Notice that the building block is similar to the 'madd' component
in the parallel multiplier. The 'cas' component is the same full
adder with an additional xor gate.

   The VHDL test driver is divcas4_test.vhdl
   The VHDL output is divcas4_test.out
   The Cadence run file is divcas4_test.run
   The partial Makefile is Makefile.divcas4_test

   The Verilog code is div4.v
   The Verilog output is div4_v.out


Divide can create an overflow condition. This is typically handled by
separate logic in order to keep the main circuit neat. There is a
one bit preshift of the dividend in the manual, serial and parallel
division. Thus, no dividend bit number seven appears on the parallel
schematic.

Lecture 21 Input Output 3 more devices


Serial bus, some were 9 pin, some were 25 pin, replaced by USB.
A modem (modulator-demodulator) was used to give a computer access to
remote places over phone lines. Some were initially called blackboards,
before the Internet. Modems use two way serial bit transmission.
The serial bus went through many speed increases shown with
approximate dates in the following table:

approx    speed         used by me
date
1958         110 baud   
1962         300 baud   hand set phone
1972       1,200 baud   plug into phone line
           2,400 baud
           4,800 baud
           9,600 baud   hard wired to computer to display
1991      14,400 baud   not used (called "14 dot 4" kilobaud)
          28,800 baud
1998      33,600 baud
Along the way there were many standards V.32 V.42 V.70  etc.
Bits were sent serially: start-bit, b1 b2 b3 b4 b5 b6 b7 b8, parity, stop-bit,
which is 11 bits per byte. Thus 110 baud sent 10 bytes per second.
No problem for keyboard typing, slow for display. A UART does the control.
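
The 11 bit frame and the resulting byte rate can be sketched in Python.
The details below are assumptions, not taken from these notes: the
start bit is taken as 0, the stop bit as 1, data is sent low-order
bit first, and parity is even by default. The function names are
illustrative.

```python
def frame_byte(byte, even_parity=True):
    """Return the 11 line bits: start, 8 data bits (low-order first), parity, stop."""
    data = [(byte >> i) & 1 for i in range(8)]
    ones = sum(data) % 2
    parity = ones if even_parity else 1 - ones   # make the count of 1 bits even (or odd)
    return [0] + data + [parity] + [1]

def bytes_per_second(bit_rate, bits_per_frame=11):
    """Bytes transferred per second when every byte costs a whole frame."""
    return bit_rate / bits_per_frame

print(frame_byte(0x41))          # [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(bytes_per_second(110))     # 10.0
```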

Input Output 3 more devices


some review:
A Karnaugh map, K-map, is a visual representation of a Boolean function.
The plan is to recognize patterns in the visual representation and
thus find a minimized circuit for the Boolean function.

There is a specific labeling for a Karnaugh map for each number
of circuit input variables. A Karnaugh map consists of squares where
each square represents a minterm. Notice that only one variable
changes between any two horizontally or vertically adjacent squares.
Remember that a minterm is an input pattern for which there is a '1'
in the output of a truth table.

After the map is drawn and labeled, a '1' is placed in each square
corresponding to a minterm of the function. Later an 'X' will be
allowed for "don't care" minterms. By convention, no zeros are
written into the map.

Having a filled in map, visual skills and intuition are used to
find the minimum number of rectangles that enclose all the ones.
The rectangles must have sides that are powers of two.  No
rectangle is allowed to contain a blank square. The map is a toroid
such that the top row is logically adjacent to the bottom row and
the right column is logically adjacent to the left column. Thus
rectangles may wrap around the edges of the two dimensional map.

The resulting minimized boolean function is written as a sum of
products. Each rectangle represents a product, "and" gate, and
the products are summed, "or" gate, to produce the result. A rectangle
that contains both a variable and its complement does not have
that variable in the product term; omit the variable as an input
to the "and" gate.


   Basic labeling    Minterm numbers     Minterms 

        B=0 B=1           B=0 B=1           B=0 B=1
       +---+---+         +---+---+         +---+---+
   A=0 |   |   |     A=0 |m0 |m1 |     A=0 |__ |_  |
       +---+---+         +---+---+         |AB |AB |
   A=1 |   |   |     A=1 |m2 |m3 |         +---+---+
       +---+---+         +---+---+     A=1 | _ |   |
                                           |AB |AB |
                                           +---+---+

 Truth table        Karnaugh map    Covering with rectangles

   A B | F             B=0 B=1            B=0   B=1
   ----+--            +---+---+         +-----+-----+
   0 0 | 0        A=0 |   | 1 |         |     |+---+|
   0 1 | 1  m1        +---+---+     A=0 |     || 1 ||
   1 0 | 1  m2    A=1 | 1 |   |         |     |+---+|
   1 1 | 0            +---+---+         +-----+-----+
                                        |+---+|     |
                                    A=1 || 1 ||     |
                                        |+---+|     |
                                        +-----+-----+
                            _     _
   Minimized function   F = AB + AB

   Note: For each covering rectangle, there will be exactly one
   product term in the final equation for the function.
   Find the variable(s) that are both 1 and 0 in the rectangle.
   Such variables will not appear in the product term. Take any
   minterm from the covering rectangle, replace 1 with the variable,
   replace 0 with the complement of the variable. Cross out the
   variables that do not appear. The result is exactly one product
   term needed by the final equation of the function.
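
   The rule in the note above can be written as a short Python sketch.
   The function name product_term is made up here, a trailing
   apostrophe stands for the complement bar, and the function assumes
   the minterm numbers it is given really do form one valid rectangle.

```python
def product_term(minterms, names="AB"):
    """Product term for one covering rectangle, given its minterm numbers."""
    n = len(names)
    term = ""
    for pos, name in enumerate(names):
        shift = n - 1 - pos                    # first name is the high-order bit
        values = {(m >> shift) & 1 for m in minterms}
        if len(values) == 2:
            continue                           # both 0 and 1 in the rectangle: drop it
        term += name if values == {1} else name + "'"
    return term

# F above is covered by two 1x1 rectangles, m1 and m2
print(" + ".join(product_term([m]) for m in (1, 2)))   # A'B + AB'
```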





It is possible to have minterms that are don't care. For these
minterms, place an "X" or "-" in the Karnaugh map rather than
a one. The covering follows the obvious extended rule.
Covering rectangles may include any don't care squares.
Covering rectangles do not have to include don't care squares.
No rectangle can enclose only don't care squares.

Quine-McCluskey minimization

A tabular algorithm for producing the minimum two level sum of products
is known as the Quine-McCluskey method.

You may download and build the software that performs this minimization.
qm.tgz or link to a Linux executable
ln -s /afs/umbc.edu/users/s/q/squire/pub/linux/qm qm

The man page, qm.1 , is in the same directory.

The algorithm may be performed manually using the following steps:
1) Have available the minterms of the function to be minimized.
   There may be X's for don't care cases.

2) Create groups of minterms, starting with the minterms with the
   fewest number of ones.
   All minterms in a group must have the same number of ones and
   if any X's, the X's must be in the same position. There may be
   some groups with only one minterm.

3) Create new minterms by combining minterms from groups that
   differ by a count of one. Two minterms are combined if they
   differ in exactly one position. Place an X in that position
   of the newly created minterm. Mark the minterms that are
   used in combining (they will be deleted at the end of this step).
   Basically, take the first minterm from the first group. Compare
   this minterm to all minterms in the next group(s) that have
   one additional one. Repeat working until the last group is reached.

4) Delete the marked minterms.

5) Repeat steps 2) 3) and 4) until no more minterms are combined.

6) The minimized function is the remaining minterms, deleting any
   X's.

Example:
1) Given the minterms
  A B C D | F
  --------+--
  0 0 0 0 | 1  m0
  0 0 1 0 | 1  m2
  1 0 0 0 | 1  m8
  1 0 1 0 | 1  m10

2) Create groups
   m0  0 0 0 0   count of 1's is 0 
       -------
   m2  0 0 1 0   count of 1's is 1
   m8  1 0 0 0
       -------
   m10 1 0 1 0   count of 1's is 2

3) Create new minterms by combining
   Compare all in first group to all in second group
   m0 to m2  0 0 0 0
             0 0 1 0
             =======  they differ in one position
             0 0 X 0  combine and put an X in that position

   m0 to m8  0 0 0 0
             1 0 0 0
             =======  they differ in one position
             X 0 0 0  combine and put an X in that position

  Compare all in second group to all in third group
  m2 to m10  0 0 1 0
             1 0 1 0
             =======  they differ in one position
             X 0 1 0  combine and put an X in that position

  m8 to m10  1 0 0 0
             1 0 1 0
             =======  they differ in one position
             1 0 X 0  combine and put an X in that position

  no more candidates to compare.

4) Delete marked minterms (those used in any combining)
   (do not keep duplicates) Thus the minterms are now:
   0 0 X 0
   X 0 0 0
   X 0 1 0
   1 0 X 0

2) Repeat grouping (technically there are four groups, although
   the number of ones is either zero or one).
   0 0 X 0
   -------
   X 0 0 0
   -------
   X 0 1 0
   -------
   1 0 X 0

3) Create new minterms by combining
   0 0 X 0
   1 0 X 0  any X's must be the same in both
   =======  they differ in one position
   X 0 X 0  combine and put an X in that position

   X 0 0 0
   X 0 1 0
   =======  they differ in one position
   X 0 X 0  combine and put an X in that position

4) Delete marked minterms (those used in any combining)
   (do not keep duplicates) Thus the minterms are now:
   X 0 X 0

5) No more combining is possible.

6) The minimized function is the remaining minterms, deleting any
   X's. All remaining minterms are prime implicants

   A B C D            __
   X 0 X 0   thus F = BD
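
The combining steps of the example can be sketched in Python. This is
not the "qm" program: it is a simplified illustration that compares
every pair of patterns instead of grouping by the count of ones, and
it stops once the prime implicants are found. The function names are
made up for this sketch.

```python
from itertools import combinations

def combine(p, q):
    """Merge two patterns that differ in exactly one non-X position, else None."""
    diff = [i for i, (x, y) in enumerate(zip(p, q)) if x != y]
    if len(diff) == 1 and "X" not in (p[diff[0]], q[diff[0]]):
        i = diff[0]
        return p[:i] + "X" + p[i + 1:]         # put an X in the differing position
    return None

def prime_implicants(minterms, nbits):
    """Repeat steps 3) and 4) until no more patterns can be combined."""
    terms = {format(m, f"0{nbits}b") for m in minterms}
    primes = set()
    while terms:
        used, new_terms = set(), set()
        for p, q in combinations(sorted(terms), 2):
            merged = combine(p, q)
            if merged:
                new_terms.add(merged)          # a set drops duplicates
                used.update((p, q))            # mark both as used
        primes |= terms - used                 # unmarked patterns are prime implicants
        terms = new_terms
    return primes

print(prime_implicants([0, 2, 8, 10], 4))   # {'X0X0'}
```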

In essence, the Quine-McCluskey algorithm is doing the same
operations as the Karnaugh map. The difference is that no guessing
is used in the Quine-McCluskey algorithm and "qm", as it is called,
can be (and has been) implemented as a computer program.

A final note on labeling:
It does not matter what names are used for variables.
It does not matter in what order variables are used.
It does not matter if "-" or "X" is used for don't care.
It is important to keep a consistent relation between the bit
positions in minterms and the order of variables.

More information is at Simulators and parsers

Lecture 22 Hardware Interrupts


Hardware Interrupts


some review:
We now focus on sequential logic. Logic with storage and state.
The previous lectures were on combinational logic, gates.

In order to build very predictable large digital logic systems,
synchronous design is used. A synchronous system has a special
signal called a master clock. The clock signal continuously
has values 0101010101010 ... . This is usually just a square
wave generator at some frequency. A clock with frequency 1 GHz
has a period of 1 ns. Half of the period the clock is a logical 1
and the other half of the clock period the clock is a logical 0. 

           ___     ___     ___
 clk   ___|   |___|   |___|

      |< 1 ns>|

   The VHDL code fragment to generate the  clk  signal is:
        signal clk : std_logic := '0';
      begin
        clk <= not clk after 500 ps;


A synchronous system is designed with registers that input a
value on a rising clock edge and hold the signal until the next
rising clock edge. The designer must know the timing of
combinational logic because the signals must propagate through
the combinational logic in less than a clock time.

Combinational logic can not have loops or feedback.
Sequential logic is specifically designed to allow loops and
feedback. The design rule is that any loop or feedback must
include a storage element (register) that is clocked.

      +------------------------------------+
      |                                    |
      |  +---------------+   +----------+  |
      +->| combinational |-->| register |--+
         | logic         |   |          |
         +---------------+   +----------+
                                  ^
                                  | clock signal


A register may be many bits and each bit is built from a flip flop.
A flip flop is ideally either in a '1' state or a '0' state.
The most primitive flip flop is called a latch. A latch can be made
from two cross coupled nand gates. The latch is not easy to work
with in large circuits, thus JK flip flops and D flip flops are
typically used. In modern large scale integrated circuits, the 
flip flops and thus the registers are designed at the device level.

A classical model of a JK flip flop is



On the rising edge of the clock signal,
   if only J='1' the Q output is set to '1'
   if only K='1' the Q output is set to '0'
   if both J and K are '1', the Q signal is inverted
   if both J and K are '0', the Q signal is unchanged.

Note that Q_BAR is the complement of Q in the steady state.
There is a transient time when both could be '1' or both could be '0'.
The SET signal is normally '1' yet can be set to '0' for a short
time in order to force Q='1' (set the flip flop). 
The RESET signal is normally '1' yet can be set to '0' for a short
time in order to force Q='0' (reset the flip flop or register to zero).
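
The behavior described above can be written as a next-state function.
This Python sketch is a behavioral model only (jkff_next is an
illustrative name). It ignores the transient time, and giving RESET
priority when both SET and RESET are '0' is an assumption of the sketch.

```python
def jkff_next(q, j, k, set_n=1, reset_n=1):
    """Next value of Q; set_n and reset_n are active low and override J and K."""
    if reset_n == 0:
        return 0              # forced reset (assumed to have priority)
    if set_n == 0:
        return 1              # forced set
    if j and k:
        return 1 - q          # both '1': invert Q
    if j:
        return 1              # only J: set
    if k:
        return 0              # only K: clear
    return q                  # both '0': hold

q = 0
for j, k in [(1, 0), (0, 0), (1, 1), (1, 1), (0, 1)]:
    q = jkff_next(q, j, k)    # one rising clock edge per step
    print(f"J={j} K={k} -> Q={q}")
```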

A slow counter, called a ripple counter, can be made from JK flip
flops using the following circuit:



The VHDL source code for the entity JKFF, the JK flip flop,
and the four bit ripple counter is jkff_cntr.vhdl

The Cadence run file is jkff_cntr.run

The Cadence output file is jkff_cntr.out

ncsim: 04.10-s017: (c) Copyright 1995-2003 Cadence Design Systems, Inc.
ncsim> run 340 ns
q3, q2, q1, q0  q3_ q2_ q1_ q0_ clk
0   0   0   0   1   1   1   1   1  at  10 NS
0   0   0   1   1   1   1   0   1  at  30 NS
0   0   1   0   1   1   0   1   1  at  50 NS
0   0   1   1   1   1   0   0   1  at  70 NS
0   1   0   0   1   0   1   1   1  at  90 NS
0   1   0   1   1   0   1   0   1  at  110 NS
0   1   1   0   1   0   0   1   1  at  130 NS
0   1   1   1   1   0   0   0   1  at  150 NS
1   0   0   0   0   1   1   1   1  at  170 NS
1   0   0   1   0   1   1   0   1  at  190 NS
1   0   1   0   0   1   0   1   1  at  210 NS
1   0   1   1   0   1   0   0   1  at  230 NS
1   1   0   0   0   0   1   1   1  at  250 NS
1   1   0   1   0   0   1   0   1  at  270 NS
1   1   1   0   0   0   0   1   1  at  290 NS
1   1   1   1   0   0   0   0   1  at  310 NS
0   0   0   0   1   1   1   1   1  at  330 NS
      ________________________________________________________________
reset                                                                 
       _   _   _   _   _   _   _   _   _   _   _   _   _   _   _   _  
clk   | |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_| |_
          ___     ___     ___     ___     ___     ___     ___     ___ 
q0    ___|   |___|   |___|   |___|   |___|   |___|   |___|   |___|   |
              _______         _______         _______         _______ 
q1    _______|       |_______|       |_______|       |_______|       |
                      _______________                 _______________ 
q2    _______________|               |_______________|               |
                                      _______________________________ 
q3    _______________________________|                               |

Ran until 340 NS + 0
ncsim> exit
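
The counting sequence can be reproduced with a small behavioral model.
The Python sketch below assumes the usual ripple wiring, which is
inferred from the waveform and not read from the schematic: every
stage has J = K = '1', and a stage toggles when the stage before it
falls from '1' to '0'. The name ripple_count is illustrative.

```python
def ripple_count(q, nbits=4):
    """Apply one rising clock edge to a ripple counter holding value q."""
    bits = [(q >> i) & 1 for i in range(nbits)]
    for i in range(nbits):
        bits[i] ^= 1             # J = K = '1': this stage toggles
        if bits[i] == 1:         # output rose, so the next stage is not clocked
            break                # the ripple stops here
    return sum(b << i for i, b in enumerate(bits))

q = 0
for edge in range(17):
    print(" ".join(format(q, "04b")))   # q3 q2 q1 q0
    q = ripple_count(q)
```

In hardware each stage adds its own delay before the next stage
toggles, which is why the ripple counter is called slow.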

In many designs, only one input is needed and the resulting flip flop
is a D flip flop. A D flip flop needs 6 nand gates rather than the
9 nand gates needed by the JK flip flop. There is a proportional
reduction in devices when the flip flop is designed from basic
transistors.



The VHDL source code for the entity DFF, the D flip flop,
and the four bit counter is dff_cntr.vhdl

The Cadence run file is dff_cntr.run

The Cadence output file is dff_cntr.out


The VHDL source code for the D flip flops:
dff.vhdl
Entity dff1 is five nand model
Entity dff2 is six  nand model

The Cadence run file is dff.run

The Cadence output file is dff.out

Similar Verilog source code
dff.v
test_dff.v
test_dff_v.out


The D flip flop ripple counter


Timing somewhat critical.

The D flip flop synchronous counter
More stable.


E is enable, '1' allows counting
z is q0, Z is d0, "and" output is a1
y is q1, Y is d1, "and" output is a2  etc.

Cadence verilog source file is dffs_cntr.v

Cadence verilog output file is dffs_cntr_v.out

Lecture 23 Disc Drum CD


Disc Drum CD

Take a look inside the hard drive being passed around.




Mine is bigger than yours.

How fast can you read a block of data?
There are four time components that must be known to answer
this question.
1) The time for the read head to get to the required track.
   This is seek time.
2) The time for the disk to rotate to start reading the first byte.
   This is the rotational delay time.
3) The time to transfer the data from the disk to your RAM.
   This is the transfer time.
4) Overhead that can be from software, application, OS or drivers.
   This is overhead time.

Seek time
The head may be on any track, thus there is seek time
before any data can be read. The manufacturer's published
average seek time is standardized as the time to go from
track 0 to the middle track, measured in milliseconds.
In the 1990's the size of disk had become large enough
such that the measured average seek time was 1/4 the
published average seek time. We use 1/4 the published average
seek time for our homework and exams. For your computer,
having a hard drive with capacity over 200GB, I suggest using
1/8 the published average seek time for your estimates. The reason
is that the files you are working with tend to cluster, thus
you rarely will have a seek traveling 1/4 the tracks on the disk.
For my example below, the published average seek time was 5.4 ms
and thus 5.4/4 = 1.35 ms, rounded to 1.4 ms, is used below.

Rotational delay time
The disk is spinning at a known Revolutions Per Minute, RPM.
We deal in seconds, thus divide the RPM by 60 to get
Revolutions Per Second, RPS. 

How long, on average, does it take for the read head to reach
data? This is the rotational delay time and only depends on
the RPS. On average the time will be the time for 1/2 of a
revolution, thus  1/2 * 1/RPS . Typically expressed in
milliseconds, ms.  Some values are:

   RPM   RPS  1/2 * 1/RPS
                 seconds  milliseconds
  3600    60     0.00833   8.33
  5400    90     0.00555   5.55
  7200   120     0.00416   4.16
10,025   167     0.00299   3.00
15,000   250     0.00200   2.00

Transfer time
The time to transfer data depends on the bandwidth, typically
given in Megabytes per second. The disk drive has internal RAM
and usually can deliver a continuous stream of bytes at near
the maximum transfer rate. The transfer may be slowed by your
computer's system bus or your RAM or other contention for the
system bus to RAM path. The example below uses an 80MB/s
transfer rate. Thus 80MB can be transferred in one second.

Overhead time
The overhead time is estimated at 0.6 ms.

Example
  How long does it take to read
  a file from disk? (example calculation)

  time = average seek time +
         average rotational delay +
         transfer time +
         overhead

  published average seek = 5.4 ms
  "average" seek = 5.4/4     = 1.4ms

  10,025 RPM  or 167 RPS
  1/2 * 1/167 = .00299 sec   = 3.0ms

  Overhead assumed           = 0.6ms

  Size independent delay, sum= 5.0ms

  At 80 MB/sec transfer rate:

  10KB   100KB   1MB   10MB

  0.125  1.25   12.5   125.  transfer time in ms
  5.0    5.0     5.0     5.0
  _____  ____   ____   _____

  5.125  6.25   17.5   130.0 ms

  This is a one block "first read"
  The next read could be buffered

Notice that on small files, the latency times 1), 2) and 4) dominate.
On large files the transfer time, 3), dominates. Many years ago most
files were around 10 kilobytes. Today 1 to 10 megabytes is typical and
files in the tens of megabytes are common.
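
The example calculation can be sketched as a small program. This is a
minimal sketch in Python, not one of the course files. The function names
and default arguments are assumptions chosen to match the example
(5.4 ms published seek, 10,025 RPM, 80 MB/s, 0.6 ms overhead). It keeps
unrounded values, so the size independent delay comes out near 4.94 ms
rather than the 5.0 ms obtained from the rounded terms.

```python
def rotational_delay_ms(rpm):
    """Average rotational delay: the time for half a revolution, in ms."""
    rps = rpm / 60.0
    return 0.5 / rps * 1000.0


def transfer_ms(size_kb, rate_mb_per_s):
    """Transfer time in ms. With 1 MB = 1000 KB, 1 MB/s equals 1 KB/ms."""
    return size_kb / rate_mb_per_s


def disk_read_ms(size_kb, published_seek_ms=5.4, seek_fraction=0.25,
                 rpm=10025, rate_mb_per_s=80.0, overhead_ms=0.6):
    """Seek + rotational delay + transfer + overhead, all in ms."""
    seek = published_seek_ms * seek_fraction   # 1/4 of the published value
    return (seek + rotational_delay_ms(rpm)
            + transfer_ms(size_kb, rate_mb_per_s) + overhead_ms)


if __name__ == "__main__":
    for size_kb in (10, 100, 1000, 10000):
        print(f"{size_kb:6d} KB  {disk_read_ms(size_kb):8.3f} ms")
```

Changing seek_fraction to 0.125 gives the 1/8 estimate suggested for
large modern drives.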

I ran a benchmark that reads 1KB, 10KB, 100KB, and 1MB blocks of data
from a 10MB file.

file time_io.c  check how much is cached in ram
                assumed pre-existing data file  time_io.dat
                created by running  time_io_init

One computer's output:
time_io.c 10MB file, read 1KB, 10KB, 100Kb, 1MB 
On rebooted machine, first read 
first read time 0.12 seconds 
more reads, cached? consistent? 
2 read time 0.06 seconds for 1KB block 
3 read time 0.06 seconds for 1KB block 
4 read time 0.06 seconds for 1KB block 
5 read time 0.06 seconds for 1KB block 
6 read time 0.06 seconds for 1KB block 
7 read time 0.06 seconds for 1KB block 
8 read time 0.06 seconds for 1KB block 
9 read time 0.05 seconds for 1KB block 
more reads, cached? consistent? 
2 read time 0.05 seconds for 10KB block 
3 read time 0.05 seconds for 10KB block 
4 read time 0.04 seconds for 10KB block 
5 read time 0.05 seconds for 10KB block 
6 read time 0.05 seconds for 10KB block 
7 read time 0.05 seconds for 10KB block 
8 read time 0.05 seconds for 10KB block 
9 read time 0.05 seconds for 10KB block 
more reads, cached? consistent? 
2 read time 0.08 seconds for 100KB block 
3 read time 0.07 seconds for 100KB block 
4 read time 0.09 seconds for 100KB block 
5 read time 0.07 seconds for 100KB block 
6 read time 0.07 seconds for 100KB block 
7 read time 0.06 seconds for 100KB block 
8 read time 0.08 seconds for 100KB block 
9 read time 0.08 seconds for 100KB block 
more reads, cached? consistent? 
2 read time 0.09 seconds for 1000KB block 
3 read time 0.09 seconds for 1000KB block 
4 read time 0.09 seconds for 1000KB block 
5 read time 0.09 seconds for 1000KB block 
6 read time 0.11 seconds for 1000KB block 
7 read time 0.10 seconds for 1000KB block 
8 read time 0.10 seconds for 1000KB block 
9 read time 0.10 seconds for 1000KB block 

Why did I reboot to run a file read test?
On a computer that is not shut down, a file could
remain in RAM and even partially in cache for days
to weeks, if you were not using the computer.

By now you should know that I do a lot of benchmarking.
I ran the above program on two computers each with two
operating systems with three disk types.

Block  2.5GHz      2.5GHz      1GHz       1GHz
 Size  P4 ATA 100  P4 ATA 100  ATA 66     SCSI 160
       Windows XP  Linux       Windows 98 Linux

  1KB   0.0000015   0.000001    0.000016   0.000004
 10KB   0.000015    0.000010    0.000060   0.000035
100KB   0.000150    0.000100    0.000500   0.000300
  1MB   0.003100    0.002000    0.005000   0.004000

Fine print: CPU time in seconds, most frequent value of eight
measurements after first read. Using fopen, fread, binary
block read. Each measurement read 10MB, e.g. 10 blocks read
at 1MB, 100 blocks read at 100KB, 10,000 blocks read at 1KB.
Other than the first number that is 1.5 microseconds, the
numbers can be read as integer microseconds.

As expected the SCSI disk was faster than the ATA disk.
Note that the faster system clock can allow the actual
transfer rate to be near the maximum while a slower clock
speed can limit the transfer rate. The operating system,
drivers and libraries have some impact on total time. This
is lumped into "overhead."

Where do you find the disk specifications? Both the manufacturer and
some retailers publish the disk specifications, and some prices.

e.g.
evolution specs

2007 hard drives, note cache, RPM, transfer rate



Then SATA replaced ATA
Serial ATA changed the wiring and protocol. ATA had wide flat cables.
Driven by PC manufacturers Dell, Gateway, HP, etc., who needed thinner
cables. Thus higher speed transfer over fewer wires.
Typical SATA bus maximum transfer rate is 3Gb/s, 3 gigabits per second.

Similar latency, similar seek, faster transfer rate.

A single drive with 500GB of storage became available at reasonable cost.

A terabyte of disk storage became practical for a desktop PC.
Now multiple terabyte 6Gb/s disks are available.



Still too slow!


Now, SSD, Solid State Disks
Replace the rotating disk drive with NAND Flash digital logic storage.

 Technology explanation 

 Performance comparisons 

 One technical specification 
Transcend 128GB $229.99  TS128GSSD25S-M
 
 enclosure was needed for desktops, initially 

Check for latest size, speed, cost
computer-SSD-search SSD



Reworking the example above for time to read an SSD file:

Transfer time
The time to transfer data depends on the bandwidth, typically
given in Megabytes per second. The example below uses an 80MB/s
transfer rate. Thus 80MB can be transferred in one second.

Overhead time
The overhead time is estimated at 0.6 ms.
(Much of this is system software, some is I/O hardware time.)

No seek time, no rotational delay time, for SSD

Example
  How long does it take to read
  a file from disk? (example calculation)

  time = transfer time +
         overhead

  At 80 MB/sec transfer rate:

  10KB   100KB   1MB   10MB

  0.125  1.25   12.5   125.  transfer time in ms
  0.6    0.6     0.6     0.6
  _____  ____   ____   _____

  0.725  1.85   13.1   125.6 ms

  This is a one block "first read"
  The next read could be buffered

Notice that on very small files, the overhead time dominates.
On large files the transfer time dominates. Today, files in the
tens of megabytes are common. Many years ago most files were around
10 kilobytes.

The SSD has a speedup of 7.07 for a 10KB file.
The SSD has a speedup of about 1.03 for a 10MB file.

Your mileage may vary.
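
The speedup is the rotating disk time divided by the SSD time for the
same file size. A minimal sketch in Python, assuming the rounded fixed
delays from the two worked examples (5.0 ms for the disk, 0.6 ms for the
SSD) and the same 80 MB/s transfer rate for both; the names are
illustrative only.

```python
HDD_FIXED_MS = 5.0     # seek + rotational delay + overhead, rotating disk
SSD_FIXED_MS = 0.6     # overhead only, no seek and no rotational delay
RATE_MB_PER_S = 80.0   # same transfer rate assumed for both devices


def read_ms(size_kb, fixed_ms):
    """Size independent delay plus transfer time, in ms (1 MB/s = 1 KB/ms)."""
    return fixed_ms + size_kb / RATE_MB_PER_S


def ssd_speedup(size_kb):
    """How many times faster the SSD read is than the rotating disk read."""
    return read_ms(size_kb, HDD_FIXED_MS) / read_ms(size_kb, SSD_FIXED_MS)


if __name__ == "__main__":
    for size_kb in (10, 100, 1000, 10000):
        print(f"{size_kb:6d} KB  speedup {ssd_speedup(size_kb):5.2f}")
```

The speedup shrinks as the file grows because the transfer time, which
is the same for both devices in this model, comes to dominate.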

A typical desktop is executing  4,000,000 instructions per ms, millisecond.


Homework 6

Lecture 24 Busses


Busses


  Examples of Busses   circa 2012 including older  (changes with time)

  Bus name    Max       Max      Max   width  comment
              Mbits     MBytes   MHz
              per sec   per sec

  front side  17,024    2,128    133   128    many possible
              34,048    4,256    133   256
              19,200    2,400    150   128
              85,248   10,656    333   256
             136,448   17,056    533   256
             204,800   25,600    800   256
             225,280   28,160    880   256
             256,000   32,000  1,000   256
             320,000   40,000  1,250   256    (Mac G5)

  AGP          2,112      264     66    32
  AGP8X       17,056     2,132   533    32

  PCI          1,056      132     33    32
  PCI          2,112      264     33    64
  PCI          2,112      264     66    32
  PCI          4,224      528     66    64
  PCI          4,224      528    133    32
  PCI          8,448    1,056    133    64
  PCIX        17,056    2,132    533    32    extended, compatible
  PCIe        64,000    8,000   2000    32    express, one way, full duplex
                                              1,2,4,8,12,16 or 32 lanes
  ATA 100        800      100     25    32
  ATA 133       1064      133     33    32
  ATA 160       1280      160     40    32
  SATA 150      1200      150    600     2    one way, full duplex
  SATA std      1500      187   1500     1    one way, full duplex
                                              limited by motherboard
  SATA II 300   2400      300   1200     2
  SATA II std   3000      375   3000     1    no forcing to build standard
  SATA 3.0      6000      750   6000     1

  SCSI 1          40        5      5     8
  SCSI 2         160       20     10    16
  SCSI 3        1280      160     80    16
  SCSI UW3      2560      320    160    16
  SCSI 320      5120      640    320    16    has cable terminators

  Firewire1394   400       50    400     1
  Firewire1394b  800      100    800     1    many video cameras
  Firewire S16  1600      200   1600     1
  Firewire S32  3200      400   3200     1
  Firewire S80  6400      800   6400     1

  USB 1.1         12        1.5   12     1    slow
  USB 2          480       60    480     1    new cable
  USB 3         3200      400   1600     2    new cable, dual differential
                5000      625   2500     2    new connectors, optional speed
                6400      800   3200     2    micro, mini, etc.

  Fiberchannel  1000      125   1000     1    1062.5
  Fiberchannel  2000      250   2000     1    >mile
  Fibre 16GFC            3200  14000          full duplex 10Km
  Fibre 20GFC            5100  21000          full duplex

  Ethernet 10     10        1.25  10     1        
  Ethernet 100   100       12.5  100     1
  Ethernet 1Gig 1000      125   1000     1
  Ethernet 10G 10000    1,250  10000     1

  ISA            400       50     25    16    really old
  IEEE 1284 ECP    2.5      0.31   0.31  8    half duplex
  printer port

  V.90 56          0.056    0.005  0.056 1    modem, one way, full duplex

  OC-48          2,500                       optical cross country
  OC-192 STM64  10,000                       Optical Carrier
  OC-768 STM256 40,000  5,000    light
              Mbps      MBps     MHz 
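
For many of the parallel busses in the table, the Mbits per second
column is the clock in MHz times the width in bits, and the MBytes per
second column is that value divided by 8. A minimal sketch in Python
under that assumption; the function name and the transfers_per_clock
parameter are hypothetical, and serial busses with encoding overhead do
not follow this simple rule.

```python
def bus_rates(clock_mhz, width_bits, transfers_per_clock=1):
    """Return (Mbits per second, MBytes per second) for a simple parallel bus."""
    mbits = clock_mhz * width_bits * transfers_per_clock
    return mbits, mbits / 8.0


if __name__ == "__main__":
    for name, mhz, width in (("front side", 133, 128),
                             ("front side", 333, 256),
                             ("PCI", 33, 32)):
        mbits, mbytes = bus_rates(mhz, width)
        print(f"{name:12s} {mbits:8,d} Mbits/s {mbytes:10,.0f} MBytes/s")
```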

The speed of light limits the amount of information that can be
sent over a given distance. Many busses have length restrictions.
  Light can travel about
   300,000,000    meters per second
       300,000    meters per millisecond
           300    meters per microsecond
             0.3  meters per nanosecond  (about 1 foot)
             0.3  millimeters per picosecond

Unchanged in last few decades. (slower inside integrated circuit)
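
To connect these distances to bus design, it helps to ask how far a
signal could travel in one clock period. A minimal sketch in Python,
using the rounded 300,000,000 meters per second from the list; real
signals on wires and inside chips are slower, so these are upper bounds.
The function names are illustrative only.

```python
SPEED_OF_LIGHT_M_PER_S = 300_000_000.0   # "about", as in the notes


def distance_m(seconds):
    """Distance light travels in the given time, in meters."""
    return SPEED_OF_LIGHT_M_PER_S * seconds


def distance_per_clock_m(clock_hz):
    """Upper bound on how far a signal can travel in one clock period."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz


if __name__ == "__main__":
    for ghz in (1, 2, 3, 4):
        meters = distance_per_clock_m(ghz * 1e9)
        print(f"{ghz} GHz clock: at most {meters * 100:.1f} cm per clock")
```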

Pentium 4 busses and PCI-X vs PCIe


Note one example of AGP being replaced by PCI-e and the mention
of many "busses" in the advertisement:






A non Intel architecture:
Below is a schematic of a one clock per instruction computer.



The operation for each instruction is:

  The Instruction Pointer Register contains the address of the
  next instruction to be executed.

  The instruction address goes into the Instruction Memory or
  Instruction Cache and the instruction comes out.
  "inst" on the diagram.

  The Instruction Decode has all the bytes of the instruction:

    The instruction has bits for the operation code.
    e.g. there is a different bit pattern for add, sub, etc.

    Most instructions will reference one register. The register
    number has enough bits to select one of the general registers.

    Many instructions have a second register. (Not shown here,
    on some computers there can be three registers.) The second
    (or third) register may be the register number that receives
    the result of the operation.

    Many instructions have either a memory address for an operand or
    a memory offset from a register or immediate data for use by
    the operation. This data is passed into the ALU for use by
    the operation, either for computing a result or computing
    an address.

  The general registers receive two register numbers and very
  quickly output the data from those two registers.

  The ALU receives two data values and control from the
  Operation Code part of the instruction. The ALU computes
  the value and outputs the value on the line labeled "addr".
  This line goes three places: To the mux and possibly into
  the Instruction Pointer if the operation is a jump or a branch.
  To the Data Memory or Data Cache if the value is a computed
  memory address. To the mux that may return the value to a register.

  The Data Memory or Data Cache receives an address and write data.
  Depending on the control signals "write" and "read":
  The Data Memory reads the memory value and sends it to the mux.
  The Data Memory writes the "write data" into memory at
  the memory location "addr".

  The final mux may take a value just read from the Data Memory
  or Data Cache and return that value to a register or
  take the computed value from the ALU and return that value
  to a register.

  While the above signals are propagating, the Instruction Pointer
  is updated by either incrementing by the number of bytes in the
  instruction or from the jump or branch address.

This is one instruction. The clock transitions and the next instruction
is started.
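
The steps above can be sketched as a tiny simulator where each call to
step() is one clock. This is a minimal sketch in Python, not the machine
in the schematic: the instruction format (operation, register a,
register b, immediate), the five operations and the sample program are
all hypothetical, and the Instruction Pointer counts instructions rather
than bytes.

```python
def step(ip, inst, regs, mem):
    """Execute one instruction in one clock; return the next Instruction Pointer."""
    op, ra, rb, imm = inst            # Instruction Decode splits the fields
    a, b = regs[ra], regs[rb]         # general registers output two values
    next_ip = ip + 1                  # default: the following instruction
    if op == "add":
        regs[ra] = a + b              # ALU value returned through the final mux
    elif op == "addi":
        regs[ra] = b + imm            # immediate data passed into the ALU
    elif op == "load":
        regs[ra] = mem[b + imm]       # ALU computes "addr", Data Memory is read
    elif op == "store":
        mem[b + imm] = a              # ALU computes "addr", Data Memory is written
    elif op == "bne":
        if a != b:
            next_ip = imm             # branch address goes to the Instruction Pointer
    else:
        raise ValueError(f"unknown operation {op!r}")
    return next_ip


def run(program, regs, mem, max_clocks=1000):
    """Run until the Instruction Pointer leaves the program; return clocks used."""
    ip = clocks = 0
    while 0 <= ip < len(program) and clocks < max_clocks:
        ip = step(ip, program[ip], regs, mem)
        clocks += 1
    return clocks


SUM_PROGRAM = [
    ("addi", 1, 0, 0),    # r1 = 0, the index
    ("addi", 2, 0, 0),    # r2 = 0, the sum
    ("addi", 3, 0, 4),    # r3 = 4, the limit
    ("load", 4, 1, 0),    # r4 = mem[r1]
    ("add", 2, 4, 0),     # r2 = r2 + r4
    ("addi", 1, 1, 1),    # r1 = r1 + 1
    ("bne", 1, 3, 3),     # if r1 != r3 go back to instruction 3
    ("store", 2, 0, 4),   # mem[4] = r2
]

if __name__ == "__main__":
    regs = [0] * 8
    mem = [10, 20, 30, 40, 0]
    clocks = run(SUM_PROGRAM, regs, mem)
    print("sum stored:", mem[4], "clocks:", clocks)
```

Because every instruction takes exactly one clock, the clock count
equals the number of instructions executed.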

The timing consideration that limits the speed of this design is
the long propagation from the new Instruction Pointer value until
the register is written. Notice that the register is written on
clock_bar and the Data Cache is written on clock_bar. Any real
computer must use instruction and data caches for this design
because RAM memory access is slower than logic on the CPU chip.

Lecture 25 Protected Mode Addressing


Protections are available at many levels on computers.
Computer access may have physical limits, user names and passwords,
blacklisting web addresses, etc.

File access in Unix, Linux has protections:
dir  user  group  other    ls -ltr              
 d   rwx   rwx    rwx      read  write  execute,link
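
The same  d rwx rwx rwx  layout can be read from a numeric mode word.
A minimal sketch in Python using the standard library stat module; the
helper name permission_fields is hypothetical.

```python
import stat


def permission_fields(mode):
    """Split a mode word into the type character and the three rwx groups."""
    text = stat.filemode(mode)        # for example 'drwxr-x---'
    return {
        "type": text[0],              # 'd' for a directory, '-' for a plain file
        "user": text[1:4],
        "group": text[4:7],
        "other": text[7:10],
    }


if __name__ == "__main__":
    # A directory: user may do anything, group may read and search, others nothing.
    print(permission_fields(stat.S_IFDIR | 0o750))
```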

Memory access to pages by TLB in modern computers
no access, read only, execute allowed.

 
Protected Mode Addressing


More Intel information:
This lecture uses Intel documentation on the X86-64 and IA-32 Architecture.
In principle IA-32 covers all Intel 80x86 machines up to and including
the Pentium 4.
In principle X86-64 covers all new Intel computers including HPC.
Stored locally in order to minimize network traffic.

First look over Appendix B. (This is a  .pdf  file that your
browser should activate  acroread  to display.) Look on the left
for a table of contents and ultimately click on Appendix B.
(See the meaning of "s" and "w", then look at the various  "add"  instructions.)

Intel IA-32 Instructions(pdf) 

Note the "One Byte" opcodes. There are two tables with up to 256
instruction operation codes in each table.

Then move on to the "Two Byte" opcodes. The first opcode byte would
tell the CPU to look at the next byte to determine the operation code
for this instruction.

Sorry, the web and browsers and network are so darn slow, 
I have included .png  graphics of the pages:

intel64-ia32.pdf full 3439 pages 16.8MB including Instructions(pdf) 

The X86-64 and IA-32 are CISC, Complex Instruction Set Computer.

This is in contrast to computer architectures such as the
Alpha, MIPS, PowerPC = Power4 = MAC G5, etc. that are
RISC, Reduced Instruction Set Computer. "Reduced" does not
mean, necessarily, fewer instructions. "Reduced" means
lower complexity and more regularity. Typically all instructions
are the same number of bytes. Four bytes equals 32 bits is the most
popular. Regular in the sense that all registers are general
purpose. Not like the IA-32 using EAX and EDX for multiply
and divide, or the X86-64 using RAX and RDX for multiply and divide.

All MIPS instructions are 32 bits, the 6 bit major opcode
allows 64 instructions and with the 6 bit minor opcode
there may be 4096 instructions. Just a few instructions are shown:

easier to program and simulate
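
The regularity makes decoding simple: every field sits at a fixed bit
position. A minimal sketch in Python of splitting a 32-bit word in the
standard MIPS register (R) format, where the 6 bit major opcode is the
top six bits and the 6 bit minor opcode is the "funct" field in the low
six bits. The function name is illustrative only.

```python
def decode_mips(word):
    """Split a 32-bit MIPS R-format instruction word into its fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # 6 bit major opcode
        "rs":     (word >> 21) & 0x1F,   # first source register
        "rt":     (word >> 16) & 0x1F,   # second source register
        "rd":     (word >> 11) & 0x1F,   # destination register
        "shamt":  (word >> 6) & 0x1F,    # shift amount
        "funct":  word & 0x3F,           # 6 bit minor opcode
    }


if __name__ == "__main__":
    # 0x012A4020 encodes  add $8, $9, $10  (add $t0, $t1, $t2)
    print(decode_mips(0x012A4020))
    print("major * minor opcodes:", (1 << 6) * (1 << 6))
```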

Alpha another RISC architecture

Lecture 26 Virtual Memory paging hardware


Virtual Memory paging hardware


Modern terminology calls this a TLB.

some review:
Now, hardware can also be pipelined, for example a parallel multiplier.
Suppose we need to have at most 8 gate delays between pipeline
registers.



Note that any and-or-not logic can be converted to use only nand gates
or only nor gates. Thus, two level logic can have two gate delays.

We can build each multiplier stage with two gate delays. Thus we can
have only four multiplier stages then a pipeline register. Using a
carry save parallel 32-bit by 32-bit multiplier we need 32 stages, and
thus eight pipeline stages plus one extra stage for the final adder.



Note that a multiply can be started every clock. Thus a multiply
can be finished every clock. The speedup including the last adder
stage is 9 as shown in:
pipemul_test.vhdl
pipemul_test.out
pipemul.vhdl
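
The stage count and the speedup can be checked with a little
arithmetic. A minimal sketch in Python, assuming an unpipelined multiply
occupies the hardware for all nine clock periods while the pipelined
version accepts a new multiply every clock; the function names and
parameters are illustrative and are not taken from pipemul.vhdl.

```python
import math


def pipeline_stages(multiplier_stages=32, delays_per_stage=2,
                    max_delays=8, adder_stages=1):
    """Pipeline registers needed, plus the extra stage for the final adder."""
    stages_per_register = max_delays // delays_per_stage      # 8 / 2 = 4
    return math.ceil(multiplier_stages / stages_per_register) + adder_stages


def speedup(n_multiplies, stages):
    """Clocks without pipelining divided by clocks with pipelining."""
    unpipelined = n_multiplies * stages
    pipelined = stages + (n_multiplies - 1)    # fill the pipe, then one per clock
    return unpipelined / pipelined


if __name__ == "__main__":
    stages = pipeline_stages()
    for n in (1, 10, 100, 10000):
        print(f"{n:6d} multiplies: speedup {speedup(n, stages):.2f}")
```

The speedup approaches 9 only for a long stream of multiplies. A single
multiply still takes nine clocks.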



A 64-bit PG adder may be built with eight or fewer gate delays.
The signals a, b and sum are 64 bits. See add64.vhdl for details.



add64.vhdl



Any combinational logic can be performed in two levels with "and" gates
feeding "or" gates, assuming complementation time can be ignored.
Some designers may use diagrams but I wrote a Quine-McCluskey minimization
program that computes the two level and-or-not VHDL statement
for combinational logic.

quine_mcclusky.c logic minimization

eqn4.dat input data

eqn4.out both VHDL and Verilog output

There are 2^(2^N) possible Boolean functions of N input bits.

Not as practical: I wrote a Myhill minimization of a finite state machine,
a Deterministic Finite Automaton, that inputs a state transition table
and outputs the minimum state equivalent machine. "Not as practical" 
because the design of sequential logic should be understandable. The
minimized machine's function is typically unrecognizable.

myhill.cpp state minimization
initial.dfa input data
myhill.dfa minimized output



A reasonably complete architecture description for the Alpha
showing the pipeline is:

basic Alpha
more complete Alpha

The "Cell" chip has unique architecture:

Cell architecture

Some technical data on Intel Core Duo (With some advertising.)

Core Duo all on WEB

From Intel, with lots of advertising:
power is proportional to capacitance * voltage^2 * frequency, page 7.

tech overview

whitepaper
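
The proportionality above shows why lowering voltage matters more than
lowering frequency. A minimal sketch in Python that works in ratios
relative to a baseline design; the function and parameter names are
illustrative, and the 0.8 ratios are made-up example inputs, not Intel
figures.

```python
def relative_dynamic_power(capacitance_ratio=1.0, voltage_ratio=1.0,
                           frequency_ratio=1.0):
    """Power relative to a baseline: capacitance * voltage^2 * frequency."""
    return capacitance_ratio * voltage_ratio ** 2 * frequency_ratio


if __name__ == "__main__":
    # Hypothetical: run at 80% of the voltage and 80% of the frequency.
    both = relative_dynamic_power(voltage_ratio=0.8, frequency_ratio=0.8)
    print(f"80% voltage and 80% frequency: {both:.3f} of baseline power")
    # Frequency alone scales power linearly.
    print(f"80% frequency only: {relative_dynamic_power(frequency_ratio=0.8):.3f}")
```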


Intel quad core demonstrated


AMD quad core

By 2010 AMD had a 12-core available and Intel had an 8-core available.
 and 24 core and 48 core AMD


IBM Power6 at 4.7GHz clock speed

Intel I7 920 Nehalem 2.66GHz quad core  $279.99
Intel I7 940 Nehalem 2.93GHz quad core  $569.99
Intel I7 965 Nehalem 3.20GHz quad core  $999.99
Prices vary with time, NewEgg.com search Intel I7

Lecture 27 Arithmetic Logic Unit


Often called  ALU  a major part of the CPU 

Quick review:
Basic digital logic


The Arithmetic Logic Unit is the section of the CPU that actually
performs add, subtract, multiply, divide, and, or, floating point and
other operations. The choice of which operations are implemented is
determined by the Instruction Set Architecture, ISA. Most modern
computers separate the integer unit from the floating point unit.
Many modern architectures have simple integer, complex integer, and
an assortment of floating point units.



The ALU gets inputs from registers part1.jpg

Where did numbers such as 100010 for subop and 000010 for sllop
come from? cs411_opcodes.txt





Note that bshift.vhdl contains two different architectures
for the same entity. A behavioral architecture using sequential
programming and a circuits architecture using digital logic
components.

bshift.vhdl

An 8-bit version of shift right logical, using single bit signals,
three bit shift count, is:
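
As a software companion to the schematic, the same function can be
sketched with three rows of two-input multiplexers, one row per bit of
the shift count, shifting by 4, 2 and 1. This is a minimal sketch in
Python of one common structure. Whether bshift.vhdl uses exactly this
arrangement is an assumption, and the names are illustrative only.

```python
def mux2(select, in0, in1):
    """Two-input multiplexer on single bit signals."""
    return in1 if select else in0


def srl8(bits, count):
    """Shift right logical. bits[0] is the least significant bit, count is 0..7."""
    stage = list(bits)
    for select_index, distance in ((2, 4), (1, 2), (0, 1)):
        select = (count >> select_index) & 1
        # Each output bit takes either its own input or the bit 'distance'
        # positions higher; zeros are shifted in at the top.
        stage = [
            mux2(select, stage[i], stage[i + distance] if i + distance < 8 else 0)
            for i in range(8)
        ]
    return stage


def to_bits(value):
    return [(value >> i) & 1 for i in range(8)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    value = 0b10110100
    for count in range(8):
        print(count, format(from_bits(srl8(to_bits(value), count)), "08b"))
```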




Where the diagram said  "pmul16 goes here" a parallel multiplier and
a parallel divider would be included. The "result" mux would
get two more data inputs, and two more control inputs:
mulop and divop. Then the upper half of the product and
the remainder would be saved in a temporary register,
the "hi" of the "hi" and "lo" registers shown previously.
Then stored on the next clock cycle in this architecture.

Fully parallel multiplier (possibly pipelined in another architecture)




Fully parallel divider (possibly pipelined in another architecture)





There are many ways to build an ALU. Often the choice is based
on mask making and requires a repeated pattern. The "bit slice"
method uses the same structure for every bit. One example is:



Note that 'Operation' is two bits, 0 for logical and, 1 for logical or,
2 for add or subtract, and 3 for an operation called set used for
comparison.
'Binvert' and 'CarryIn' would be set to '1' for subtract.
'Binvert' set to '1' with 'a' set to '0' (add, 'CarryIn' '0') would give the complement of 'b'.
The overflow detection is in every stage yet only used in the
last stage.

The bit slices are wired together to form a simple ALU:



The 'set' operation would give non zero if 'a' < 'b' and
zero otherwise. A possible condition status or register
value for a "beq" instruction.


If overflow was to be detected, the circuit below uses the
sign bit of the A and B inputs and the sign bit of the
result to detect overflow on twos complement addition.
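
The bit slice, the wiring of slices into an ALU, the 'set' operation and
the sign bit overflow test can be sketched together. This is a minimal
sketch in Python that follows the operation codes given above
(0 and, 1 or, 2 add or subtract, 3 set). The function names, the 8-bit
default width and the way 'set' is wired back to slice 0 are
assumptions, not a transcription of the schematics.

```python
def alu_slice(a, b, less, binvert, carry_in, operation):
    """One bit slice. Returns (result bit, carry out, sum bit)."""
    b = b ^ binvert                     # optional inversion of the b input
    total = a + b + carry_in            # one bit full adder
    sum_bit, carry_out = total & 1, total >> 1
    result = (a & b, a | b, sum_bit, less)[operation]
    return result, carry_out, sum_bit


def alu(a, b, operation, binvert=0, width=8):
    """Wire 'width' slices together. Returns (result, overflow)."""
    carry = binvert                     # CarryIn of slice 0 is '1' for subtract
    result = 0
    for i in range(width):
        bit, carry, sum_bit = alu_slice((a >> i) & 1, (b >> i) & 1,
                                        0, binvert, carry, operation)
        result |= bit << i
    if operation == 3:
        result |= sum_bit               # sign of a - b feeds 'less' of slice 0
    sign_a = (a >> (width - 1)) & 1
    sign_b = ((b >> (width - 1)) & 1) ^ binvert
    # Overflow: both adder inputs have the same sign, the result does not.
    overflow = int(operation == 2 and sign_a == sign_b and sum_bit != sign_a)
    return result, overflow


if __name__ == "__main__":
    print("5 + 3     ->", alu(5, 3, 2))
    print("5 - 3     ->", alu(5, 3, 2, binvert=1))
    print("3 < 5     ->", alu(3, 5, 3, binvert=1))
    print("100 + 100 ->", alu(100, 100, 2))
```

As in the schematic, the sum bit that drives overflow detection is
computed in every slice yet only the last slice's value is used.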


 
More cores
AMD cores

Lecture 28 Architecture


Some computers I have programmed since 1960:
LGP 30
IBM 650
IBM 704
IBM 709
IBM 7090
PDP 11
DEC 10
Univac 1107
Univac 1108
IBM 360
DEC Alpha
SGI mips
Sun Sparc
TI 99
IBM PC
Intel PC
AMD PC
IBM Power PC
...            in at least 17 programming languages

32-bit and 64-bit  ALU  architectures are available.
Time to retire all 32 bit machines and software.

A 64-bit architecture, by definition, has 64-bit integer registers.
Many computers have had 64-bit IEEE floating point for many years.
The 64-bit machines have been around for a while, such as the Alpha and
PowerPC, yet became popular for the desktop only with the Intel and
AMD 64-bit machines.



Software has been dragging well behind computer architecture.
The chaos started in 1979 with the following "choices."



The full whitepaper www.unix.org/whitepapers/64bit.html

My desire is to have the compiler, linker and operating system be ILP64.
All my code would work fine. I make no assumptions about word length.
I use sizeof(int)  sizeof(size_t) etc. when absolutely needed.
On my 8GB computer I use a single array of over 4GB thus the subscripts
must be 64-bit. The only option, I know of, for gcc is  -m64 and that
just gives LP64. Yuk! I have to change my source code and use "long"
everywhere in place of "int". If you get the idea that I am angry with
the compiler vendors, you are correct!

big.out

The early 64-bit computers were:

DEC Alpha

DEC Alpha

IBM PowerPC

Some history of 64-bit computers:




Java for 64-bit, source compatible


Don't panic, you do not need to understand everything about
the Intel Itanium architecture:

IA-64 Itanium

Some history of the evolution of Intel computers:

Intel X86 development

long list

More cores are better, more parallel execution.



Note: Python integer is multiple precision integer.
You can compute 52! with no extra flags or libraries or code.
Java and many other languages now have integer as 64-bits.
The laggard is the "C" language, even with -m64, int is still 32 bits.

Lecture 29 Review

  Review previous lectures 15 through 28.
  No assembly language questions, that was covered in Midterm Exam.

  Sample questions will be presented in class.

Lecture 30 Final Exam

The midterm was considered the end of the Assembly Language
part of this course. Thus, the final exam will cover 
lectures 15 through 29 on digital logic and computer organization.

There will be questions of types:
  true-false
  multiple choice
  short answer (words, numbers, logic equations)



Automate!

Last updated 10/21/2015
