We have a number of clusters at UMBC. I have used our
Bluegrit, Bluewave, Tara, Maya, and Taki clusters, and the
MPI examples are from these multiprocessor machines.
For multicore machines, there are Java threads,
"C" pthreads and Ada tasks. I have a 12-core desktop
and Intel has a many-core computer.
Examples are presented below.
At the end are a few multicore benchmarks for you to run.
We can use our biggest supercomputer on campus, Taki.
NOTE:
MPI runs on a distributed memory system.
Each process may be considered to have local memory and,
in general, there is no common shared memory.
Multicore machines are described here as shared memory systems.
All memory is available to all threads and tasks;
this is technically known as a single address space.
Some parallel programming techniques apply to both
distributed and shared memory; some techniques do not
apply to both memory systems.
MPI
MPI stands for Message Passing Interface and is
available on many multiprocessors. MPI may be installed
as the open source version, MPICH. There are other
software libraries and languages for multiprocessors,
but this lecture covers only MPI.
The WEB page here at UMBC is
www.csee.umbc.edu/help/MPI
Programming in MPI follows the SPMD, Single Program Multiple Data,
style of programming. One program runs on all CPUs in the
multiprocessor. Each CPU has a number, called a rank in MPI,
called myid in my code and called node or node number in comments.
"if-then-else" code based on the node number is used to have
unique computation on specific nodes. There is a master node,
typically the node with rank zero in MPI. The node number may
also be used in index expressions and other computation. Many
MPI programs use the master as a number cruncher along with the
other nodes, in addition to the master serving as overall control
and synchronization.
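The rank-based branching just described can be sketched as below.
This is a serial stand-in, not real MPI code: each rank would be a
separate process, myid would come from MPI_Comm_rank, and the
branches would call MPI_Send/MPI_Recv; the function work() and its
arithmetic are made up purely to show the control flow.

```c
#include <stdio.h>

/* Serial stand-in for the SPMD "if-then-else on rank" pattern. */
int work(int myid, int numprocs)
{
    if (myid == 0) {
        /* master node: overall control and synchronization,
           often also crunching numbers along with the others */
        printf("master of %d processes\n", numprocs);
        return 0;
    }
    /* worker node: the rank selects this node's share of the work,
       e.g. by appearing in an index expression */
    int first = myid * 10;   /* made-up layout: 10 items per node */
    printf("node %d of %d starts at item %d\n", myid, numprocs, first);
    return first;
}
```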
Examples below are given first in "C" and then a few in Fortran.
Other languages may interface with the MPI library.
These just show simple MPI use; they are combined later for
solving simultaneous equations on a multiprocessor.
Just check that a message can be sent and received from each
node, processor, CPU, etc., numbered by its "rank".
roll_call.c
roll_call.out
Just scatter unique data from the "master" to all nodes.
Then gather the unique results from all nodes.
scat.c
scat.out
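When the data does not divide evenly, a scatter needs per-node
counts and offsets. The bookkeeping can be sketched as below, in
the style of MPI_Scatterv's sendcounts and displs arrays; the
function name partition and the rule that the first n%p nodes get
one extra item are illustrative assumptions.

```c
/* Split n items over p nodes: counts[i] items for node i,
 * starting at offset displs[i] in the master's array. */
void partition(int n, int p, int counts[], int displs[])
{
    int base = n / p;        /* every node gets at least this many */
    int extra = n % p;       /* leftovers go to the first nodes */
    int offset = 0;
    for (int i = 0; i < p; i++) {
        counts[i] = base + (i < extra ? 1 : 0);
        displs[i] = offset;  /* where node i's slice begins */
        offset += counts[i];
    }
}
```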
Here is the Makefile I used.
Makefile for C on Bluegrit cluster
Repeating the "roll_call", changing only the language to Fortran.
roll_call.F
roll_call_F.out
Repeating scatter/gather, changing only the language to Fortran.
scat.F
scat_F.out
The Fortran version of the Makefile with additional files I used.
Makefile for Fortran on Bluegrit cluster
my_mpif.h only needed if not on cluster
nodes only needed if default machinefile not used
MPI Simultaneous Equations
Now, the purpose of this lecture: solve huge simultaneous
equations on a highly parallel multiprocessor.
Well, start small when programming a multiprocessor, and print out
every step to be sure the indexing and communication are exactly
correct.
This is hard to read, yet it was a necessary step.
psimeq_debug.c
psimeq_debug.out
Then, some clean up and removing or commenting out most debug print:
psimeq1.c
psimeq1.out
The input data was created so that the exact answers were 1, 2, 3, ...
It is interesting to note that because the double precision floating
point data was from the set of integers, the answers are exact for
8192 equations in 8192 unknowns.
psimeq1.out8192
|A| * |X| = |Y| given matrix |A| and vector |Y| find vector |X|
| 1 2 3 4 5 | |5| | 35| for 5 equations in 5 unknowns
| 2 2 3 4 5 | |4| | 40| the solved problem is this
| 3 3 3 4 5 |*|3|=| 49|
| 4 4 4 4 5 | |2| | 61|
| 5 5 5 5 5 | |1| | 75|
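This solved problem can be reproduced by a small serial Gaussian
elimination with partial pivoting. This is only a minimal serial
baseline, not the parallel psimeq code; the fixed size N and the
names simeq and dabs are assumptions for illustration.

```c
#define N 5   /* size fixed for this sketch */

double dabs(double v) { return v < 0.0 ? -v : v; }

/* Gaussian elimination with partial pivoting on the augmented
 * matrix: a[i][0..N-1] holds |A|, a[i][N] holds |Y|,
 * and the solution |X| is returned in x. */
void simeq(double a[N][N + 1], double x[N])
{
    for (int k = 0; k < N; k++) {
        int piv = k;                       /* find largest pivot */
        for (int i = k + 1; i < N; i++)
            if (dabs(a[i][k]) > dabs(a[piv][k])) piv = i;
        for (int j = k; j <= N; j++) {     /* swap pivot row up */
            double t = a[k][j]; a[k][j] = a[piv][j]; a[piv][j] = t;
        }
        for (int i = k + 1; i < N; i++) {  /* eliminate below pivot */
            double m = a[i][k] / a[k][k];
            for (int j = k; j <= N; j++) a[i][j] -= m * a[k][j];
        }
    }
    for (int i = N - 1; i >= 0; i--) {     /* back substitution */
        x[i] = a[i][N];
        for (int j = i + 1; j < N; j++) x[i] -= a[i][j] * x[j];
        x[i] /= a[i][i];
    }
}
```

For the 5 by 5 matrix above, A[i][j] = max(i,j)+1 (0-indexed),
Y = (35, 40, 49, 61, 75), and the solution is X = (5, 4, 3, 2, 1).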
A series of timing runs were made, changing the number of equations.
The results were expected to increase in time as order n^3 over the
number of processors being used. Reasonable agreement was measured.
Using 16 processors:

  Number of    Time computing    Cube root of 16 times Time
  equations    solution (sec)    (should about double as the
                                  number of equations doubles)
      1024            3.7            3.9
      2048           17.2            6.5
      4096           83.5           11.0
      8192          471.9           19.6
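The third column of the table can be checked directly: assuming
time ~ C * n^3 / p with p = 16 processors, the cube of that column
divided by 16 should reproduce the measured seconds. The function
name predicted_seconds is an assumption for illustration.

```c
/* If time ~ C * n^3 / p, then the "cube root of p times time"
 * column grows linearly in n, so cubing it and dividing by p
 * should give back roughly the measured time. */
double predicted_seconds(double col3, int procs)
{
    return col3 * col3 * col3 / procs;
}
```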
More work may be performed to minimize the amount of
data sent and received in "rbuf".
C pthreads Simultaneous Equations
Basic primitive barrier in C pthreads
run_thread.c
run_thread_c.out
Simultaneous equation solution using AMD 12 core
tsimeq.h
tsimeq.c
time_tsimeqb.c
time_tsimeqb.out
More examples of pthreads with debug printout
thread_loop.c with comments
thread_loop_c.out
Java Simultaneous Equations
Java threads are demonstrated by the following example.
RunThread.java
When run, there are four windows, each showing a dot as that thread runs.
RunThread.out
Note that CPU and Wall time are measured and printed. (on some Java versions)
The basic structure of threads needed for my code:
(I still have not figured out why I need the dumb "sleep" in 2)
Barrier2.java
Barrier2_java.out
(OK, several versions later)
CyclicBarrier4.java
CyclicBarrier4_java.out
Simultaneous equation solution with multiple processors in
a shared memory configuration is accomplished with:
psimeq.java
test_psimeq.java test driver
test_psimeq_java.out output
psimeq_dbg.java with lots of debug print
test_simeq_dbg.java with debug
test_psimeq_dbg_java.out output with debug
A better version making better use of threads and cyclic barrier:
simeq_thread.java
test_simeq_thread.java test driver
test_simeq_thread_java.out output
And, for "diff" comparison, test results of the non-threaded version:
test_simeq_java.out output
Some crude timing tests:
time_simeq.java test driver
time_pimeq_java.out output
time_psimeq.java test driver
time_psimeq_java.out output yuk!
time_simeq_thread.java test driver
time_pimeq_thread_java.out output quad core
Ada Simultaneous Equations
Simultaneous equation solution with multiple processors in
a shared memory configuration is accomplished with:
psimeq.adb
test_psimeq.adb test driver
test_psimeq_ada.out output
psimeq_dbg.adb with lots of debug print
test_simeq_dbg.adb with debug
test_psimeq_dbg_ada.out output with debug
time_psimeq.adb test driver
time_psimeq_ada.out outputs
Then using Barriers
bsimeq_2.adb
time_bsimeq.adb test driver
time_bsimeq_ada.out outputs
Another tutorial-type example
task_loop.adb with comments
task_loop_ada.out
Python Simultaneous Equations
The basic structure of threads needed for my code (python2):
barrier2.py
barrier2_py.out
Multiprocessor Benchmarks
"C" pthreads are demonstrated by an example that measures the
efficiency of two cores, four cores or eight cores.
time_mp2.c
time_mp2.out
The ratio of Wall time to CPU time indicates degree of parallelism.
time_mp4.c 4 core shared memory
time_mp4.out
time_mp8.c 8 core shared memory
time_mp8_c.out
My AMD 12-core desktop computer July 2010
time_mp12.c 12 core shared memory
time_mp12_c.out
time_mp4.java 4 core shared memory
time_mp4_java.out
time_mp8.java 8 core shared memory
time_mp8_java.out
pthreads using mutex.c and mutex.h
mutex.c encapsulates pthreads
mutex.h encapsulates pthreads
thread4m.c main plus 4 threads
thread4m_c.out output
thread11m.c main plus 11 threads
thread11m_c.out output
for comparison, using just basic pthreads
thread4.c main plus 4 threads
thread4_c.out output
Comparison using Java threads and C pthreads on a big matrix multiply
c[1000][1000] = a[1000][1000] * b[1000][1000]
for comparison, one thread time and four thread time:
Java example
matmul_thread4.java source code
matmul_thread4_java.out output
matmul_thread1.java source code no threads
matmul_thread1_java.out output
C pthread example
matmul_pthread4.c source code
matmul_pthread4.out output
matmul.c source code no threads
matmul_c.out output
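One common way such threaded multiplies split the work is to give
each thread a contiguous band of rows; the rows are disjoint, so no
locking is needed. This is a sketch of that idea, not the course's
matmul_pthread4.c; the size N, thread count NT, and names here are
assumptions.

```c
#include <pthread.h>

#define N  64    /* small size for the sketch */
#define NT 4     /* number of threads */

double a[N][N], b[N][N], c[N][N];

struct band { int lo, hi; };   /* rows [lo, hi) for one thread */

void *mul_rows(void *arg)
{
    struct band *s = arg;
    for (int i = s->lo; i < s->hi; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;     /* only this thread writes row i */
        }
    return NULL;
}

void matmul_threads(void)
{
    pthread_t t[NT];
    struct band s[NT];
    for (int p = 0; p < NT; p++) {
        s[p].lo = p * N / NT;          /* thread p's band of rows */
        s[p].hi = (p + 1) * N / NT;
        pthread_create(&t[p], NULL, mul_rows, &s[p]);
    }
    for (int p = 0; p < NT; p++)
        pthread_join(t[p], NULL);
}
```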
Python3 threading and multitasking
thread3.py3 source code
thread3_py3.out output
multi.py3 source code
multi_py3.out output
C OpenMP example (a matrix size of 1000 caused a segfault, so the size was cut to 510)
omp_matmul.c source code
omp_matmul.out output
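The segfault at size 1000 is consistent with three 1000x1000 double
arrays (about 24 MB) declared as locals overflowing a typical 8 MB
stack; whether omp_matmul.c did exactly that is an assumption. A
sketch that sidesteps the limit by allocating on the heap follows;
the pragma is simply ignored by compilers without OpenMP enabled.

```c
#include <stdlib.h>

/* Multiply two n by n matrices of all ones on the heap, returning
 * c (caller frees).  Heap storage avoids the stack size limit that
 * large automatic arrays would hit.  Error checks kept minimal. */
double *matmul_heap(int n)
{
    double *a = malloc(n * n * sizeof *a);
    double *b = malloc(n * n * sizeof *b);
    double *c = malloc(n * n * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return NULL; }
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 1.0; }
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;   /* each entry is exactly n */
        }
    free(a);
    free(b);
    return c;
}
```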