Class | Date / Room | Main Topic / Handouts and Attachments / Summary | |
Lecture 1 | Mo. 12/05/11 | Motivation for parallel computing | |
Room 2421 | HPCF-2010-2; HW 1; research slides; PDF transcript; class summary | ||
The purpose of today's class was to
(i) motivate the need for parallel computing and its goals,
(ii) demonstrate the teaching techniques, and
(iii) demonstrate the learning techniques.

For (i), I showed slides 2, 3, and 4 as information about myself, slides 13, 14, 17, 20, and 21 about a prototypical PDE problem, slides 60-67 to introduce what a parallel computing cluster looks like and what the issues are with distributed-memory clusters, and slides 78, 79, 81, 82, and 83 to motivate the following conclusion: The most important purpose of parallel computing is to pool the memory of several compute nodes to allow the solution of larger problems than is possible on a serial node. The second most important purpose is to speed up calculations so that they become feasible within a reasonable amount of time.

For (ii), I used my iPad2 with NoteTakerHD to write and project hand-written lectures, which will be posted as PDF transcripts here; I annotated the slides in red (slides posted without annotations here); and I used iSSH to log in briefly to the cluster hrz-cs400.hrz.uni-kassel.de that we will use in this class.

We talked about (iii) only in as much as everybody introduced themselves (oral English) and we noted the homeworks (written English); I will return to this in the future to clarify further that this class has the goal of preparing you for research, so learning how to learn from written material (by itself or in preparation for a lecture), writing reports of progressive sizes, and acting more and more independently are all built into the course. We briefly flipped through the various webpages for this course and related pages that you are asked to visit on HW 1; see the URLs there. You should read the first four pages of the tech. report HPCF-2010-2 (notice: year 2010, not 2011) that is linked here via the HPCF Publications page. | |||
Lab 1 | Tu. 12/06/11 | Linux, hello_serial.c, hello_parallel.c, running jobs | |
HW 1 due | Room 2421 | HW 2; mpich2_pgi.qsub; hello_parallel.c; hello_serial.c; class summary | |
We effectively tried to work our way through
Problems 1 and 3 of HW 2, by me demonstrating using
the projector and students following along. We ran out of
time, so you should continue by experimenting yourself,
which is what the homework asks you to do.
Problem 1 introduces several things: We logged in to the
cluster hrz-cs400.hrz.uni-kassel.de; when doing this and
the following, we cross-checked against information on the
webpage of the IT Servicezentrum. I demonstrated basic
Unix/Linux commands such as pwd, ls -l, mkdir, cd, rm.
For Problem 1, I showed how to program in the language C,
by writing a "Hello, world!" program from scratch;
see attached file hello_serial.c. To do this, we had to
learn the basics of the vi editor: to enter it,
simply type "vi filename", which puts you in command mode
of the editor; type "i" to enter input mode; then
you can enter text. Use ESC to leave input mode and
return to command mode. In command mode, you can use the
keyboard keys "h", "j", "k", "l" (or the arrow keys) to
go left, down, up, and right, respectively. Finally, to
quit the editor and save ('write to file') the result,
type ":wq" and hit return.
Problem 3 asks you to download two posted files.
The easiest method at this point turned out to be to
click on the link for each posted file to display it in
the web browser, highlight all of its contents, and copy-and-paste
them into a file with the desired name. I realize that all this can be very confusing at first. Our goal is to get something working as fast as possible, without explaining everything; watch for the next lectures to explain more and more about what is actually going on. | |||
Lecture 2 | We. 12/07/11 | Chapter 3: Greetings! | |
Room 0450 | PDF transcript; video recording; greetings.c; class summary | ||
We reviewed the steps associated
with Problem 3 of Homework 2, which was mostly completed
yesterday in the lab. I made the point that for a more
systematic study, it will be useful to create one directory
for each desired combination of nodes and ppn.
In this process, I showed a number of Unix features
such as how to show hidden files ("ls -al"),
how to define short-hand commands (so-called aliases)
for the Korn shell (/bin/ksh, see output of "echo $SHELL")
in its run-control file .kshrc (alias ll="ls -l"),
mkdir, cp -p (to preserve date/time stamps and permissions),
cp -r (recursive copy of directories and sub-directories
and their files), diff (to compare two files or directories),
mv (move or rename), and more.
Regarding use of this cluster, we used qsub to submit a job,
then qstat to see the status of all currently running or
queued jobs (recommended options are -a and -n) and
the Ganglia software,
which you access by pointing a web browser to
We noticed that this hello_parallel.c does not actually
communicate between processes. To this end, I formulated a
task (see comments in greetings.c) and then showed how to
design an algorithm and code from the task.
See Chapter 3 of Pacheco for a longer written explanation
and for an equivalent code using a slightly different style.
The point was to learn about MPI_Send and MPI_Recv.
We learned how to use them from their man pages
and then implemented code in the attached greetings.c.
Its output now results in an ordered list of processes.
Please notice that this also shows you how to do for-loops
and if-statements in C.
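A sketch in the spirit of the greetings.c described above is shown below; the variable names and the exact message text are assumptions, not necessarily those of the posted file:

```c
/* Sketch in the spirit of greetings.c: every process sends a message
   to process 0, which prints them in order of rank. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, source;
  char message[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  if (id != 0) {
    sprintf(message, "Greetings from process %d!", id);
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  } else {
    printf("Greetings from process %d!\n", id);
    for (source = 1; source < np; source++) {
      /* receive in order of rank, which gives the ordered output */
      MPI_Recv(message, 100, MPI_CHAR, source, 0, MPI_COMM_WORLD, &status);
      printf("%s\n", message);
    }
  }

  MPI_Finalize();
  return 0;
}
```
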
We compiled this code with | |||
Lecture 3 | Mo. 12/12/11 | Chapter 4: An Application: Numerical Integration | |
Room 2404 | PDF transcript; video recording; class summary | ||
We started by looking at a portion of the
taped lecture from last time. It works very smoothly on the iPad,
but the recommendation for other laptops/desktops is to download
the mpeg-file first, then play it.

The purpose of today is to extend our use of MPI communications to a context where the communications are part of a numerical algorithm and serve to compute some final result that requires all processes to work together. We use the trapezoidal rule program from Chapter 4 in Pacheco as an example. I went through the background of the method and its parallelization idea. The idea is to divide up the significant calculation work among all processes and to pre-compute as much as possible locally on each process, and then to communicate as little as possible (here: we form local integrals and then communicate a single real number local_In, as opposed to communicating all f(x_i) and then accumulating the entire sum for In on Process 0).

My presentation tried to demonstrate the development of code and its documentation in text. I demonstrated the idea of using "local_" (or later "l_" for short) as a prefix for variables to indicate that they are local in the sense that they can have different values on each process (and not global in the sense that the value of that variable is the same on all processes). Like last time, we designed the parallel code from the explicitly stated algorithmic idea. Notice that we designed the algorithm from its parallelization idea, not by looking at a serial code; this leads us to focus more on the higher level and to bury calculation details in a function. When actually writing code from scratch, you will still likely start bottom-up, but with the given tactic you can design more appropriate variable names.

We ended by reading Pacheco's trap.c file from Chapter 4 and noticed how some details differ from my design, including the choice of variable names. Regarding C programming, the code provides examples of function definitions and the need for a prototype of a function before calling it, of float as the intrinsic datatype for single-precision real variables, and of the ampersand to obtain the address of a variable (used in the first argument of MPI_Send and MPI_Recv). | |||
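A condensed sketch of the design described above: each process integrates its own subinterval and only the single number local_In is communicated to Process 0. The integrand f and the values of a, b, and n are placeholders for illustration, not the actual example used in class.

```c
/* Sketch of the parallel trapezoidal rule in the spirit of the lecture. */
#include <stdio.h>
#include <mpi.h>

float f(float x) { return x * x; }          /* example integrand (placeholder) */

/* local trapezoidal rule on [local_a, local_b] with local_n trapezoids */
float Trap(float local_a, float local_b, int local_n, float h)
{
  float integral = (f(local_a) + f(local_b)) / 2.0;
  float x = local_a;
  int i;
  for (i = 1; i < local_n; i++) {
    x += h;
    integral += f(x);
  }
  return integral * h;
}

int main(int argc, char *argv[])
{
  int id, np, source, local_n, n = 1024;    /* n assumed divisible by np */
  float a = 0.0, b = 1.0, h, local_a, local_b, local_In, In;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  h = (b - a) / n;                          /* step size, same on all processes */
  local_n = n / np;                         /* trapezoids per process */
  local_a = a + id * local_n * h;           /* left end of this process's piece */
  local_b = local_a + local_n * h;
  local_In = Trap(local_a, local_b, local_n, h);

  if (id != 0) {                            /* send one number to Process 0 */
    MPI_Send(&local_In, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
  } else {                                  /* Process 0 accumulates the total */
    In = local_In;
    for (source = 1; source < np; source++) {
      MPI_Recv(&local_In, 1, MPI_FLOAT, source, 0, MPI_COMM_WORLD, &status);
      In += local_In;
    }
    printf("With n = %d trapezoids, integral from %f to %f = %24.16e\n",
           n, a, b, In);
  }

  MPI_Finalize();
  return 0;
}
```
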
Lab 2 | Tu. 12/13/11 | Ch. 3 and 4 | |
HW 2 due | Room 2421 | HW 3; class summary | |
At the beginning, we went through some preparatory steps:
we created directory lab02, copied trap.c from Pacheco's Chapter 4
(ppmpi_c/chap04/trap.c) to lab02, compiled it with mpicc
(mpicc trap.c -o trap), and ran it using qsub;
for the latter point, I suggested not to download a new
submission script from somewhere, but to copy the one
from the last lab and modify it (mpijob=./trap). Some students still had to download the code from Pacheco; this is easiest done by navigating to the book's webpage (link from the course homepage), then choosing "save the link address" (by right-mouse button or similar) on the link to the C code (labeled "C (updated 2000/01/23)" on Pacheco's page), then pasting this link address as argument to a wget command at the Linux prompt on the cluster, as in wget http://www.cs.usfca.edu/~peter/ppmpi/ppmpi_c.tar.Z This puts the file ppmpi_c.tar.Z into the current directory. This file is zipped, i.e., compressed, and can be unzipped by gunzip ppmpi_c.tar.Z which gives the file ppmpi_c.tar. To untar this, i.e., expand this directory's sub-directories and their contents, use tar xf ppmpi_c.tar You then have a directory called ppmpi_c. Change directory into it and list the sub-directories, which shows, among others, chap04, which contains, among others, trap.c.

We discussed some ideas for how to improve this code. Two specific issues right at the beginning are that the compilation with mpicc gives a warning (this should be fixed) and the output is poor. I pointed out that the output raises the question of what mathematical problem we are solving anyhow; looking through the code for a, b, and f(x) gives the information, and this allows us to obtain the true integral value I=1/3. Using this, I wanted to teach the reasonable habit of outputting all available information, such as, at least, the number of subintervals n, the step size h, the true value I, the approximation In (which is the variable total in Pacheco's code), and the error I-In (coded as I-total); in fact, since the error should have magnitude h^2, you may also want to print out h*h as well.

I spent some time demonstrating some pitfalls of C here, such as coding I=1/3 just like that, which gives the integer division 1/3=0 and hence float I=0.0. Moreover, this outputs as I=0.000000 for format conversion %f, which is what Pacheco uses in several places. Once we fixed the computation of I as I=1.0/3.0, we still saw something strange like error=I-total=0.000000, which is odd, since there should be some error on the order of h^2=1e-7. The issue is that the output format %f simply does not show enough digits. I strongly suggest to always use %24.16e, which is exponential format, normalized to one non-zero decimal digit before the decimal point and with 16 decimal digits after the decimal point, followed by a power of 10 as indicated by a notation such as e-12 at the end. Single-precision numbers have about 8 significant decimal digits of accuracy, so the above is clearly too many (%15.8e would suffice). But for very large n values, we will have reason to switch to double-precision numbers anyway, so I anticipate this with the %24.16e. The 24 indicates the field width of the output field. We need to allow for 8 additional characters here, since double-precision numbers range approximately from 1e-308 to 1e+308, hence up to 3 digits can be necessary in the exponent of 10, as in e-123. (See the short C sketch at the end of this class summary for these output issues.)

In the process of working on trap.c and the submission script, we learned a number of new vi commands: Lowercase x deletes the character under the cursor. When in command mode, while lowercase i switches into insert mode before (!) the position of the cursor, lowercase a switches into insert mode after (!) the position of the cursor; we needed this to append something at the end of a line (hence a for append); to jump to the end of the current line for insertion, use uppercase A.
When in command mode, lowercase o inserts a blank line and enters insert mode after (!) the line where the cursor is, while uppercase O inserts a blank line and enters insert mode before (!) the line where the cursor is. To jump to the beginning of the file, use ":1", which is a special case of going to any line by saying ":n" for any line number n. To jump to the end of the file, use uppercase G in command mode. CTRL-F and CTRL-B move forward and backward by one page, respectively. CTRL-U and CTRL-D move up and down by half a page, respectively. | |||
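The C sketch referred to above, illustrating the integer-division and output-format pitfalls from this lab (the numbers are illustrative only):

```c
/* Sketch of the two pitfalls discussed in this lab. */
#include <stdio.h>

int main(void)
{
  double I_wrong = 1 / 3;       /* integer division: 1/3 == 0, so I_wrong = 0.0 */
  double I       = 1.0 / 3.0;   /* correct: floating-point division */
  double h       = (1.0 - 0.0) / 3000.0;   /* example step size */
  double error   = 1.0e-7;      /* example error of magnitude about h^2 */

  printf("I_wrong = %f\n", I_wrong);        /* prints 0.000000 */
  printf("I       = %f\n", I);              /* only 6 digits: 0.333333 */
  printf("error   = %f\n", error);          /* prints 0.000000: too few digits */
  printf("error   = %24.16e\n", error);     /* shows the full picture */
  printf("h*h     = %24.16e\n", h * h);
  return 0;
}
```
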
Lecture 4 | We. 12/14/11 | Chapter 5: Collective Communication | |
Room 0450 | PDF transcript; video recording; class summary | ||
I started the class by discussing the expectations for the
performance study of Homework 3, Part (f). The point is to
run on as many nodes as possible and see if the code runs
faster. Since the amount of work needs to be large enough
for a speedup to become visible, expect to make n larger
(recall that Pacheco uses 1024, which is very small in
this sense). I explained the expectation in terms of a
graphical representation of speedup. To learn about this,
take a closer look now at the tech. rep. HPCF-2010-2,
which is posted on the HPCF webpage www.umbc.edu/hpcf
under Publications.

I also explained the C programming issues of command-line arguments, how to do test output of them, and how to use them to get the variables a, b, n in the example of the trapezoidal rule program. Notice, by the way, again that code that performs well is only acceptable if its results are numerically correct, so you should check your printouts of h*h and the error.

Then we started with Chapter 5 on Collective Communication. This is a chapter with powerful commands that implement essentially all reasonable communications involving all parallel processes. We covered the commands MPI_Bcast, MPI_Reduce, and MPI_Allreduce. | |||
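A minimal sketch combining the two ingredients from today, command-line arguments and the collective commands: Process 0 reads n from the command line, MPI_Bcast distributes it, and MPI_Reduce collects a result on Process 0. The "work" here is a trivial local sum chosen just for illustration, not the trapezoidal rule code itself.

```c
/* Sketch: command-line argument on Process 0, MPI_Bcast, MPI_Reduce. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, n = 0, i;
  double local_sum = 0.0, total = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  if (id == 0) {
    if (argc > 1) n = atoi(argv[1]);   /* e.g., ./a.out 1000000 */
    printf("Process 0 read n = %d from the command line\n", n);
  }

  /* every process needs n, so broadcast it from Process 0 */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

  /* trivial local work: each process sums its share of 1, ..., n */
  for (i = id + 1; i <= n; i += np)
    local_sum += (double) i;

  /* combine all local sums on Process 0; MPI_Allreduce would instead
     give the result to every process */
  MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (id == 0)
    printf("total = %24.16e (expected n*(n+1)/2)\n", total);

  MPI_Finalize();
  return 0;
}
```
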
Lecture 5 | Mo. 12/19/11 | Chapter 5: Collective Communication | |
Room 2404 | PDF transcript; video recording; class summary | ||
We continued with Chapter 5. The starting point was to
develop code for a parallel dot product as in Sec. 5.5,
but already using MPI_Allreduce from Sec. 5.6.
This and more are differences between my presentation and Pacheco's.

The design of a dot product led us to clarify first how to split vectors onto parallel processes. Then we introduced MPI_Gather, MPI_Scatter, and MPI_Allgather from Sec. 5.7, but using strictly vectors x and l_x as examples (not matrices as in Pacheco).

Finally, I lectured a little on the Unix make utility with an example of C code split into two files file1.c and file2.c. The make utility makes this more efficient by re-compiling only the files that have changed since the last compilation. These date/time-stamp comparisons are driven by the dependency lines (those with a colon ":") in the Makefile. To show this, I logged in to the cluster and showed the files supplied for HW 4 and 5, which include a Makefile and code split into files main.c and memory.c. I demonstrated "make" as well as "make clean". We will look at this more as part of the lab.
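A sketch of the parallel dot product described at the beginning of this summary, using MPI_Allreduce so that every process ends up with the result; the way the local pieces l_x and l_y are filled here is purely for illustration.

```c
/* Sketch: parallel dot product with MPI_Allreduce. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, i, n = 1024, l_n;
  double *l_x, *l_y, l_dot = 0.0, dot;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  l_n = n / np;                         /* assume np divides n for simplicity */
  l_x = (double *) malloc(l_n * sizeof(double));
  l_y = (double *) malloc(l_n * sizeof(double));

  /* fill the local pieces; the global index is id*l_n + i */
  for (i = 0; i < l_n; i++) {
    l_x[i] = 1.0;
    l_y[i] = (double) (id * l_n + i);
  }

  /* local partial dot product */
  for (i = 0; i < l_n; i++)
    l_dot += l_x[i] * l_y[i];

  /* sum the partial results; every process receives the global dot product */
  MPI_Allreduce(&l_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (id == 0)
    printf("dot = %24.16e\n", dot);

  free(l_x);
  free(l_y);
  MPI_Finalize();
  return 0;
}
```
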
MISTAKE ALERT: The examples of malloc() and calloc() in class
just say "x = malloc(...)". But malloc() and calloc()
return a pointer of the generic type void *, i.e., one that
does not point to a defined type. To use this for x, which is defined as
"double *x", we need to cast the output to this type, so
the example should read x = (double *) malloc(...). | |||
Lab 3 | Tu. 12/20/11 | Introduction to HW 4 and 5 on Power Method | |
HW 3 due | Room 2421 | HW 4 and 5; ver0.0suppliedfiles.tgz; class summary | |
We discussed the code contained in the ver0.0suppliedfiles
directory. There are two Matlab m-files; use driver_power.m to run an example of the power method for a small matrix (dimension n=4) in Matlab. This is what we want to translate into C in HW 5, namely to have a C function for the power method like my_power.m and the contents (in particular the output) of driver_power.m in the main() program in C.

There is the Makefile that controls the compilation and linking of the two C files main.c and memory.c. I showed how to systematically create a header file from a C file, using the example of a utilities.c, and I demonstrated how to compile one file at a time and then link, which is what the Makefile does. The executable is power, and the supplied file mpich2_pgi_power.qsub shows how to run this executable in the line that defines mpijob; the point of this demonstration is to show how to call power with command-line arguments.
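A sketch of the header-file pattern discussed here; the function shown is a made-up example, not the actual contents of utilities.c:

```c
/* utilities.c (sketch): function definitions */
#include "utilities.h"

double vector_sum(const double *x, int n)   /* hypothetical example function */
{
  int i;
  double s = 0.0;
  for (i = 0; i < n; i++)
    s += x[i];
  return s;
}
```

```c
/* utilities.h (sketch): one prototype per function defined in utilities.c,
   protected by an include guard */
#ifndef UTILITIES_H
#define UTILITIES_H

double vector_sum(const double *x, int n);

#endif
```
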
MISTAKE ALERT: The file
mpich2_pgi_power.qsub may have a mistake in it
(depending on which version I posted):
The mpijob line should read

SMALL TYPO in hw4and5.pdf: The text refers to the output file slurm.out; that should say qsub.out. | |||
Lecture 6 | We. 12/21/11 | Chapter 5: Collective Communication | |
Room 1403 | PDF transcript; video recording; class summary | ||
We covered the MPI commands of Chapter 5 in Pacheco
in the last two lectures. We used sensible examples
involving vectors and discussed how to create these in C.
Today's purpose is to discuss how to create matrices in C
and then use them in MPI commands.
My main point was that the C standard does not guarantee
anything about how the memory of a matrix in C is organized,
which is a problem for our purposes of parallel computing with MPI.
This is actually a problem for scientific computing in general,
since C does not really understand multi-dimensional arrays
at all, which makes it incompatible with other languages.
This shows up as an issue when you try to use existing libraries
such as BLAS or LAPACK.
To make our C code efficient and compatible, we will use
one-dimensional arrays only, with a column-major ordering of
the data, hence split a matrix A onto parallel processes as l_A
as blocks of columns to maintain the consecutiveness of the memory.
I explained this in hand-writing, see PDF transcript,
and then discussed the code supplied for HW 4 and 5,
which demonstrates this in the definition and example of l_A,
see main.c in the supplied files.

We used HW 3's trap code to demonstrate the need to include header files for C functions, using the example of pow() from math.h, and mentioned linking to the math library with -lm; see LDFLAGS in the Makefile in the files supplied for HW 4 and 5. I showed some of my solutions to HW 3 in terms of performance and accuracy. These aspects of the lecture were not videotaped. | |||
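A sketch of the storage convention described above: an n-by-n matrix A is allocated as a single one-dimensional array and entry A(i,j) (0-based) is accessed as A[i + j*n], so that each column is contiguous in memory; a block of columns stored locally as l_A then also occupies contiguous memory. The dimension and example values here are placeholders.

```c
/* Sketch: column-major storage of a matrix in a one-dimensional array. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int n = 4, i, j;
  /* allocate n*n doubles in one contiguous block */
  double *A = (double *) malloc(n * n * sizeof(double));

  /* A(i,j) is stored at A[i + j*n]: column j occupies A[j*n], ..., A[j*n + n-1] */
  for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
      A[i + j * n] = (double) (i == j);   /* identity matrix as an example */

  printf("A(1,1) = %f\n", A[1 + 1 * n]);

  free(A);
  return 0;
}
```
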
--- Christmas Break --- | |||
Lecture 7 | Mo. 01/09/12 | IEEE Standard for Floating-Point Arithmetic | |
Room 2404 | PDF transcript; video recording; class summary | ||
Handout: lecture copy of IEEE standard, Matlab examples, and
article on the Patriot missile. We covered the IEEE Standard 754 for binary floating-point arithmetic; more specifically, we covered its portion on number representation. It is implied that operations among such numbers are also of limited accuracy, which causes round-off errors of the same type as those caused by the representation itself. Since a number of students could not make it today, we covered this topic, which stands on its own and is not part of the ongoing development of MPI techniques. However, its ideas explain a number of observations that come up regarding the accuracy of the results of Homework 3 on the trapezoidal rule and which we will discuss in the next lab. | |||
Lab 4 | Tu. 01/10/12 | IEEE Standard for Floating-Point Arithmetic | |
HW 4 due | Room 2421 | class summary | |
We computed several numbers of the IEEE standard for
double-precision floating-point numbers concretely,
specifically, the largest number,
the smallest normalized number,
the smallest de-normalized number, and machine epsilon,
as well as the number x=0.1.
We looked at the Matlab commands eps, realmax, and realmin
and their documentation.
We obtained their binary representations in mathematical
notation, in terms of 64 bits, as well as the 16 hexadecimal
digits corresponding to these bits.
I demonstrated that Matlab can output this using "format hex"
and that we were able to predict this output. Finally, I did a little Matlab test in which you add h=0.1 ten times; notice that the final result is not 1, which you can see by subtracting 1 from the result. This shows the potential for accumulated round-off in repeated additions, even for a fairly large number like 0.1 added only a few times. One strategy to avoid this is to pick h as a computer number (e.g., here h=1/16). Another, more general one is to avoid repeated additions wherever possible. | |||
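The same quantities can also be inspected in C via float.h; here is a sketch (in C rather than Matlab) that additionally repeats the 0.1 experiment from the lab:

```c
/* Sketch: IEEE double-precision constants from float.h and the 0.1 experiment. */
#include <stdio.h>
#include <float.h>

int main(void)
{
  int i;
  double sum = 0.0, h = 0.1;

  printf("largest double       = %24.16e\n", DBL_MAX);
  printf("smallest normalized  = %24.16e\n", DBL_MIN);
  printf("machine epsilon      = %24.16e\n", DBL_EPSILON);

  /* add h = 0.1 ten times; the result is not exactly 1 */
  for (i = 0; i < 10; i++)
    sum += h;
  printf("sum of ten 0.1       = %24.16e\n", sum);
  printf("sum - 1              = %24.16e\n", sum - 1.0);
  return 0;
}
```
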
Lecture 8 | We. 01/11/12 | Chapter 11: Performance | |
Room 1403 | PDF transcript; video recording; class summary | ||
We discussed a number of considerations and best practices that go into designing and executing meaningful performance studies for parallel code. This presentation also includes remarks that explain the contents of slides 64 to 67 of the research slides presented on the first day of the semester. | |||
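For reference, the standard definitions used in such studies: if T_p(N) denotes the wall clock time using p parallel processes for a problem of size N, then the speedup is S_p(N) = T_1(N) / T_p(N) and the efficiency is E_p(N) = S_p(N) / p, with optimal values S_p(N) = p and E_p(N) = 1. These are the textbook definitions; I believe they match the conventions used in HPCF-2010-2.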
Lecture 9 | Mo. 01/16/12 | The Library Packages BLAS and LAPACK | |
Room 2404 | PDF transcript; video recording; class summary | ||
I introduced the BLAS (Basic Linear Algebra Subprograms) today.
For one thing, they are historically significant
and present an example of black-box software that
one might download and use inside larger code.
But we also looked at them to motivate several features
that I stressed so far, such as the one-dimensional array
storage for matrices, which exactly interfaces with BLAS.
I also discussed the order of indexing, which needs to be
right to avoid a dramatic difference in performance;
we can see this by trying i-j vs. j-i ordering in the
homework on the power method, while in turn the BLAS
function for matrix-vector multiplication is faster still.
To demonstrate BLAS, we went to www.netlib.org
and 'browsed' for 'blas', then looked at the code for ddot.f
as example. In that, we saw the idea of loop unrolling,
which makes efficient use of the CPU's cache and
can make code much faster. I want to stress here that this was a lecture and looking at historic Fortran code should be viewed as looking at examples. In reality, you would use libraries supplied with or for your compiler, such as ACML with the Portland Group compiler suite. Their code has been, at a minimum, compiled with an optimizing compiler and with the particulars of your hardware in mind (e.g., cache size, machine epsilon), and might be different in code from the original Fortran. The files supplied with the next lab show how to access the ACML libraries. In the lab, we will also look at the actual documentation available on our system. | |||
Lab 5 | Tu. 01/17/12 | How to Use BLAS | |
HW 5 due | Room 2421 | HW 6; Makefile; utilities.c; utilities.h; class summary | |
The assignment, as stated, expects programming in C to
time four different ways to compute the matrix-matrix product
C=A*B. Our goal for the lab itself should be to solve
this problem on paper, that is, the point is to
find each appropriate BLAS function and learn how to
use it on paper. Some of these functions are more complicated
to use than the ddot() function from yesterday's class. I would like to demonstrate that you can proceed as follows. Example: Consider the goal of using BLAS for a matrix-vector product.
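As one illustration of how such a "paper" solution might translate into an actual call (this is a sketch assuming the standardized CBLAS interface from cblas.h; the interface of the library actually used on the cluster, e.g., ACML, may spell this differently), the BLAS2 routine dgemv computes y = alpha*A*x + beta*y:

```c
/* Sketch: matrix-vector product y = A*x via the BLAS2 routine dgemv,
   assuming a CBLAS interface (cblas.h) is available. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
  int n = 4, i, j;
  double *A = (double *) malloc(n * n * sizeof(double));   /* column-major */
  double *x = (double *) malloc(n * sizeof(double));
  double *y = (double *) malloc(n * sizeof(double));

  for (j = 0; j < n; j++) {
    x[j] = 1.0;
    for (i = 0; i < n; i++)
      A[i + j * n] = (i == j) ? 2.0 : 0.0;    /* example matrix 2*I */
  }

  /* y = 1.0 * A * x + 0.0 * y, with A stored column-major and lda = n */
  cblas_dgemv(CblasColMajor, CblasNoTrans, n, n, 1.0, A, n, x, 1, 0.0, y, 1);

  for (i = 0; i < n; i++)
    printf("y[%d] = %f\n", i, y[i]);

  free(A); free(x); free(y);
  return 0;
}
```
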
| |||
Lecture 10 | We. 01/18/12 | Class cancelled due to Vollversammlung | |
Lecture 11 | Mo. 01/23/12 | Chapter 6: Grouping Data for Communication | |
Room 2404 | PDF transcript; video recording; class summary | ||
After a couple of weeks on scientific computing (not necessarily
parallel), we returned to learning about MPI.
Derived datatypes can be used in MPI to simplify the
calling sequences of MPI communication commands.
We introduced these following the examples in Pacheco,
with only slight modifications and additional comments
to motivate them better. We will continue to make the commands discussed more applicable next time. | |||
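A sketch in the spirit of the Chapter 6 example in Pacheco: the three values a, b, n are bundled into one derived datatype so that a single MPI_Bcast suffices. The MPI-1 calls MPI_Type_struct and MPI_Address are used here, matching the book's vintage; newer MPI versions spell these MPI_Type_create_struct and MPI_Get_address. Function and variable names are assumptions.

```c
/* Sketch: build a derived datatype that bundles a, b, n and broadcast it
   with a single MPI_Bcast (in the spirit of Pacheco, Chapter 6). */
#include <mpi.h>

void Get_data(float *a, float *b, int *n, int my_rank)
{
  int          block_lengths[3] = {1, 1, 1};
  MPI_Datatype typelist[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
  MPI_Aint     displacements[3], start_address, address;
  MPI_Datatype mesg_mpi_t;

  if (my_rank == 0) {
    /* Process 0 would obtain a, b, n here (e.g., from the command line) */
  }

  /* displacements of b and n relative to a, computed on each process */
  MPI_Address(a, &start_address);
  displacements[0] = 0;
  MPI_Address(b, &address);
  displacements[1] = address - start_address;
  MPI_Address(n, &address);
  displacements[2] = address - start_address;

  MPI_Type_struct(3, block_lengths, displacements, typelist, &mesg_mpi_t);
  MPI_Type_commit(&mesg_mpi_t);

  /* one collective call sends a, b, and n together from Process 0 */
  MPI_Bcast(a, 1, mesg_mpi_t, 0, MPI_COMM_WORLD);

  MPI_Type_free(&mesg_mpi_t);
}
```
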
Lab 6 | Tu. 01/24/12 | Use of BLAS2 and BLAS3 | |
HW 6 due | Room 2421 | HW 7; class summary | |
By working on Problem 1 of HW 7, we in effect continued from last time. We talked about solutions with naive code and BLAS1, then developed BLAS3 and BLAS2 solutions. I pointed out the strategy of working on BLAS3 first, since that is in fact easier than BLAS2. Then I showed my solution. We observed that BLAS3 is a factor of 8-10 faster than BLAS2, which in turn is the same factor faster than BLAS1. Naive code can behave very differently, depending on how the for loops are ordered. I suggested to really formulate the lessons learned from Problem 1 before working on Problem 2. | |||
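A sketch of the loop-ordering point for the naive code: with column-major storage, the innermost loop should run down a column (over i), so that A and C are accessed contiguously; an ordering with i outermost instead strides through memory. The matrices are assumed to be stored as one-dimensional arrays as in the homework.

```c
/* Sketch: naive C = A*B for n-by-n matrices in column-major storage,
   with the loops ordered so that the innermost index runs down columns. */
void matmat_naive(const double *A, const double *B, double *C, int n)
{
  int i, j, k;

  for (j = 0; j < n; j++)             /* initialize C to zero */
    for (i = 0; i < n; i++)
      C[i + j * n] = 0.0;

  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
      double b_kj = B[k + j * n];
      for (i = 0; i < n; i++)         /* contiguous access to A(:,k) and C(:,j) */
        C[i + j * n] += A[i + k * n] * b_kj;
    }
}
```
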
We. 01/25/12 | |||
Room 1403 | Class cancelled due to field trip conflict for several students | ||
Lecture 12 | Mo. 01/30/12 | Chapter 6: Grouping Data for Communication | |
Room 2404 | PDF transcript; video recording; class summary | ||
We continued the material from last time, but this time applied to the data structure of one-dimensional arrays with column-oriented storage for matrices. This was useful for thinking through the issues of a derived data type again, and then for realizing facts about them and the MPI_Scatter function. This led to the solution shown in Pacheco in Sec. 8.4.5. | |||
Lab 7 | Tu. 01/31/12 | Discussion of HW 3 on the Trapezoidal Rule | |
HW 7 due | Room 2421 | class summary | |
We discussed the solution to Homework 3 on the trapezoidal rule.
Specifically, I showed the actual C code that implements
the load-balancing of splitting n trapezoids onto np processes,
as explained in class yesterday. I continued by showing
the rest of the code, for instance, test output
that lists this splitting in a nice and readable form. Then I showed computed results of the error for several values of n=1e3, 1e4, ..., 1e9 and essentially all possible numbers of parallel processes in each case. We learned what it might look like when results are not reliable. We discussed the influence of single- vs. double-precision, accumulated round-off, and the most accurate implementation of the trapezoidal rule. We ended by looking at my Matlab code that collects the data from output files, organizes them into tables of wall clock times, speedup, and efficiency, and plots speedup and efficiency. | |||
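The core of such a load-balancing computation can be sketched in a few lines (a fragment, with id the process rank and np the number of processes; the first n % np processes get one extra trapezoid):

```c
/* Sketch: split n trapezoids onto np processes as evenly as possible. */
int local_n, offset, remainder;

remainder = n % np;
local_n   = n / np + (id < remainder ? 1 : 0);     /* trapezoids on this process */

/* global index of this process's first trapezoid */
offset = id * (n / np) + (id < remainder ? id : remainder);
```
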
Lecture 13 | We. 02/01/12 | Chapter 7: Communicators and Topologies | |
Room 1403 | PDF transcript; video recording; class summary | ||
We introduced the MPI commands for creating a communicator from a subgroup of the processes of a given communicator and for subdividing a communicator into several communicators. We discussed in particular the commands needed for the special case of a Cartesian communicator grid. We will use those for an implementation of Fox's algorithm next time. | |||
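A sketch of the Cartesian-grid calls we will need for Fox's algorithm next time: a q-by-q process grid with row and column communicators, where q is assumed to be the square root of the number of processes (names and structure are illustrative, not the code from class).

```c
/* Sketch: create a q x q Cartesian grid communicator and split it into
   row and column communicators, as needed for Fox's algorithm. */
#include <math.h>
#include <mpi.h>

void setup_grid(MPI_Comm *grid_comm, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
  int np, q, grid_rank;
  int dims[2], periods[2], coords[2], remain_dims[2];

  MPI_Comm_size(MPI_COMM_WORLD, &np);
  q = (int) (sqrt((double) np) + 0.5);      /* assume np is a perfect square */

  dims[0] = dims[1] = q;
  periods[0] = periods[1] = 1;              /* wraparound, used for shifting */
  MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, grid_comm);

  /* (coords[0], coords[1]) are this process's row and column in the grid */
  MPI_Comm_rank(*grid_comm, &grid_rank);
  MPI_Cart_coords(*grid_comm, grid_rank, 2, coords);

  remain_dims[0] = 0;  remain_dims[1] = 1;  /* keep dimension 1: same grid row */
  MPI_Cart_sub(*grid_comm, remain_dims, row_comm);

  remain_dims[0] = 1;  remain_dims[1] = 0;  /* keep dimension 0: same grid column */
  MPI_Cart_sub(*grid_comm, remain_dims, col_comm);
}
```
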
Lecture 14 | Mo. 02/06/12 | Chapter 10: Design and Coding of Parallel Programs (Sorting) | |
Room 2404 | PDF transcript; video recording; class summary | ||
We discussed the sorting example in Chapter 10.
The narrow purpose is to show the process of designing
MPI code and to introduce a new MPI command MPI_Alltoall.
The broader purposes include introducing the v-versions
of many MPI commands and pointing out how to find
MPI commands for particular purposes when needed.
We used a version that is simpler (since no dynamic memory
allocation is needed inside the algorithm); see the note at the end
of class. Notice that we interrupted the presentation of Chapter 7 with this topic. The reason is that the code for Fox's algorithm uses some of the spirit presented here as well as some pieces from the lab tomorrow. | |||
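A minimal sketch of MPI_Alltoall, the new command from this chapter: every process sends one distinct block to every other process (block size 1 here for simplicity; the v-version MPI_Alltoallv allows varying block sizes, as needed in the sorting code).

```c
/* Sketch: MPI_Alltoall with one integer per destination process. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int id, np, i, *sendbuf, *recvbuf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  sendbuf = (int *) malloc(np * sizeof(int));
  recvbuf = (int *) malloc(np * sizeof(int));

  for (i = 0; i < np; i++)
    sendbuf[i] = 100 * id + i;       /* block i goes to process i */

  /* block i of sendbuf on process id arrives as block id of recvbuf on process i */
  MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

  printf("Process %d received:", id);
  for (i = 0; i < np; i++)
    printf(" %d", recvbuf[i]);
  printf("\n");

  free(sendbuf); free(recvbuf);
  MPI_Finalize();
  return 0;
}
```
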
Lab 8 | Tu. 02/07/12 | Discussion of HW 2 on Ringsend | |
Room 2421 | class summary | ||
We discussed Problem 4 of HW 2 from earlier in the semester. First, I feel that it needs to be stated clearly what we are asked to do: if we are Process id, then we are to send a message to the process with the next higher id number, and to Process 0 if we are the process with the highest id number. This latter fact is expressed in Pacheco by saying (id+1)%np with the mod operation, instead of just saying id+1, which would leave open what to do for the process with the highest number id=np-1. This problem is known as a ringsend, and like the load-balancing code, this is useful code to have tested carefully for use in other problems, like Fox's algorithm in Ch. 7.

Second, we jointly developed a solution on the board; the key is merely to design code for the destination dest in the MPI_Send call and the source in MPI_Recv. I then proceeded to show several versions of my solution. It brings out that apparently working code might not actually be correct. I used this to motivate a number of lessons learned:
| |||
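As a small illustration of the destination/source arithmetic at the heart of the ringsend (a fragment, not the solution code shown in class; id and np are assumed to hold the rank and number of processes):

```c
/* Sketch: ringsend indices for process id out of np processes. */
int msg_out = id, msg_in;
MPI_Status status;

int dest   = (id + 1) % np;        /* next process; process np-1 wraps around to 0 */
int source = (id - 1 + np) % np;   /* previous process; process 0 receives from np-1 */

/* Note: a blocking MPI_Send before MPI_Recv on every process relies on
   buffering and can deadlock for large messages -- one of the pitfalls
   a problem like this can expose. */
MPI_Send(&msg_out, 1, MPI_INT, dest,   0, MPI_COMM_WORLD);
MPI_Recv(&msg_in,  1, MPI_INT, source, 0, MPI_COMM_WORLD, &status);
```
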
Lecture 15 | We. 02/08/12 | Chapter 7: Fox's Algorithm | |
Room 1403 | PDF transcript; video recording; class summary | ||
We discussed the idea and implementation of Fox's algorithm. This is an example of two things: a block-matrix-based calculation, like those inside BLAS and similar functions, and an application of a Cartesian grid communicator. Notice that this is also the first time that our MPI commands do not use MPI_COMM_WORLD, but rather row_comm and col_comm. As the note at the end clarifies, my version of the code differs from Pacheco's. | |||
Lab 9 | Fr. 02/10/12 | Discussion of HWs 4 and 5 on the Power Method | |
Room 2421 | driver_plot.m; class summary | ||
This lab was a make-up for the class cancellation on 01/25/12. I presented the thinking behind how you design the algorithm for the matrix-vector product that is at the heart of the algorithm. Then we looked at its implementation as well as the main program, the utility functions, and the function for the power method.

We looked at my results for various versions of the algorithm. We started with results for an inefficient ordering of the i-j loops in the local matrix-vector product. This made an order of magnitude difference in the timings. The speedup and efficiency plots were better than optimal, which flags the case of an inefficient one-process run. Then we looked at several other versions of the code, which for instance saved one or two of the MPI commands per iteration. We also looked at a version using BLAS and the MPI_Reduce_scatter command. We then compared to results obtained on the cluster tara in the UMBC High Performance Computing Facility (www.umbc.edu/hpcf). That cluster has a network with lower latency, which was visible in the results by scaling to larger numbers of processes. I finally showed how I had used a Matlab code to assemble the results and produce the plots; see the attached file driver_plot.m. | |||